Note bundled flash-linear-attention kernels for gated-deltanet models#6850

Open
danielhanchen wants to merge 3 commits into
mainfrom
fla-gated-deltanet-advisory

@danielhanchen danielhanchen commented Jul 3, 2026

Summary

Gated-deltanet models (Qwen3.5, Qwen3.6, Qwen3-Next, Kimi-Linear) run their linear-attention layers on flash-linear-attention (fla) Triton kernels when available, and otherwise fall back to a much slower pure PyTorch recurrence. On Qwen3.6-35B-A3B 4-bit at 4k, the fallback measured 978 tok/s against 5573 tok/s with the kernels, a 5.7x gap.

unsloth-zoo now bundles those kernels (see unslothai/unsloth-zoo#865) and injects them automatically, so no pip install flash-linear-attention is required. This updates the loader note accordingly.

Behavior

_maybe_advise_fla_install now keys on is_flash_linear_attention_available() (which the bundled injection sets true) rather than the mere presence of an fla install. As a result it stays silent whenever the fast kernels are active, and prints a one-time note only when a gated-deltanet model is loaded and the bundled kernels could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3), which is exactly when transformers uses the slow pure PyTorch path. The message no longer tells users to install anything, since the kernels ship with Unsloth.
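
The gating described above can be sketched roughly as follows. This is an illustration, not the code in this PR: `maybe_advise`, `PREFIXES`, the note text, and the injected `fast_kernels_available` callable are all hypothetical, and the prefix values are assumed. In the loader, the availability check is `is_flash_linear_attention_available()`.

```python
from typing import Callable, Iterable, Optional

# Assumed prefixes for illustration only; the real FLA_MODEL_TYPE_PREFIXES
# tuple is not shown in this thread.
PREFIXES = ("qwen3_next", "qwen3_5")

_advised = False  # module-level flag so the note is produced at most once


def maybe_advise(
    model_types: Iterable[str],
    fast_kernels_available: Callable[[], bool],
) -> Optional[str]:
    """Return the note text the first time it applies, otherwise None."""
    global _advised
    if _advised:
        return None
    if not any(t.startswith(PREFIXES) for t in model_types):
        return None  # not a gated-deltanet model, nothing to say
    if fast_kernels_available():
        return None  # fast kernels are active, stay silent
    _advised = True
    # Hypothetical wording; the actual message is not quoted in this thread.
    return "Fast gated-deltanet kernels could not be enabled; using the slower PyTorch path."
```

Passing the availability check in as a callable keeps the sketch runnable without transformers installed; a real helper would call the library function directly.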

Notes

Requires an unsloth-zoo build that includes the bundled kernels (unslothai/unsloth-zoo#865); on older unsloth-zoo without them, is_flash_linear_attention_available() is false unless the user installed fla, so the note still fires correctly.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a one-time advisory warning (_maybe_advise_fla_install) when loading gated-deltanet models without the flash-linear-attention (fla) package installed. The review feedback suggests making this helper function more robust by handling edge cases where model_types is None, empty, or a single string, and recommends calling it earlier in FastLanguageModel.from_pretrained to ensure it triggers across all loading paths.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread unsloth/models/loader.py
Comment on lines +212 to +225
def _maybe_advise_fla_install(model_types):
    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
    global _fla_advised
    if _fla_advised:
        return
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        if importlib.util.find_spec("fla") is not None:
            return
    except Exception:
        return

medium

To make _maybe_advise_fla_install more robust and defensive, we should handle cases where model_types is None, empty, or a single string (which would otherwise cause character-by-character iteration and fail to match).

Additionally, to ensure this advisory is triggered regardless of the loading path, consider calling _maybe_advise_fla_install(model_types) in FastLanguageModel.from_pretrained (around line 620) right after model_types is resolved.

def _maybe_advise_fla_install(model_types):
    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
    global _fla_advised
    if _fla_advised or not model_types:
        return
    if isinstance(model_types, str):
        model_types = [model_types]
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        if importlib.util.find_spec("fla") is not None:
            return
    except Exception:
        return
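
A small sketch of why the lone-string case matters. Iterating a `str` yields single characters, so the prefix check never matches and the helper returns without printing anything. `PREFIXES`, `matches_naive`, and `normalize` are hypothetical names, and the prefix values are assumed for illustration.

```python
from typing import List

# Assumed prefixes; the real FLA_MODEL_TYPE_PREFIXES is not shown in this thread.
PREFIXES = ("qwen3_next", "qwen3_5")


def matches_naive(model_types) -> bool:
    """Prefix check as written in the reviewed snippet, with no normalization."""
    return any(isinstance(t, str) and t.startswith(PREFIXES) for t in model_types)


def normalize(model_types) -> List[str]:
    """Turn None, an empty value, or a single string into a list of strings."""
    if not model_types:
        return []
    if isinstance(model_types, str):
        return [model_types]
    return list(model_types)


# A lone string is iterated character by character, so nothing matches.
print(matches_naive("qwen3_next"))             # False
print(matches_naive(normalize("qwen3_next")))  # True
print(matches_naive(normalize(None)))          # False, and no TypeError
```

Without normalization, `None` would raise `TypeError` inside `any(...)`, which the surrounding `except Exception` would swallow.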

…dels

Unsloth Zoo now bundles the flash-linear-attention (fla) gated-delta Triton
kernels and injects them automatically, so gated-deltanet models (Qwen3-Next,
Qwen3.5, Kimi-Linear) get the fast path with no pip install. Replace the old
install advisory with a one-time note that fires only when the bundled kernels
could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3),
i.e. exactly when transformers falls back to the slow pure PyTorch path.
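
The setup conditions named in this commit message (CUDA present, torch >= 2.7, triton >= 3.3) could be checked along these lines. This is only a sketch: `major_minor` and `bundled_kernels_supported` are hypothetical helpers, not functions from unsloth-zoo, and the real injection logic may test more than this.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.7.1+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def bundled_kernels_supported(
    has_cuda: bool, torch_version: str, triton_version: str
) -> bool:
    """Apply the thresholds from the commit message; tuple comparison avoids
    the string-comparison trap where '2.10' sorts before '2.7'."""
    return bool(
        has_cuda
        and major_minor(torch_version) >= (2, 7)
        and major_minor(triton_version) >= (3, 3)
    )
```

Comparing integer tuples rather than raw strings is the design point here: a plain string comparison would rank torch 2.10 below 2.7.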
@danielhanchen danielhanchen force-pushed the fla-gated-deltanet-advisory branch from 6cccff8 to 63590c7 July 4, 2026 05:01
@danielhanchen danielhanchen changed the title from "Advise installing flash-linear-attention for gated-deltanet models" to "Note bundled flash-linear-attention kernels for gated-deltanet models" Jul 4, 2026
@danielhanchen

@codex review

@danielhanchen

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to warn users when loading gated-deltanet models without the fast flash-linear-attention (fla) kernels enabled. Feedback on the implementation highlights that the broad exception handling in _maybe_advise_fla_install could silently suppress the warning if the import of is_flash_linear_attention_available fails. It is recommended to isolate the import check, catch specific exceptions like ModuleNotFoundError, and log debug information instead of silently returning.

Comment thread unsloth/models/loader.py
Comment on lines +225 to +234
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        from transformers.utils.import_utils import is_flash_linear_attention_available
        if is_flash_linear_attention_available():
            return  # bundled (or user-installed) fast kernels are active
    except Exception:
        return

medium

If is_flash_linear_attention_available fails to import (for example, on older transformers versions or if there is an ImportError), the entire try block raises an exception, which is caught by except Exception: and silently returns. This prevents the warning from being printed, even though the fast kernels are not active and the model will run on the slower pure PyTorch path.

To make this more robust, we should isolate the import and check of is_flash_linear_attention_available so that any failure/absence of the import is treated as fla_available = False, allowing the warning to be printed correctly. Additionally, we should avoid broad, silent exception handlers (like except Exception:) and instead catch specific exceptions (like ModuleNotFoundError for optional imports) and log them at a debug level to aid in troubleshooting.

Suggested change
-    try:
-        if not any(
-            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
-        ):
-            return
-        from transformers.utils.import_utils import is_flash_linear_attention_available
-        if is_flash_linear_attention_available():
-            return  # bundled (or user-installed) fast kernels are active
-    except Exception:
-        return
+    try:
+        if not any(
+            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
+        ):
+            return
+    except Exception as e:
+        import logging
+        logging.getLogger(__name__).debug("Error checking model types: %s", e)
+        return
+    try:
+        from transformers.utils.import_utils import is_flash_linear_attention_available
+        fla_available = is_flash_linear_attention_available()
+    except ModuleNotFoundError as e:
+        import logging
+        logging.getLogger(__name__).debug("transformers.utils.import_utils or is_flash_linear_attention_available not found: %s", e)
+        fla_available = False
+    if fla_available:
+        return  # bundled (or user-installed) fast kernels are active
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
  2. When catching an ImportError for an optional dependency, prefer catching the more specific ModuleNotFoundError and check the module name to avoid suppressing unrelated import errors.
  3. When handling exceptions, avoid broad except Exception: pass clauses. Instead, catch specific exceptions and log them (at least at a debug level) to aid in troubleshooting. If a failure is expected, log the specific exception type and its details.
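
One point worth checking against the suggested change: in Python, a missing module raises `ModuleNotFoundError`, but a missing name inside an existing module raises plain `ImportError`, which `except ModuleNotFoundError` does not catch. The sketch below demonstrates the difference with the standard library only; `failure_kind` and the module and attribute names passed to it are made up for illustration.

```python
def failure_kind(statement: str) -> str:
    """Run an import statement and report which exception class it raises."""
    try:
        exec(statement, {})
    except ModuleNotFoundError:
        # Subclass of ImportError, so it must be tested first.
        return "ModuleNotFoundError"
    except ImportError:
        return "ImportError"
    return "ok"


# The module does not exist at all.
print(failure_kind("import definitely_not_a_real_module_xyz"))
# The module exists, but the requested name does not.
print(failure_kind("from math import definitely_not_a_real_name"))
# Both the module and the name exist.
print(failure_kind("from math import sqrt"))
```

If the goal is to treat an older transformers release without `is_flash_linear_attention_available` as "not available", catching `ImportError` would cover both cases, since `ModuleNotFoundError` is a subclass of it.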

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Nice work!

Reviewed commit: 63590c7c9a

@danielhanchen

Normalized model_types so a None or a lone string is handled before the prefix check.
