Note bundled flash-linear-attention kernels for gated-deltanet models by danielhanchen · Pull Request #6850 · unslothai/unsloth

danielhanchen · 2026-07-03T18:41:15Z

Summary

Gated-deltanet models (Qwen3.5, Qwen3.6, Qwen3-Next, kimi-linear) run their linear-attention layers on flash-linear-attention (fla) Triton kernels when available, and otherwise fall back to a much slower pure PyTorch recurrence: 978 tok/s on the fallback versus 5573 tok/s with the kernels, measured on Qwen3.6-35B-A3B 4-bit at 4k, a 5.7x gap.

unsloth-zoo now bundles those kernels (see unslothai/unsloth-zoo#865) and injects them automatically, so no pip install flash-linear-attention is required. This updates the loader note accordingly.

Behavior

_maybe_advise_fla_install now keys on is_flash_linear_attention_available() (which the bundled injection sets true) rather than the mere presence of an fla install. As a result it stays silent whenever the fast kernels are active, and prints a one-time note only when a gated-deltanet model is loaded and the bundled kernels could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3), which is exactly when transformers uses the slow pure PyTorch path. The message no longer tells users to install anything, since the kernels ship with Unsloth.
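The gating described above can be sketched roughly as follows. This is an illustration, not the code in the PR: the function name `maybe_note_slow_path`, the `emit` parameter, the message text, and the prefix values are all assumptions, and the availability check is passed in as a callable so the sketch runs without transformers installed. It also assumes `model_types` is already a list of strings.

```python
# Hypothetical prefix values; the real FLA_MODEL_TYPE_PREFIXES may differ.
FLA_MODEL_TYPE_PREFIXES = ("qwen3_next", "qwen3_5", "kimi_linear")

_fla_advised = False  # module-level flag so the note is printed at most once


def maybe_note_slow_path(model_types, kernels_available, emit=print):
    """Emit a one-time note when a gated-deltanet model lacks the fast kernels.

    Returns True only on the call that actually emitted the note.
    """
    global _fla_advised
    if _fla_advised:
        return False
    # Only gated-deltanet model types are relevant.
    if not any(t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types):
        return False
    # Fast kernels active (bundled or user-installed): stay silent.
    if kernels_available():
        return False
    _fla_advised = True
    emit("Note: fast linear-attention kernels are not active; using the slow PyTorch path.")
    return True
```

In real use, `kernels_available` would be something like `is_flash_linear_attention_available`; keeping it as a parameter makes the silent and noisy branches easy to exercise separately.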

Notes

Requires an unsloth-zoo build that includes the bundled kernels (unslothai/unsloth-zoo#865); on older unsloth-zoo without them, is_flash_linear_attention_available() is false unless the user installed fla, so the note still fires correctly.
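For reference, the environment condition stated under Behavior (no CUDA, or torch < 2.7 / triton < 3.3) can be expressed as a small predicate. This is a sketch under assumptions: how unsloth-zoo actually performs the check is not shown in this PR, and `bundled_kernels_supported` and `_major_minor` are hypothetical names.

```python
import re


def _major_minor(version):
    """Parse the leading 'MAJOR.MINOR' of a version string such as '2.7.1+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def bundled_kernels_supported(cuda_available, torch_version, triton_version):
    """Return True when CUDA is present, torch >= 2.7, and triton >= 3.3."""
    if not cuda_available or triton_version is None:
        return False
    # Compare integer tuples, not strings, so that 2.10 sorts after 2.7.
    return (
        _major_minor(torch_version) >= (2, 7)
        and _major_minor(triton_version) >= (3, 3)
    )
```

Comparing integer tuples avoids the classic pitfall where the string "2.10" sorts before "2.7".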

chatgpt-codex-connector · 2026-07-03T18:41:20Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

gemini-code-assist

Code Review

This pull request introduces a one-time advisory warning (_maybe_advise_fla_install) when loading gated-deltanet models without the flash-linear-attention (fla) package installed. The review feedback suggests making this helper function more robust by handling edge cases where model_types is None, empty, or a single string, and recommends calling it earlier in FastLanguageModel.from_pretrained to ensure it triggers across all loading paths.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-07-03T18:42:46Z

+def _maybe_advise_fla_install(model_types):
+    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
+    global _fla_advised
+    if _fla_advised:
+        return
+    try:
+        if not any(
+            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
+        ):
+            return
+        if importlib.util.find_spec("fla") is not None:
+            return
+    except Exception:
+        return


To make _maybe_advise_fla_install more robust and defensive, we should handle cases where model_types is None, empty, or a single string (which would otherwise cause character-by-character iteration and fail to match).

Additionally, to ensure this advisory is triggered regardless of the loading path, consider calling _maybe_advise_fla_install(model_types) in FastLanguageModel.from_pretrained (around line 620) right after model_types is resolved.

def _maybe_advise_fla_install(model_types):
    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
    global _fla_advised
    if _fla_advised or not model_types:
        return
    if isinstance(model_types, str):
        model_types = [model_types]
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        if importlib.util.find_spec("fla") is not None:
            return
    except Exception:
        return
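To make the single-string concern concrete, here is a small standalone demonstration. The prefix values and the helper names `matches_any_prefix` and `normalize_model_types` are hypothetical, chosen only for illustration.

```python
# Hypothetical prefix values, for illustration only.
PREFIXES = ("qwen3_next", "qwen3_5", "kimi_linear")


def matches_any_prefix(model_types):
    """Same prefix test as in the helper above, without any normalization."""
    return any(isinstance(t, str) and t.startswith(PREFIXES) for t in model_types)


def normalize_model_types(model_types):
    """Turn None, an empty value, or a lone string into a list of model types."""
    if not model_types:
        return []
    if isinstance(model_types, str):
        return [model_types]
    return list(model_types)


bare = "qwen3_next"
# Iterating a str yields single characters ('q', 'w', ...), none of which
# starts with a full prefix, so the match silently fails.
print(matches_any_prefix(bare))                         # False
print(matches_any_prefix(normalize_model_types(bare)))  # True
```

Without normalization, passing `None` would instead raise a `TypeError` inside `any(...)`, which the original `except Exception` would swallow, so both edge cases end in a silent return.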

…dels Unsloth Zoo now bundles the flash-linear-attention (fla) gated-delta Triton kernels and injects them automatically, so gated-deltanet models (Qwen3-Next, Qwen3.5, Kimi-Linear) get the fast path with no pip install. Replace the old install advisory with a one-time note that fires only when the bundled kernels could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3), i.e. exactly when transformers falls back to the slow pure PyTorch path.

danielhanchen · 2026-07-04T12:51:10Z

@codex review

danielhanchen · 2026-07-04T12:51:11Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a mechanism to warn users when loading gated-deltanet models without the fast flash-linear-attention (fla) kernels enabled. Feedback on the implementation highlights that the broad exception handling in _maybe_advise_fla_install could silently suppress the warning if the import of is_flash_linear_attention_available fails. It is recommended to isolate the import check, catch specific exceptions like ModuleNotFoundError, and log debug information instead of silently returning.

gemini-code-assist · 2026-07-04T12:52:24Z

+    try:
+        if not any(
+            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
+        ):
+            return
+        from transformers.utils.import_utils import is_flash_linear_attention_available
+        if is_flash_linear_attention_available():
+            return  # bundled (or user-installed) fast kernels are active
+    except Exception:
+        return


If is_flash_linear_attention_available fails to import (for example, on older transformers versions or if there is an ImportError), the entire try block raises an exception, which is caught by except Exception: and silently returns. This prevents the warning from being printed, even though the fast kernels are not active and the model will run on the slower pure PyTorch path.

To make this more robust, we should isolate the import and check of is_flash_linear_attention_available so that any failure/absence of the import is treated as fla_available = False, allowing the warning to be printed correctly. Additionally, we should avoid broad, silent exception handlers (like except Exception:) and instead catch specific exceptions (like ModuleNotFoundError for optional imports) and log them at a debug level to aid in troubleshooting.

Suggested change

-    try:
-        if not any(
-            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
-        ):
-            return
-        from transformers.utils.import_utils import is_flash_linear_attention_available
-        if is_flash_linear_attention_available():
-            return  # bundled (or user-installed) fast kernels are active
-    except Exception:
-        return
+    try:
+        if not any(
+            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
+        ):
+            return
+    except Exception as e:
+        import logging
+        logging.getLogger(__name__).debug("Error checking model types: %s", e)
+        return
+    try:
+        from transformers.utils.import_utils import is_flash_linear_attention_available
+        fla_available = is_flash_linear_attention_available()
+    except ModuleNotFoundError as e:
+        import logging
+        logging.getLogger(__name__).debug("transformers.utils.import_utils or is_flash_linear_attention_available not found: %s", e)
+        fla_available = False
+    if fla_available:
+        return  # bundled (or user-installed) fast kernels are active

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

When catching an ImportError for an optional dependency, prefer catching the more specific ModuleNotFoundError and check the module name to avoid suppressing unrelated import errors.

When handling exceptions, avoid broad except Exception: pass clauses. Instead, catch specific exceptions and log them (at least at a debug level) to aid in troubleshooting. If a failure is expected, log the specific exception type and its details.
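One way to isolate the availability probe, sketched generically so it runs without transformers installed. The helper name `probe_availability` is hypothetical. It catches `ImportError` rather than only `ModuleNotFoundError` because a module that exists but lacks the requested name raises plain `ImportError` under `from module import name` (or `AttributeError` under `getattr`, as used here); `ModuleNotFoundError` is a subclass of `ImportError`, so it is covered as well.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def probe_availability(module_name, func_name):
    """Return True only if module_name.func_name can be imported and returns truthy.

    A missing module or missing function counts as "not available" and is
    logged at debug level instead of being silently swallowed.
    """
    try:
        module = importlib.import_module(module_name)
        check = getattr(module, func_name)
    except (ImportError, AttributeError) as exc:
        logger.debug("%s.%s not importable: %s", module_name, func_name, exc)
        return False
    return bool(check())


# Intended use (not executed here, since it requires transformers):
# fla_available = probe_availability(
#     "transformers.utils.import_utils", "is_flash_linear_attention_available"
# )
```

With this shape, a failed import yields `False`, so the caller would go on to print the note instead of returning early.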

chatgpt-codex-connector · 2026-07-04T12:54:43Z

Codex Review: Didn't find any major issues. Nice work!

Reviewed commit: 63590c7c9a

danielhanchen · 2026-07-04T13:45:39Z

Normalized model_types so a None or a lone string is handled before the prefix check.

gemini-code-assist Bot reviewed Jul 3, 2026

danielhanchen force-pushed the fla-gated-deltanet-advisory branch from 6cccff8 to 63590c7 on July 4, 2026 05:01

danielhanchen changed the title ~~Advise installing flash-linear-attention for gated-deltanet models~~ Note bundled flash-linear-attention kernels for gated-deltanet models Jul 4, 2026

gemini-code-assist Bot reviewed Jul 4, 2026

danielhanchen added 2 commits July 4, 2026 12:57

Tighten comments

531a191

Normalize model_types in fla install advisory for None and single string

b47f806

danielhanchen wants to merge 3 commits into main from fla-gated-deltanet-advisory
