Note bundled flash-linear-attention kernels for gated-deltanet models #6850
danielhanchen wants to merge 3 commits into
Code Review
This pull request introduces a one-time advisory warning (_maybe_advise_fla_install) when loading gated-deltanet models without the flash-linear-attention (fla) package installed. The review feedback suggests making this helper function more robust by handling edge cases where model_types is None, empty, or a single string, and recommends calling it earlier in FastLanguageModel.from_pretrained to ensure it triggers across all loading paths.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
def _maybe_advise_fla_install(model_types):
    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
    global _fla_advised
    if _fla_advised:
        return
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        if importlib.util.find_spec("fla") is not None:
            return
    except Exception:
        return
To make _maybe_advise_fla_install more robust and defensive, we should handle cases where model_types is None, empty, or a single string (which would otherwise cause character-by-character iteration and fail to match).
Additionally, to ensure this advisory is triggered regardless of the loading path, consider calling _maybe_advise_fla_install(model_types) in FastLanguageModel.from_pretrained (around line 620) right after model_types is resolved.
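The single-string case can be shown in isolation. The sketch below is illustrative only: the prefix tuple is a hypothetical stand-in for the PR's FLA_MODEL_TYPE_PREFIXES, and both helper names are invented here. It compares the unguarded check with a normalized one.

```python
# Hypothetical prefixes for illustration; the real FLA_MODEL_TYPE_PREFIXES
# is defined in the PR and may differ.
FLA_MODEL_TYPE_PREFIXES = ("qwen3_next", "kimi_linear")


def naive_match(model_types):
    """Unguarded check: iterates whatever it is given."""
    return any(
        isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES)
        for t in model_types
    )


def normalized_match(model_types):
    """Guarded check: handles None, empty values, and a bare string."""
    if not model_types:  # None, "", [], ()
        return False
    if isinstance(model_types, str):
        # Wrap a bare string so it is not iterated character by character.
        model_types = [model_types]
    return any(
        isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES)
        for t in model_types
    )


# A bare string is iterated as single characters, so no prefix ever matches.
print(naive_match("qwen3_next"))       # False
print(normalized_match("qwen3_next"))  # True
print(normalized_match(None))          # False
```

In the original function the None case raises a TypeError that the surrounding except Exception swallows, so the visible symptom in both edge cases is the same: the advisory silently never prints.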
def _maybe_advise_fla_install(model_types):
    """Print a one-time advisory when a gated-deltanet model is loaded without fla."""
    global _fla_advised
    if _fla_advised or not model_types:
        return
    if isinstance(model_types, str):
        model_types = [model_types]
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        if importlib.util.find_spec("fla") is not None:
            return
    except Exception:
        return

…dels
Unsloth Zoo now bundles the flash-linear-attention (fla) gated-delta Triton kernels and injects them automatically, so gated-deltanet models (Qwen3-Next, Qwen3.5, Kimi-Linear) get the fast path with no pip install. Replace the old install advisory with a one-time note that fires only when the bundled kernels could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3), i.e. exactly when transformers falls back to the slow pure PyTorch path.
6cccff8 to 63590c7
@codex review
/gemini review
Code Review
This pull request introduces a mechanism to warn users when loading gated-deltanet models without the fast flash-linear-attention (fla) kernels enabled. Feedback on the implementation highlights that the broad exception handling in _maybe_advise_fla_install could silently suppress the warning if the import of is_flash_linear_attention_available fails. It is recommended to isolate the import check, catch specific exceptions like ModuleNotFoundError, and log debug information instead of silently returning.
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
        from transformers.utils.import_utils import is_flash_linear_attention_available
        if is_flash_linear_attention_available():
            return  # bundled (or user-installed) fast kernels are active
    except Exception:
        return
If is_flash_linear_attention_available fails to import (for example, on older transformers versions or if there is an ImportError), the entire try block raises an exception, which is caught by except Exception: and silently returns. This prevents the warning from being printed, even though the fast kernels are not active and the model will run on the slower pure PyTorch path.
To make this more robust, we should isolate the import and check of is_flash_linear_attention_available so that any failure/absence of the import is treated as fla_available = False, allowing the warning to be printed correctly. Additionally, we should avoid broad, silent exception handlers (like except Exception:) and instead catch specific exceptions (like ModuleNotFoundError for optional imports) and log them at a debug level to aid in troubleshooting.
    try:
        if not any(
            isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES) for t in model_types
        ):
            return
    except Exception as e:
        import logging
        logging.getLogger(__name__).debug("Error checking model types: %s", e)
        return
    try:
        from transformers.utils.import_utils import is_flash_linear_attention_available
        fla_available = is_flash_linear_attention_available()
    except ModuleNotFoundError as e:
        import logging
        logging.getLogger(__name__).debug("transformers.utils.import_utils or is_flash_linear_attention_available not found: %s", e)
        fla_available = False
    if fla_available:
        return  # bundled (or user-installed) fast kernels are active
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
- When catching an ImportError for an optional dependency, prefer catching the more specific ModuleNotFoundError and check the module name to avoid suppressing unrelated import errors.
- When handling exceptions, avoid broad except Exception: pass clauses. Instead, catch specific exceptions and log them (at least at a debug level) to aid in troubleshooting. If a failure is expected, log the specific exception type and its details.
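One detail of this suggestion is worth checking before adopting it. In Python, a from-import of a name that is missing from a module that does exist raises plain ImportError, not its subclass ModuleNotFoundError. A transformers version that has import_utils but lacks is_flash_linear_attention_available would therefore not be caught by except ModuleNotFoundError. The sketch below demonstrates this with the standard library and shows one possible probe that treats both cases as "not available". The helper name probe_optional_helper is hypothetical and not part of the PR.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_error_kind():
    """Report which exception a missing *name* (not a missing module) raises."""
    try:
        from json import no_such_helper  # json exists, the name does not
    except ModuleNotFoundError:
        return "ModuleNotFoundError"
    except ImportError:
        return "ImportError"
    return "imported"


def probe_optional_helper(module_name, helper_name):
    """Return bool(helper()) if the helper can be found, else False.

    Hypothetical helper for illustration. A missing module or a missing
    attribute both count as "not available"; unrelated import failures
    inside the module are re-raised instead of being swallowed.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        missing = e.name or ""
        # Only swallow the error if the requested module (or a parent
        # package of it) is the one that is missing.
        if not (module_name == missing or module_name.startswith(missing + ".")):
            raise
        logger.debug("Optional module %s not found: %s", module_name, e)
        return False
    helper = getattr(module, helper_name, None)
    if helper is None:
        logger.debug("%s has no attribute %s", module_name, helper_name)
        return False
    return bool(helper())


print(import_error_kind())  # ImportError
```

With a probe like this, a call such as probe_optional_helper("transformers.utils.import_utils", "is_flash_linear_attention_available") would yield False on a transformers version that lacks the helper, so the note would still print on the slow path.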
Codex Review: Didn't find any major issues. Nice work!
Normalized |
Summary
Gated-deltanet models (Qwen3.5, Qwen3.6, Qwen3-Next, kimi-linear) run their linear-attention layers on flash-linear-attention (fla) Triton kernels when available, and otherwise fall back to a several-times-slower pure PyTorch recurrence (measured 978 to 5573 tok/s on Qwen3.6-35B-A3B 4-bit at 4k, a 5.7x gap).
unsloth-zoo now bundles those kernels (see unslothai/unsloth-zoo#865) and injects them automatically, so no pip install flash-linear-attention is required. This updates the loader note accordingly.

Behavior
_maybe_advise_fla_install now keys on is_flash_linear_attention_available() (which the bundled injection sets true) rather than the mere presence of an fla install. As a result it stays silent whenever the fast kernels are active, and prints a one-time note only when a gated-deltanet model is loaded and the bundled kernels could not be enabled on the current setup (no CUDA, or torch < 2.7 / triton < 3.3), which is exactly when transformers uses the slow pure PyTorch path. The message no longer tells users to install anything, since the kernels ship with Unsloth.

Notes
Requires an unsloth-zoo build that includes the bundled kernels (unslothai/unsloth-zoo#865); on older unsloth-zoo without them, is_flash_linear_attention_available() is false unless the user installed fla, so the note still fires correctly.
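
The behavior described above can be sketched in a few lines. This is a standalone illustration under stated assumptions, not the PR's implementation. The prefix tuple, both function names, and the message wording are hypothetical. The version floors come from the description (torch 2.7, triton 3.3). The environment facts are passed in as plain arguments so the sketch runs without torch or triton installed.

```python
import re

# Version floors taken from the PR description.
MIN_TORCH = (2, 7)
MIN_TRITON = (3, 3)

# Hypothetical prefixes; the real FLA_MODEL_TYPE_PREFIXES may differ.
FLA_MODEL_TYPE_PREFIXES = ("qwen3_next", "kimi_linear")

_fla_advised = False  # module-level flag so the note prints at most once


def _major_minor(version):
    """Parse the leading 'major.minor' of strings like '2.7.1+cu126'."""
    m = re.match(r"(\d+)\.(\d+)", version or "")
    return (int(m.group(1)), int(m.group(2))) if m else (0, 0)


def bundled_kernels_supported(cuda_available, torch_version, triton_version):
    """True when the setup meets the stated requirements for the fast path."""
    return (
        bool(cuda_available)
        and _major_minor(torch_version) >= MIN_TORCH
        and _major_minor(triton_version) >= MIN_TRITON
    )


def maybe_note_slow_path(model_types, fast_kernels_active, emit=print):
    """Emit a one-time note for gated-deltanet models on the slow path.

    Returns True if the note was emitted by this call.
    """
    global _fla_advised
    if _fla_advised or not model_types:
        return False
    if isinstance(model_types, str):
        model_types = [model_types]
    if not any(
        isinstance(t, str) and t.startswith(FLA_MODEL_TYPE_PREFIXES)
        for t in model_types
    ):
        return False
    if fast_kernels_active:
        return False  # fast kernels are active, stay silent
    _fla_advised = True
    # Placeholder wording, not the message used in the PR.
    emit(
        "Fast gated-deltanet kernels could not be enabled on this setup "
        "(needs CUDA, torch >= 2.7 and triton >= 3.3); "
        "using the slower pure PyTorch path."
    )
    return True
```

In the PR itself the decision keys on is_flash_linear_attention_available() rather than on explicit version checks. The explicit check here only makes the stated requirements concrete.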