Skip to content

Honor an explicit sdpa or flex_attention request when flash is disabled#6847

Open
danielhanchen wants to merge 3 commits into
mainfrom
attn-impl-honor-explicit
Open

Honor an explicit sdpa or flex_attention request when flash is disabled#6847
danielhanchen wants to merge 3 commits into
mainfrom
attn-impl-honor-explicit

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

When flash attention is disabled for a model, the fallback selection could downgrade a caller who explicitly passed attn_implementation="sdpa" or "flex_attention" to a different backend, because the disable reason is flash-specific.

What this does

Keeps an explicit non-flash request as-is; flash requests still fall back as before. Includes a unit test.

danielhanchen and others added 2 commits July 3, 2026 14:16
When flash attention is disabled for a model, the fallback selection could
downgrade a caller who explicitly passed attn_implementation='sdpa' or
'flex_attention' to a different backend, because the disable reason is
flash-specific. Keep an explicit non-flash request as-is; flash requests
still fall back as before.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that explicitly requested non-flash attention implementations, such as "sdpa" or "flex_attention", are honored and not downgraded to fallback options when flash attention is disabled. A comprehensive set of unit tests has been added to verify this behavior. There are no review comments, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that explicit requests for non-flash attention implementations, specifically 'sdpa' and 'flex_attention', are honored and not downgraded when flash attention is disabled. A new test suite has been added to verify this behavior. There are no review comments, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cd1ab7706

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth/models/_utils.py
Comment on lines +624 to +625
if requested_attn_implementation in ("sdpa", "flex_attention"):
return _set_attn_impl(config, requested_attn_implementation)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Only honor flex when it was explicitly requested

When this helper is reached without an explicit attn_implementation argument (the default resolver path), requested_attn_implementation is populated from the config, so a stale/default _attn_implementation='flex_attention' is now returned even when supports_flex_attention is false (for example because flex was disabled by env, dropout, model exclusions, or unavailable imports). Prior code would fall back to SDPA/eager in that scenario; gate this new branch on the argument actually being passed explicitly, or unsupported flex can be selected during model loading.

Useful? React with 👍 / 👎.

Comment on lines +17 to +18
attn_implementation = "sdpa",
supports_sdpa = False, # conservative flag would have skipped sdpa

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor SDPA through the resolver path

This test covers the private helper with supports_sdpa=False, but real callers go through resolve_attention_implementation; after the helper returns sdpa, the resolver's final if not supports_sdpa and final_attn_impl == "sdpa" guard immediately switches it back to eager. In the flash-disabled + conservative supports_sdpa=False scenario described here, an explicit attn_implementation="sdpa" still is not honored despite this test passing, so the resolver path needs to be fixed or covered directly.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

The helper already honors an explicit sdpa or flex_attention request ahead of the supports_* fallback, and only falls back for a flash request or an implicit choice, so an explicit non-flash implementation is not downgraded. Left as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant