Auto-enable grouped MoE on loaded / PEFT'd models via loader hook#6727

Open
danielhanchen wants to merge 6 commits into main from moe-grouped-autohook

Conversation

@danielhanchen

Member

Change

Wires the grouped-GEMM MoE forward into the loader. After from_pretrained / get_peft_model build the model (and its compiled module), the loader installs the grouped forward on the live MoE block instances via wrap_loader_for_grouped_moe, so it wins over the compiled-cache class patch and survives cache regeneration.

Applied to the leaf loaders in models/llama.py (FastLlamaModel) and models/vision.py (FastBaseModel), which the FastLanguageModel / FastModel / FastVisionModel entry points delegate to.
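
The wrapping pattern described above can be sketched with toy classes. This is only an illustration under stated assumptions: `ToyMoEBlock`, `ToyModel`, `ToyLoader`, `install_grouped_forward`, and `wrap_loader` are hypothetical stand-ins, not the actual `wrap_loader_for_grouped_moe` implementation from unsloth_zoo. It shows one plausible reason an instance-level install can win over a class-level patch: an attribute bound on the instance shadows the class attribute.

```python
import functools


class ToyMoEBlock:
    """Stand-in for an MoE block (hypothetical, not a transformers class)."""

    def forward(self, x):
        return ("loop", x)


class ToyModel:
    def __init__(self):
        self.blocks = [ToyMoEBlock(), ToyMoEBlock()]


def grouped_forward(self, x):
    # Hypothetical replacement forward.
    return ("grouped", x)


def install_grouped_forward(model):
    """Bind the replacement forward on each live block instance."""
    for block in model.blocks:
        # An instance attribute shadows the class attribute, so a later
        # class-level patch of ToyMoEBlock.forward does not undo it.
        block.forward = grouped_forward.__get__(block, type(block))
    return model


def wrap_loader(loader):
    """Return a loader that post-processes whatever the original returns."""

    @functools.wraps(loader)
    def wrapped(*args, **kwargs):
        result = loader(*args, **kwargs)
        # Assumed return shapes: a model, or a (model, tokenizer) tuple.
        model = result[0] if isinstance(result, tuple) else result
        try:
            install_grouped_forward(model)
        except Exception:
            pass  # the hook must never break loading
        return result

    return wrapped


class ToyLoader:
    @staticmethod
    def from_pretrained(name):
        return ToyModel(), "tokenizer-for-" + name


# Same shape as the PR: re-wrap the static method once at import time.
ToyLoader.from_pretrained = staticmethod(wrap_loader(ToyLoader.from_pretrained))

model, tokenizer = ToyLoader.from_pretrained("toy")

# Simulate a later class-level patch (for example, a regenerated cache).
ToyMoEBlock.forward = lambda self, x: ("class-patched", x)

print(model.blocks[0].forward(1))
```

Blocks created after the class-level patch still pick up the class attribute; only the instances the wrapped loader touched keep the installed forward.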

Safety

  • Gated by UNSLOTH_MOE_GROUPED (default on) and the strict eligibility in unsloth_zoo (frozen bnb-4bit ModuleList experts, no shared expert, no LoRA on experts, torch._grouped_mm support), so it is a no-op on transformers v5, non-MoE models, and unsupported hardware.
  • The import is wrapped in try/except, so this is a safe no-op if the unsloth_zoo module is not present.
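
A minimal sketch of the gating logic listed above, assuming a default-on environment flag and a conjunction of eligibility conditions. The flag parsing rule and the dictionary field names (`experts_frozen_bnb4`, `has_shared_expert`, `lora_on_experts`, `grouped_mm_available`) are assumptions made for illustration; the real checks live in unsloth_zoo and may differ.

```python
import os


def grouped_moe_enabled(environ=os.environ):
    """Default-on flag. Assumed parsing: "0", "false", "off", "no" disable it."""
    value = environ.get("UNSLOTH_MOE_GROUPED", "1")
    return value.strip().lower() not in {"0", "false", "off", "no"}


def is_eligible(block):
    """Toy eligibility check mirroring the conditions in the description.

    `block` is a plain dict with hypothetical field names.
    """
    return bool(
        block.get("experts_frozen_bnb4", False)
        and not block.get("has_shared_expert", False)
        and not block.get("lora_on_experts", False)
        and block.get("grouped_mm_available", False)
    )


def maybe_enable(block, environ=os.environ):
    """Return True only when the flag is on and every condition holds."""
    if not grouped_moe_enabled(environ):
        return False
    return is_eligible(block)


eligible_block = {
    "experts_frozen_bnb4": True,
    "has_shared_expert": False,
    "lora_on_experts": False,
    "grouped_mm_available": True,
}
lora_block = dict(eligible_block, lora_on_experts=True)

print(maybe_enable(eligible_block, {}))
print(maybe_enable(eligible_block, {"UNSLOTH_MOE_GROUPED": "0"}))
print(maybe_enable(lora_block, {}))
```

Any single failing condition makes the whole step a no-op, which matches the "strict eligibility" wording in the description.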

Depends on

The grouped forward itself: unslothai/unsloth-zoo#837. This loader change can land independently of that one (no-op until it is present).

Wraps the FastLlamaModel and FastBaseModel from_pretrained / get_peft_model leaves with wrap_loader_for_grouped_moe so the grouped-GEMM MoE forward is installed on the live instance after the model and its compiled module are built. Gated by UNSLOTH_MOE_GROUPED and wrapped in try/except, so it is a no-op when the unsloth_zoo module is absent or no eligible MoE block exists.
@danielhanchen danielhanchen requested a review from Datta0 as a code owner June 28, 2026 09:37

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request automatically enables grouped-GEMM MoE (Mixture of Experts) on built or PEFT'd models for both Llama and Vision architectures by wrapping their from_pretrained and get_peft_model methods. The feedback suggests applying this patch before calling PatchFastRL in llama.py to ensure that downstream registrations do not bypass the wrapped methods, which would silently disable the MoE optimization.

Comment thread unsloth/models/llama.py Outdated
Comment on lines +3741 to +3748
# Auto-enable grouped-GEMM MoE (transformers<5 ModuleList experts) on built / PEFT'd
# models. Wraps the loader leaves once; guarded so it never breaks model loading.
try:
    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
    FastLlamaModel.from_pretrained = staticmethod(
        wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
    )
    FastLlamaModel.get_peft_model = staticmethod(

Contributor

medium

To ensure that the grouped-GEMM MoE loader wrapper is always active and cannot be bypassed, it is highly recommended to apply this patch before calling PatchFastRL(FastLanguageModel = FastLlamaModel). If PatchFastRL or any other downstream registration copies or wraps from_pretrained or get_peft_model from FastLlamaModel, applying this patch afterwards might result in the RL-patched entry points using the unwrapped/original methods, silently missing the MoE optimization.
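
The ordering concern can be reproduced with a toy example. Whether `PatchFastRL` actually captures these entry points is not confirmed by this thread; `capture_entry_point` below is a hypothetical stand-in for any downstream code that keeps a reference to the function as it exists at registration time.

```python
def make_leaf():
    """Build a fresh toy loader class so each ordering starts clean."""

    class Leaf:
        @staticmethod
        def from_pretrained():
            return "model"

    return Leaf


def capture_entry_point(leaf):
    """Hypothetical downstream patcher: keeps the current function object."""
    return leaf.from_pretrained


def add_hook(fn):
    """Wrap a loader so its result is marked as post-processed."""

    def hooked():
        return fn() + "+grouped"

    return hooked


# Ordering A: capture first, wrap afterwards. The captured reference
# still points at the original function and bypasses the hook.
leaf_a = make_leaf()
captured_a = capture_entry_point(leaf_a)
leaf_a.from_pretrained = staticmethod(add_hook(leaf_a.from_pretrained))

# Ordering B: wrap first, capture afterwards. The captured reference
# is the wrapped function.
leaf_b = make_leaf()
leaf_b.from_pretrained = staticmethod(add_hook(leaf_b.from_pretrained))
captured_b = capture_entry_point(leaf_b)

print(captured_a())  # model
print(captured_b())  # model+grouped
```

In ordering A the class attribute itself is wrapped, so callers going through the class still get the hook; only the earlier captured reference misses it.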

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9d010dada

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth/models/llama.py Outdated
# models. Wraps the loader leaves once; guarded so it never breaks model loading.
try:
    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
    FastLlamaModel.from_pretrained = staticmethod(wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained))

[P1] Enable grouped MoE only after PEFT adapters attach

When FastLanguageModel.from_pretrained loads a PEFT adapter repo, loader.py calls this leaf loader first (dispatch_model.from_pretrained around line 788) and only attaches the adapter later with PeftModel.from_pretrained around lines 865-878. Wrapping the leaf here enables grouped MoE on the base model before any LoRA modules exist, so the eligibility check cannot reject LoRA-on-expert adapters and there is no later recheck; adapters targeting MoE expert projections such as the default gate/up/down targets then keep the grouped forward built from base weights and silently ignore those LoRA weights. Move the enable step to after PEFT attachment or rerun/restore the grouped patch after adapters are loaded.
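
The failure mode described here can be illustrated with plain numbers. This is a toy model of the problem, not the real kernel path: `ToyExpert` and `build_grouped_forward` are hypothetical, and the "grouped" function simply reads base weights only, which is the behavior the review comment attributes to a grouped forward enabled before adapters exist.

```python
class ToyExpert:
    def __init__(self, weight):
        self.weight = weight    # frozen base weight
        self.lora_delta = 0.0   # adapter contribution, zero until attached

    def forward(self, x):
        # The per-expert loop sees base weight plus adapter delta.
        return (self.weight + self.lora_delta) * x


def build_grouped_forward(experts):
    """Toy grouped path that reads base weights only."""
    base = [e.weight for e in experts]

    def grouped(x):
        return [w * x for w in base]

    return grouped


experts = [ToyExpert(2.0), ToyExpert(3.0)]

# Step 1: the base model loads and the grouped path is enabled.
grouped = build_grouped_forward(experts)

# Step 2: an adapter targeting the experts attaches afterwards.
for expert in experts:
    expert.lora_delta = 0.5

loop_out = [e.forward(1.0) for e in experts]
grouped_out = grouped(1.0)

print(loop_out)     # [2.5, 3.5]
print(grouped_out)  # [2.0, 3.0]
```

The two outputs disagree once the adapter is attached, and nothing raises an error, which is why the comment calls the problem silent.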

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51f07a6656

Comment thread unsloth/models/vision.py
Comment on lines +2127 to +2129
FastBaseModel.from_pretrained = staticmethod(
    wrap_loader_for_grouped_moe(FastBaseModel.from_pretrained)
)

[P1] Move grouped MoE enable after adapter attachment

When FastModel/FastVisionModel loads a PEFT adapter repo (is_peft), loader.py first calls FastBaseModel.from_pretrained and only attaches the adapter later via PeftModel.from_pretrained. Wrapping this leaf enables grouped MoE on the base model before any LoRA modules exist, so the no-LoRA-on-experts eligibility check can pass and there is no later recheck after adapter attachment; expert-targeted MoE adapters can then run through the grouped forward built from base weights and silently ignore those adapter weights. Please move or rerun the grouped-MoE hook after the is_peft adapter attach path as well.

@danielhanchen

Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51f07a6656

Comment thread unsloth/models/llama.py
# models. Wrap the loader leaves before PatchFastRL so any patcher that captures these
# entry points sees the wrapped versions. Guarded so it never breaks model loading.
try:
    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe

[P2] Point the grouped-MoE import at a shipped module

In normal installs using the declared unsloth_zoo>=2026.6.7 dependency, this module path does not resolve, and the broad except silently skips the whole wrapping block. That means transformers<5 ModuleList MoE models loaded through this path (and the matching vision.py block) keep using the old expert-loop implementation instead of the intended grouped-GEMM patch, so the new auto-enable behavior is effectively a no-op unless users happen to have an unpublished zoo build.
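
One way to tell a missing module apart from an ineligible model is to check whether the dotted path resolves before relying on the guarded import. The helper below is a sketch using the standard library only; `module_resolves` is a hypothetical name and is not part of unsloth or unsloth_zoo.

```python
import importlib.util


def module_resolves(dotted_name):
    """Return True if `dotted_name` can be located, without importing the leaf.

    find_spec imports parent packages, so any failure there is treated
    as "does not resolve" rather than propagated.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except Exception:
        return False


# Standard-library names are used so the example runs anywhere.
print(module_resolves("json.decoder"))
print(module_resolves("json.no_such_submodule_xyz"))
print(module_resolves("no_such_package_xyz.sub"))
```

Applied to the path in this PR, the argument would be `"unsloth_zoo.temporary_patches.moe_grouped_modulelist"`; a `False` result would indicate the installed zoo build does not ship the module.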

Comment thread unsloth/models/llama.py
Comment on lines +3743 to +3745
FastLlamaModel.from_pretrained = staticmethod(
    wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
)

[P2] Patch PEFT adapter loads after adapters attach

When a user loads an existing LoRA/PEFT adapter repo through FastLanguageModel.from_pretrained, this wrapped leaf only returns the base model; loader.py attaches the adapter afterward via PeftModel.from_pretrained and dispatch_model.patch_peft_model at lines 865-878. The grouped-MoE wrapper therefore never sees the final PEFT model for that common adapter-loading path, while wrapping get_peft_model only covers newly-created adapters, so loaded MoE adapter checkpoints miss the intended PEFT-aware grouped-GEMM conversion.

danielhanchen and others added 2 commits July 1, 2026 12:49
When loading an existing adapter through FastLanguageModel.from_pretrained,
the base model is evaluated for grouped MoE when the wrapped from_pretrained
leaf returns, but the adapter is attached afterwards via PeftModel and
patch_peft_model. Re-run auto_enable_grouped_moe on the final model so
blocks whose experts gained LoRA are restored to the original loop,
attention-only adapters keep the grouped path on their frozen experts, and
recompute is re-derived from the final gradient-checkpointing state. Guarded
so it never blocks adapter loading.
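
The re-evaluation described in this commit message can be sketched as an idempotent pass over the blocks. The sketch assumes a per-block mode derived from current state; `ToyBlock` and `reevaluate` are hypothetical and do not reflect the real signature or behavior of `auto_enable_grouped_moe`.

```python
class ToyBlock:
    def __init__(self, name, lora_on_experts=False):
        self.name = name
        self.lora_on_experts = lora_on_experts
        self.mode = "loop"  # original per-expert loop


def reevaluate(blocks):
    """Idempotent pass: derive each block's mode from its current state."""
    for block in blocks:
        # Experts with LoRA go back to the loop; frozen experts stay grouped.
        block.mode = "loop" if block.lora_on_experts else "grouped"
    return blocks


blocks = [ToyBlock("layer0"), ToyBlock("layer1")]

# First pass: the base model has no adapters, so every block is grouped.
reevaluate(blocks)
modes_after_base_load = [b.mode for b in blocks]

# An adapter attaches and targets the experts of layer1 only.
blocks[1].lora_on_experts = True

# Second pass on the final model, guarded so it cannot block loading.
try:
    reevaluate(blocks)
except Exception:
    pass
modes_after_adapter = [b.mode for b in blocks]

print(modes_after_base_load)  # ['grouped', 'grouped']
print(modes_after_adapter)    # ['grouped', 'loop']
```

Because the pass derives the mode from state rather than toggling it, running it twice on an unchanged model leaves the result the same.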
@danielhanchen

Member Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Chef's kiss.

Reviewed commit: ef8d27bc69

Shorten the loader re-eval and llama.py wrapper comments; code is unchanged
(verified comment-only).