Auto-enable grouped MoE on loaded / PEFT'd models via loader hook #6727
danielhanchen wants to merge 6 commits into
Wraps the FastLlamaModel and FastBaseModel from_pretrained / get_peft_model leaves with wrap_loader_for_grouped_moe so the grouped-GEMM MoE forward is installed on the live instance after the model and its compiled module are built. Gated by UNSLOTH_MOE_GROUPED and wrapped in try/except, so it is a no-op when the unsloth_zoo module is absent or no eligible MoE block exists.
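A minimal sketch of the gating pattern described above, assuming the gate is an environment variable that is on unless set to "0" and that a loader may return either a model or a (model, tokenizer) tuple. `wrap_loader`, `toy_loader`, and `toy_enable` are hypothetical stand-ins; the real `wrap_loader_for_grouped_moe` lives in unsloth_zoo and may behave differently.

```python
import functools
import os


def wrap_loader(loader, enable_hook, env_var="UNSLOTH_MOE_GROUPED"):
    """Return a loader that runs `enable_hook` on whatever `loader` builds.

    Hypothetical stand-in for wrap_loader_for_grouped_moe, for illustration only.
    """

    @functools.wraps(loader)
    def wrapped(*args, **kwargs):
        result = loader(*args, **kwargs)
        # Assumed convention: the gate is on unless explicitly set to "0".
        if os.environ.get(env_var, "1") == "0":
            return result
        try:
            # Loaders may return (model, tokenizer); hook only the model.
            model = result[0] if isinstance(result, tuple) else result
            enable_hook(model)
        except Exception:
            # Never let the optimization break model loading.
            pass
        return result

    return wrapped


class ToyModel:
    grouped = False


def toy_loader():
    return ToyModel(), "tokenizer"


def toy_enable(model):
    model.grouped = True


wrapped_loader = wrap_loader(toy_loader, toy_enable)
model, tokenizer = wrapped_loader()
```

The hook runs after the loader returns, so it sees the finished object, and any failure inside the hook leaves the loader's result untouched.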
Code Review
This pull request automatically enables grouped-GEMM MoE (Mixture of Experts) on built or PEFT'd models for both Llama and Vision architectures by wrapping their from_pretrained and get_peft_model methods. The feedback suggests applying this patch before calling PatchFastRL in llama.py to ensure that downstream registrations do not bypass the wrapped methods, which would silently disable the MoE optimization.
    # Auto-enable grouped-GEMM MoE (transformers<5 ModuleList experts) on built / PEFT'd
    # models. Wraps the loader leaves once; guarded so it never breaks model loading.
    try:
        from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
        FastLlamaModel.from_pretrained = staticmethod(
            wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
        )
        FastLlamaModel.get_peft_model = staticmethod(
To ensure that the grouped-GEMM MoE loader wrapper is always active and cannot be bypassed, it is highly recommended to apply this patch before calling PatchFastRL(FastLanguageModel = FastLlamaModel). If PatchFastRL or any other downstream registration copies or wraps from_pretrained or get_peft_model from FastLlamaModel, applying this patch afterwards might result in the RL-patched entry points using the unwrapped/original methods, silently missing the MoE optimization.
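The ordering concern can be illustrated with a toy example. This is only a sketch: `Loader`, `wrap`, and `register` are hypothetical, and whether `PatchFastRL` actually copies these entry points is not confirmed here.

```python
class Loader:
    @staticmethod
    def from_pretrained():
        return "base"


def wrap(fn):
    # Stand-in for a loader wrapper that adds post-load behavior.
    def wrapped():
        return fn() + "+grouped"
    return wrapped


def register(cls):
    # Stand-in for a patcher that copies the entry point at call time.
    return cls.from_pretrained


# Registration before wrapping captures the original function.
captured_early = register(Loader)

Loader.from_pretrained = staticmethod(wrap(Loader.from_pretrained))

# Registration after wrapping captures the wrapped function.
captured_late = register(Loader)

early_result = captured_early()
late_result = captured_late()
```

Here `early_result` still comes from the unwrapped loader, while `late_result` goes through the wrapper, which is why wrapping first is the safer order when a later step may hold on to the reference.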
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9d010dada
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    # models. Wraps the loader leaves once; guarded so it never breaks model loading.
    try:
        from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
        FastLlamaModel.from_pretrained = staticmethod(wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained))
Enable grouped MoE only after PEFT adapters attach
When FastLanguageModel.from_pretrained loads a PEFT adapter repo, loader.py calls this leaf loader first (dispatch_model.from_pretrained around line 788) and only attaches the adapter later with PeftModel.from_pretrained around lines 865-878. Wrapping the leaf here enables grouped MoE on the base model before any LoRA modules exist, so the eligibility check cannot reject LoRA-on-expert adapters and there is no later recheck; adapters targeting MoE expert projections such as the default gate/up/down targets then keep the grouped forward built from base weights and silently ignore those LoRA weights. Move the enable step to after PEFT attachment or rerun/restore the grouped patch after adapters are loaded.
💡 Codex Review
Reviewed commit: 51f07a6656
        FastBaseModel.from_pretrained = staticmethod(
            wrap_loader_for_grouped_moe(FastBaseModel.from_pretrained)
        )
Move grouped MoE enable after adapter attachment
When FastModel/FastVisionModel loads a PEFT adapter repo (is_peft), loader.py first calls FastBaseModel.from_pretrained and only attaches the adapter later via PeftModel.from_pretrained. Wrapping this leaf enables grouped MoE on the base model before any LoRA modules exist, so the no-LoRA-on-experts eligibility check can pass and there is no later recheck after adapter attachment; expert-targeted MoE adapters can then run through the grouped forward built from base weights and silently ignore those adapter weights. Please move or rerun the grouped-MoE hook after the is_peft adapter attach path as well.
@codex review
💡 Codex Review
Reviewed commit: 51f07a6656
    # models. Wrap the loader leaves before PatchFastRL so any patcher that captures these
    # entry points sees the wrapped versions. Guarded so it never breaks model loading.
    try:
        from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
Point the grouped-MoE import at a shipped module
In normal installs using the declared unsloth_zoo>=2026.6.7 dependency, this module path does not resolve, and the broad except silently skips the whole wrapping block. That means transformers<5 ModuleList MoE models loaded through this path (and the matching vision.py block) keep using the old expert-loop implementation instead of the intended grouped-GEMM patch, so the new auto-enable behavior is effectively a no-op unless users happen to have an unpublished zoo build.
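One way to make a missing module visible rather than silently skipped is to catch `ImportError` specifically and log it. This is a sketch of that general pattern, not what the PR does; `try_import_wrapper` and the logger name are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger("grouped_moe_loader")


def try_import_wrapper(module_path, attr):
    """Return `attr` from `module_path`, or None (with a warning) if unavailable."""
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        # Surface the reason instead of swallowing it in a broad except.
        logger.warning("grouped MoE disabled: cannot import %s (%s)", module_path, exc)
        return None
    return getattr(module, attr, None)


# A module that does not exist yields None and a logged warning.
missing = try_import_wrapper("module_that_does_not_exist_xyz", "wrap")

# A module that exists yields the requested attribute.
found = try_import_wrapper("json", "dumps")
```

With a warning in the log, an install that lacks the module shows up as a disabled optimization instead of an unexplained fallback to the expert loop.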
        FastLlamaModel.from_pretrained = staticmethod(
            wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
        )
Patch PEFT adapter loads after adapters attach
When a user loads an existing LoRA/PEFT adapter repo through FastLanguageModel.from_pretrained, this wrapped leaf only returns the base model; loader.py attaches the adapter afterward via PeftModel.from_pretrained and dispatch_model.patch_peft_model at lines 865-878. The grouped-MoE wrapper therefore never sees the final PEFT model for that common adapter-loading path, while wrapping get_peft_model only covers newly-created adapters, so loaded MoE adapter checkpoints miss the intended PEFT-aware grouped-GEMM conversion.
When loading an existing adapter through FastLanguageModel.from_pretrained, the base model is evaluated for grouped MoE when the wrapped from_pretrained leaf returns, but the adapter is attached afterwards via PeftModel and patch_peft_model. Re-run auto_enable_grouped_moe on the final model so blocks whose experts gained LoRA are restored to the original loop, attention-only adapters keep the grouped path on their frozen experts, and recompute is re-derived from the final gradient-checkpointing state. Guarded so it never blocks adapter loading.
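The re-evaluation idea can be sketched with toy objects. Everything here is hypothetical (`ToyMoEBlock`, `is_eligible`, `auto_enable`); the real `auto_enable_grouped_moe` and its eligibility rules live in unsloth_zoo and are not reproduced.

```python
class ToyMoEBlock:
    def __init__(self):
        self.experts_have_lora = False
        self.forward_kind = "loop"


def is_eligible(block):
    # Assumed rule for this sketch: LoRA on experts disqualifies a block.
    return not block.experts_have_lora


def auto_enable(blocks):
    """Re-runnable: grouped where eligible, original loop otherwise."""
    for block in blocks:
        block.forward_kind = "grouped" if is_eligible(block) else "loop"


blocks = [ToyMoEBlock(), ToyMoEBlock()]

# First pass runs when the base model is returned: no LoRA exists yet.
auto_enable(blocks)

# Adapter attach happens afterwards and targets the experts of block 0.
blocks[0].experts_have_lora = True
stale = [b.forward_kind for b in blocks]

# Second pass on the final model restores the loop where LoRA was added.
auto_enable(blocks)
fresh = [b.forward_kind for b in blocks]
```

Without the second pass, block 0 would keep the grouped forward chosen before its experts gained LoRA; making the enable step idempotent lets it be re-run safely on the final model.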
@codex review
Codex Review: Didn't find any major issues. Chef's kiss.
Shorten the loader re-eval and llama.py wrapper comments; code is unchanged (verified comment-only).
Change
Wires the grouped-GEMM MoE forward into the loader. After from_pretrained / get_peft_model build the model (and its compiled module), the loader installs the grouped forward on the live MoE block instances via wrap_loader_for_grouped_moe, so it wins over the compiled-cache class patch and survives cache regeneration.

Applied to the leaf loaders in models/llama.py (FastLlamaModel) and models/vision.py (FastBaseModel), which the FastLanguageModel / FastModel / FastVisionModel entry points delegate to.

Safety
- Gated by UNSLOTH_MOE_GROUPED (default on) and the strict eligibility in unsloth_zoo (frozen bnb-4bit ModuleList experts, no shared expert, no LoRA on experts, torch._grouped_mm support), so it is a no-op on transformers v5, non-MoE models, and unsupported hardware.
- Wrapped in try/except, so it is a no-op when the unsloth_zoo module is not present.

Depends on
The grouped forward itself: unslothai/unsloth-zoo#837. This loader change can land independently of that one (no-op until it is present).
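The claim that an instance-level forward wins over a class-level patch follows from Python attribute lookup, and can be sketched without any ML library. The class and function names below are hypothetical and only illustrate the lookup order, not the actual patching code.

```python
import types


class MoEBlock:
    def forward(self, x):
        return ("loop", x)


def grouped_forward(self, x):
    return ("grouped", x)


def compiled_forward(self, x):
    return ("compiled", x)


block = MoEBlock()
other = MoEBlock()

# Install the grouped forward on one live instance only.
block.forward = types.MethodType(grouped_forward, block)

# A later class-level patch (e.g. a regenerated compiled class) replaces the
# class attribute, but the instance attribute still shadows it.
MoEBlock.forward = compiled_forward

patched_result = block.forward(1)
unpatched_result = other.forward(1)
```

Because plain functions are non-data descriptors, the entry in the instance's `__dict__` is found first, so the instance keeps the grouped forward while other instances pick up the class-level replacement.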