Auto-enable grouped MoE on loaded / PEFT'd models via loader hook by danielhanchen · Pull Request #6727 · unslothai/unsloth

danielhanchen · 2026-06-28T09:37:50Z

Change

Wires the grouped-GEMM MoE forward into the loader. After from_pretrained / get_peft_model build the model (and its compiled module), the loader installs the grouped forward on the live MoE block instances via wrap_loader_for_grouped_moe, so it wins over the compiled-cache class patch and survives cache regeneration.
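As a rough illustration of why installing the forward on the live instance takes precedence over a class-level patch, here is a minimal sketch in plain Python. ToyMoEBlock, class_patch_forward, and grouped_forward are hypothetical names used only for this example, and binding with types.MethodType is an assumption rather than a description of what unsloth_zoo does.

```python
import types


class ToyMoEBlock:
    """Hypothetical stand-in for an MoE block; not an unsloth class."""

    def forward(self, x):
        return ("loop", x)


def class_patch_forward(self, x):
    # Stand-in for a forward installed on the class (e.g. by a compiled-cache patch).
    return ("compiled", x)


def grouped_forward(self, x):
    # Stand-in for the grouped-GEMM forward.
    return ("grouped", x)


block = ToyMoEBlock()
# Instance-level install: stored in block.__dict__, which is looked up
# before plain functions defined on the class.
block.forward = types.MethodType(grouped_forward, block)

# A later class-level patch (for example after the cache is regenerated).
ToyMoEBlock.forward = class_patch_forward
other = ToyMoEBlock()

print(block.forward(1))  # ('grouped', 1)
print(other.forward(1))  # ('compiled', 1)
```

The instance that received the grouped forward keeps it even though the class was patched afterwards, while a fresh instance picks up the class-level forward.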

Applied to the leaf loaders in models/llama.py (FastLlamaModel) and models/vision.py (FastBaseModel), which the FastLanguageModel / FastModel / FastVisionModel entry points delegate to.
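The wrapping pattern described above could look roughly like the following. This is a sketch only: wrap_loader_sketch, toy_enable, and ToyLoader are hypothetical, the real wrap_loader_for_grouped_moe lives in unsloth_zoo and its signature is not shown in this PR, and the assumption that a loader returns either a model or a (model, tokenizer) tuple is made for illustration.

```python
import functools


def wrap_loader_sketch(loader, enable_fn):
    """Hypothetical stand-in for wrap_loader_for_grouped_moe."""
    if getattr(loader, "_grouped_moe_wrapped", False):
        return loader  # already wrapped: wrap the leaf only once

    @functools.wraps(loader)
    def wrapped(*args, **kwargs):
        result = loader(*args, **kwargs)
        # Assumption: the loader returns a model or a (model, tokenizer) tuple.
        model = result[0] if isinstance(result, tuple) else result
        try:
            enable_fn(model)  # runs after the model has been built
        except Exception:
            pass  # guarded so the hook never breaks model loading
        return result

    wrapped._grouped_moe_wrapped = True
    return wrapped


class ToyLoader:
    """Hypothetical leaf loader; the model is a plain dict here."""

    @staticmethod
    def from_pretrained(name):
        return {"name": name, "grouped": False}, "tokenizer"


def toy_enable(model):
    model["grouped"] = True


ToyLoader.from_pretrained = staticmethod(
    wrap_loader_sketch(ToyLoader.from_pretrained, toy_enable)
)
model, tokenizer = ToyLoader.from_pretrained("toy-moe")
print(model["grouped"])  # True
```

Marking the wrapped function with an attribute is one way to keep the wrapping idempotent if the patch block runs more than once.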

Safety

Gated by UNSLOTH_MOE_GROUPED (default on) and the strict eligibility in unsloth_zoo (frozen bnb-4bit ModuleList experts, no shared expert, no LoRA on experts, torch._grouped_mm support), so it is a no-op on transformers v5, non-MoE models, and unsupported hardware.
The import is wrapped in try/except, so this is a safe no-op if the unsloth_zoo module is not present.
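A minimal sketch of the two guards mentioned here, under stated assumptions: the exact values that UNSLOTH_MOE_GROUPED accepts are not given in this PR, so treating "0" as off and anything else (including unset) as on is a guess, and grouped_moe_requested and try_import_attr are hypothetical helper names.

```python
import importlib
import os


def grouped_moe_requested(environ=os.environ):
    # Assumption: unset means on (the PR says "default on"); "0" turns it off.
    return environ.get("UNSLOTH_MOE_GROUPED", "1") != "0"


def try_import_attr(module_path, attr):
    """Return module_path.attr, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_path)
        return getattr(module, attr)
    except Exception:
        # Broad on purpose in this sketch: a missing module must not break loading.
        return None


print(grouped_moe_requested({}))                             # True
print(grouped_moe_requested({"UNSLOTH_MOE_GROUPED": "0"}))   # False
print(try_import_attr("no_such_package_xyz.sub", "hook"))    # None
```

A caller would install the wrapper only when the helper returns something other than None, which is what makes the change a no-op when the module is absent.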

Depends on

The grouped forward itself: unslothai/unsloth-zoo#837. This loader change can land independently of that one (no-op until it is present).

Wraps the FastLlamaModel and FastBaseModel from_pretrained / get_peft_model leaves with wrap_loader_for_grouped_moe so the grouped-GEMM MoE forward is installed on the live instance after the model and its compiled module are built. Gated by UNSLOTH_MOE_GROUPED and wrapped in try/except, so it is a no-op when the unsloth_zoo module is absent or no eligible MoE block exists.

for more information, see https://pre-commit.ci

gemini-code-assist

Code Review

This pull request automatically enables grouped-GEMM MoE (Mixture of Experts) on built or PEFT'd models for both Llama and Vision architectures by wrapping their from_pretrained and get_peft_model methods. The feedback suggests applying this patch before calling PatchFastRL in llama.py to ensure that downstream registrations do not bypass the wrapped methods, which would silently disable the MoE optimization.

gemini-code-assist · 2026-06-28T09:39:15Z

+# Auto-enable grouped-GEMM MoE (transformers<5 ModuleList experts) on built / PEFT'd
+# models. Wraps the loader leaves once; guarded so it never breaks model loading.
+try:
+    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
+    FastLlamaModel.from_pretrained = staticmethod(
+        wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
+    )
+    FastLlamaModel.get_peft_model = staticmethod(


To ensure that the grouped-GEMM MoE loader wrapper is always active and cannot be bypassed, it is highly recommended to apply this patch before calling PatchFastRL(FastLanguageModel = FastLlamaModel). If PatchFastRL or any other downstream registration copies or wraps from_pretrained or get_peft_model from FastLlamaModel, applying this patch afterwards might result in the RL-patched entry points using the unwrapped/original methods, silently missing the MoE optimization.
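The ordering concern can be reproduced with a toy example. Everything here is hypothetical: capture_entry_point stands in for any patcher that copies the current function object, and it is not a claim about what PatchFastRL actually does.

```python
def wrap(fn):
    """Toy wrapper that marks the returned model as grouped."""
    def wrapped(*args, **kwargs):
        model = fn(*args, **kwargs)
        model["grouped"] = True
        return model
    return wrapped


def make_leaf():
    class Leaf:
        @staticmethod
        def from_pretrained():
            return {"grouped": False}
    return Leaf


def capture_entry_point(cls):
    # Hypothetical patcher: keeps a reference to whatever function is current.
    return cls.from_pretrained


# Order A: the patcher captures first, the wrapper is installed afterwards.
leaf_a = make_leaf()
captured_a = capture_entry_point(leaf_a)
leaf_a.from_pretrained = staticmethod(wrap(leaf_a.from_pretrained))

# Order B: the wrapper is installed first, then the patcher captures.
leaf_b = make_leaf()
leaf_b.from_pretrained = staticmethod(wrap(leaf_b.from_pretrained))
captured_b = capture_entry_point(leaf_b)

print(captured_a()["grouped"])  # False: the captured entry point bypasses the wrapper
print(captured_b()["grouped"])  # True
```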

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9d010dada

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T09:44:52Z

+# models. Wraps the loader leaves once; guarded so it never breaks model loading.
+try:
+    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe
+    FastLlamaModel.from_pretrained = staticmethod(wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained))


Enable grouped MoE only after PEFT adapters attach

When FastLanguageModel.from_pretrained loads a PEFT adapter repo, loader.py calls this leaf loader first (dispatch_model.from_pretrained, around line 788) and only attaches the adapter later with PeftModel.from_pretrained (around lines 865-878). Wrapping the leaf here enables grouped MoE on the base model before any LoRA modules exist, so the eligibility check cannot reject LoRA-on-expert adapters, and there is no later recheck. Adapters that target MoE expert projections, such as the default gate/up/down targets, then keep the grouped forward built from base weights and silently ignore those LoRA weights. Move the enable step to after PEFT attachment, or rerun/restore the grouped patch after adapters are loaded.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 51f07a6656

chatgpt-codex-connector · 2026-06-28T10:53:01Z

+    FastBaseModel.from_pretrained = staticmethod(
+        wrap_loader_for_grouped_moe(FastBaseModel.from_pretrained)
+    )


Move grouped MoE enable after adapter attachment

When FastModel/FastVisionModel loads a PEFT adapter repo (is_peft), loader.py first calls FastBaseModel.from_pretrained and only attaches the adapter later via PeftModel.from_pretrained. Wrapping this leaf enables grouped MoE on the base model before any LoRA modules exist, so the no-LoRA-on-experts eligibility check can pass and there is no later recheck after adapter attachment; expert-targeted MoE adapters can then run through the grouped forward built from base weights and silently ignore those adapter weights. Please move or rerun the grouped-MoE hook after the is_peft adapter attach path as well.

danielhanchen · 2026-07-01T12:24:12Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 51f07a6656

chatgpt-codex-connector · 2026-07-01T12:33:26Z

+# models. Wrap the loader leaves before PatchFastRL so any patcher that captures these
+# entry points sees the wrapped versions. Guarded so it never breaks model loading.
+try:
+    from unsloth_zoo.temporary_patches.moe_grouped_modulelist import wrap_loader_for_grouped_moe


Point the grouped-MoE import at a shipped module

In normal installs using the declared unsloth_zoo>=2026.6.7 dependency, this module path does not resolve, and the broad except silently skips the whole wrapping block. That means transformers<5 ModuleList MoE models loaded through this path (and the matching vision.py block) keep using the old expert-loop implementation instead of the intended grouped-GEMM patch, so the new auto-enable behavior is effectively a no-op unless users happen to have an unpublished zoo build.

chatgpt-codex-connector · 2026-07-01T12:33:26Z

+    FastLlamaModel.from_pretrained = staticmethod(
+        wrap_loader_for_grouped_moe(FastLlamaModel.from_pretrained)
+    )


Patch PEFT adapter loads after adapters attach

When a user loads an existing LoRA/PEFT adapter repo through FastLanguageModel.from_pretrained, this wrapped leaf only returns the base model; loader.py attaches the adapter afterward via PeftModel.from_pretrained and dispatch_model.patch_peft_model at lines 865-878. The grouped-MoE wrapper therefore never sees the final PEFT model for that common adapter-loading path, while wrapping get_peft_model only covers newly-created adapters, so loaded MoE adapter checkpoints miss the intended PEFT-aware grouped-GEMM conversion.

When loading an existing adapter through FastLanguageModel.from_pretrained, the base model is evaluated for grouped MoE when the wrapped from_pretrained leaf returns, but the adapter is attached afterwards via PeftModel and patch_peft_model. Re-run auto_enable_grouped_moe on the final model so blocks whose experts gained LoRA are restored to the original loop, attention-only adapters keep the grouped path on their frozen experts, and recompute is re-derived from the final gradient-checkpointing state. Guarded so it never blocks adapter loading.
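The re-evaluation described in this commit message can be sketched with toy objects. ToyBlock, its expert_lora flag, and auto_enable_sketch are hypothetical; the real auto_enable_grouped_moe is in unsloth_zoo and its eligibility rules are richer than this single flag.

```python
class ToyBlock:
    """Hypothetical MoE block with one eligibility flag."""

    def __init__(self, name):
        self.name = name
        self.expert_lora = False  # True once an adapter targets this block's experts
        self.mode = "loop"        # "loop" = original expert loop, "grouped" = grouped GEMM


def auto_enable_sketch(blocks):
    """Idempotent re-evaluation: derive each block's mode from its current state."""
    for block in blocks:
        block.mode = "loop" if block.expert_lora else "grouped"


blocks = [ToyBlock("layer0"), ToyBlock("layer1")]

auto_enable_sketch(blocks)             # base model load: no LoRA exists yet
blocks[1].expert_lora = True           # adapter attached later targets layer1 experts
stale = [b.mode for b in blocks]       # decision made before the adapter existed

auto_enable_sketch(blocks)             # re-run on the final model
fresh = [b.mode for b in blocks]

print(stale)  # ['grouped', 'grouped']
print(fresh)  # ['grouped', 'loop']
```

Because the function recomputes the mode from the current state instead of only switching blocks on, running it a second time restores the loop on blocks that became ineligible and leaves the rest untouched.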

danielhanchen · 2026-07-02T12:24:29Z

@codex review

chatgpt-codex-connector · 2026-07-02T12:29:48Z

Codex Review: Didn't find any major issues. Chef's kiss.

Reviewed commit: ef8d27bc69

danielhanchen requested a review from Datta0 as a code owner June 28, 2026 09:37

[pre-commit.ci] auto fixes from pre-commit.com hooks

c0bc6ef

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed Jun 28, 2026

chatgpt-codex-connector Bot reviewed Jun 28, 2026

Install grouped-MoE loader wrappers before PatchFastRL

51f07a6

chatgpt-codex-connector Bot reviewed Jun 28, 2026

chatgpt-codex-connector Bot reviewed Jul 1, 2026

danielhanchen and others added 2 commits July 1, 2026 12:49

[pre-commit.ci] auto fixes from pre-commit.com hooks

ef8d27b

for more information, see https://pre-commit.ci

Trim comments in the grouped MoE loader hooks

ec6fe59

Shorten the loader re-eval and llama.py wrapper comments; code is unchanged (verified comment-only).
