Keep DeepSeek-V4 hyper-connection mixers eager to stop backward inf #859

Open
danielhanchen wants to merge 2 commits into main from dsv4-mhc-eager-compile

@danielhanchen

Summary

Training DeepSeek-V4 with compiled modules produced inf grad_norm at step 2 and NaN weights by step 3 while the forward loss stayed finite. Bisecting the compiled cache per function isolated the manifold-constrained hyper-connection stream mixers (DeepseekV4HyperConnection / DeepseekV4HyperHead): their Sinkhorn-Knopp normalization chains twenty comb/(sum+eps) divisions with sigmoid and softmax mixing, and Inductor's fused backward of that chain overflows to inf at realistic gradient magnitudes. A random-init repro stays finite; the real model's gradient scale is required. Compiling everything except these two modules is stable.
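For readers unfamiliar with the normalization involved, here is a minimal pure-Python sketch of a textbook Sinkhorn-Knopp loop built from the same `comb / (sum + eps)` step. It is not the mixer code from the model: the function name, the 10-iteration default (two divisions per iteration, which is one way to arrive at twenty), and the `eps` value are assumptions, and the sigmoid/softmax mixing is omitted. It only shows the forward chain; it does not reproduce the backward overflow.

```python
# Illustrative sketch only: the real mixer implementation is not shown in this PR.
def sinkhorn_knopp(comb, iters=10, eps=1e-6):
    """Alternate row and column normalization of a non-negative square matrix.

    Each iteration performs two comb / (sum + eps) divisions, so the assumed
    default of 10 iterations yields a chain of twenty divisions.
    """
    n = len(comb)
    comb = [row[:] for row in comb]  # work on a copy, leave the input untouched
    for _ in range(iters):
        # Row step: divide every entry by its (eps-padded) row sum.
        for i in range(n):
            s = sum(comb[i]) + eps
            comb[i] = [v / s for v in comb[i]]
        # Column step: divide every entry by its (eps-padded) column sum.
        for j in range(n):
            s = sum(comb[i][j] for i in range(n)) + eps
            for i in range(n):
                comb[i][j] /= s
    return comb


mixed = sinkhorn_knopp([[1.0, 2.0], [3.0, 4.0]])
row_sums = [sum(r) for r in mixed]
col_sums = [sum(mixed[i][j] for i in range(2)) for j in range(2)]
```

Every division feeds the next one, so the backward pass has to differentiate through the whole chain, which is where the fused kernel is reported to overflow.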

What this does

Adds both classes to DISABLE_COMPILE_MODULES alongside the other numerically sensitive exclusions. RMSNorm, MLP, router, MLA attention, MoE, and the fused cross entropy stay compiled; the mixers are tiny, so throughput is unchanged.
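The exclusion mechanism can be pictured as a name-based skip list. The sketch below is hypothetical: only the list name `DISABLE_COMPILE_MODULES` and the two class names come from this PR, while the list's actual type and contents in `unsloth_zoo/compiler.py`, the helper `should_compile`, and the placeholder `ExampleMLP` are assumptions made for illustration.

```python
# Hypothetical sketch: only the list name and the two class names are from the PR.
DISABLE_COMPILE_MODULES = [
    "DeepseekV4HyperConnection",
    "DeepseekV4HyperHead",
]


def should_compile(module_class_name, disabled=DISABLE_COMPILE_MODULES):
    """Return False for modules that must stay eager (hypothetical helper)."""
    return module_class_name not in disabled


# "ExampleMLP" is a placeholder name standing in for any module left compiled.
plan = {name: should_compile(name) for name in ("DeepseekV4HyperHead", "ExampleMLP")}
```

With a list like this, keeping a module eager is a one-line change and everything not named stays on the compiled path.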

Testing

  • tiny-DeepseekV4: 15 finite steps with a fresh cache and default env, loss matching a fully uncompiled reference run to 0.0002.
  • tiny-DeepseekV3 (no hyper-connection modules): no regression.
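The two checks above (all steps finite, loss close to an uncompiled reference) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical and the numbers are made up for illustration, not measurements from this PR.

```python
import math


def first_nonfinite_step(values):
    """Return the 1-based step of the first inf/NaN value, or None if all are finite."""
    for step, v in enumerate(values, start=1):
        if not math.isfinite(v):
            return step
    return None


def max_loss_gap(run, reference):
    """Largest absolute per-step loss difference between two equally long runs."""
    if len(run) != len(reference):
        raise ValueError("runs must cover the same number of steps")
    return max(abs(a - b) for a, b in zip(run, reference))


# Made-up numbers for illustration only.
bad_grad_norms = [1.3, float("inf"), float("nan")]  # mimics the reported failure shape
compiled_loss = [2.3101, 2.1050, 1.9800]
eager_loss = [2.3100, 2.1051, 1.9799]

failing_step = first_nonfinite_step(bad_grad_norms)
gap = max_loss_gap(compiled_loss, eager_loss)
```

Checking `grad_norm` as well as the loss matters here, because the reported failure kept the forward loss finite while the gradients had already overflowed.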

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates unsloth_zoo/compiler.py to add DeepseekV4HyperConnection and DeepseekV4HyperHead to the list of modules that bypass Inductor compilation in favor of eager execution. This change prevents numerical overflow issues (overflowing to infinity) during the fused backward pass of their Sinkhorn-Knopp division chain at real gradient scales. There are no review comments, so we have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@danielhanchen

@codex review

@danielhanchen

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates unsloth_zoo/compiler.py to add DeepseekV4HyperConnection and DeepseekV4HyperHead to the list of modules that bypass compilation and run eagerly. This prevents PyTorch Inductor's fused backward pass for their Sinkhorn-Knopp division chain from overflowing to infinity at real gradient scales. Since these are tiny modules, running them in eager mode has negligible cost. There are no review comments, and I have no additional feedback to provide.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 07bb291709
