Keep DeepSeek-V4 hyper-connection mixers eager to stop backward inf #859
danielhanchen wants to merge 2 commits into
Training DeepSeek-V4 with compiled modules produced inf grad_norm at step 2 and NaN weights by step 3 while the forward loss stayed finite. Bisecting the compiled cache per function isolated the manifold-constrained hyper-connection stream mixers (DeepseekV4HyperConnection / DeepseekV4HyperHead): their Sinkhorn-Knopp normalization chains twenty comb/(sum+eps) divisions with sigmoid and softmax mixing, and Inductor's fused backward of that chain overflows to inf at realistic gradient magnitudes (a random-init repro stays finite; the real model's gradient scale is required). Compiling everything except these two modules is stable. Add both classes to DISABLE_COMPILE_MODULES alongside the other numerically sensitive exclusions. RMSNorm, MLP, router, MLA attention, MoE, and the fused cross entropy stay compiled; the mixers are tiny, so throughput is unchanged. Validated on tiny-DeepseekV4: 15 finite steps with a fresh cache and default env, loss matching a fully uncompiled reference run to within 0.0002, and no regression on tiny-DeepseekV3.
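For readers unfamiliar with the normalization named above, here is a minimal pure-Python sketch of a Sinkhorn-Knopp loop built from comb/(sum+eps) steps. It is not the DeepseekV4HyperConnection code: the function name, the eps value, and the split of "twenty divisions" into ten row/column pairs are all assumptions made for illustration. It only shows the forward chain; the overflow reported in this PR happens in Inductor's fused backward and is not reproduced here.

```python
# Illustrative sketch only; names, eps, and the 10 x (row, column) split are assumptions.
EPS = 1e-6  # assumed value; the PR does not state the eps used by the model


def sinkhorn_knopp(comb, pairs=10, eps=EPS):
    """Alternately normalize rows and columns of a positive matrix.

    Each pass is one comb / (sum + eps) division, so pairs=10 chains
    twenty division passes in total.
    """
    comb = [list(row) for row in comb]  # copy so the caller's data is untouched
    n_rows, n_cols = len(comb), len(comb[0])
    for _ in range(pairs):
        # Row step: divide every entry by its row sum (plus eps).
        row_sums = [sum(row) for row in comb]
        comb = [[v / (row_sums[i] + eps) for v in comb[i]] for i in range(n_rows)]
        # Column step: divide every entry by its column sum (plus eps).
        col_sums = [sum(comb[i][j] for i in range(n_rows)) for j in range(n_cols)]
        comb = [[comb[i][j] / (col_sums[j] + eps) for j in range(n_cols)]
                for i in range(n_rows)]
    return comb


if __name__ == "__main__":
    mixed = sinkhorn_knopp([[1.0, 2.0], [3.0, 4.0]])
    for row in mixed:
        print(row, "row sum:", sum(row))
```

Differentiating x/(s+eps) yields terms scaled by 1/(s+eps) and x/(s+eps)^2, so a backward pass through twenty chained divisions multiplies many such factors together. That is one plausible reason the chain is sensitive to how the backward is fused; the PR itself reports only the observed overflow.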
Code Review
This pull request updates unsloth_zoo/compiler.py to add DeepseekV4HyperConnection and DeepseekV4HyperHead to the list of modules that bypass Inductor compilation and run in eager mode. This prevents the fused backward pass of their Sinkhorn-Knopp division chain from overflowing to inf at real gradient scales. There are no review comments, so we have no feedback to provide.
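As a rough illustration of what a name-based compile exclusion can look like, the sketch below filters module class names against a disable list. It is hypothetical: the real contents of DISABLE_COMPILE_MODULES and the way unsloth_zoo/compiler.py matches against it are not shown in this PR, and should_compile, split_modules, and the name DeepseekV4RMSNorm are invented for the example.

```python
# Hypothetical sketch; not the code in unsloth_zoo/compiler.py.
# Only the two class names below come from the PR; the rest is assumed.
DISABLED_SKETCH = (
    "DeepseekV4HyperConnection",
    "DeepseekV4HyperHead",
)


def should_compile(class_name, disabled=DISABLED_SKETCH):
    """Return True if a module class may be compiled, False if it stays eager."""
    return class_name not in disabled


def split_modules(class_names, disabled=DISABLED_SKETCH):
    """Partition class names into (compiled, eager) lists, preserving order."""
    compiled = [n for n in class_names if should_compile(n, disabled)]
    eager = [n for n in class_names if not should_compile(n, disabled)]
    return compiled, eager


if __name__ == "__main__":
    # "DeepseekV4RMSNorm" is an assumed class name used only for this demo.
    print(split_modules(["DeepseekV4RMSNorm", "DeepseekV4HyperHead"]))
```

Matching on exact class names keeps the exclusion narrow, which fits the PR's goal of leaving every other module compiled.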
@codex review
/gemini review
Code Review
This pull request updates unsloth_zoo/compiler.py to add DeepseekV4HyperConnection and DeepseekV4HyperHead to the list of modules that bypass Inductor compilation. This prevents PyTorch Inductor's fused backward pass for their Sinkhorn-Knopp division chain from overflowing to infinity at real gradient scales. Since these are tiny modules, running them in eager mode has negligible cost. There are no review comments, and I have no additional feedback to provide.
Codex Review: Didn't find any major issues.
Summary
Training DeepSeek-V4 with compiled modules produced inf grad_norm at step 2 and NaN weights by step 3 while the forward loss stayed finite. Bisecting the compiled cache per function isolated the manifold-constrained hyper-connection stream mixers (DeepseekV4HyperConnection / DeepseekV4HyperHead): their Sinkhorn-Knopp normalization chains twenty comb/(sum+eps) divisions with sigmoid and softmax mixing, and Inductor's fused backward of that chain overflows to inf at realistic gradient magnitudes. A random-init repro stays finite; the real model's gradient scale is required. Compiling everything except these two modules is stable.
What this does
Adds both classes to DISABLE_COMPILE_MODULES alongside the other numerically sensitive exclusions. RMSNorm, MLP, router, MLA attention, MoE, and the fused cross entropy stay compiled; the mixers are tiny, so throughput is unchanged.
Testing
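The validation criteria stated in this PR (every step finite, loss within 0.0002 of a fully uncompiled reference run) could be checked with a small helper like the one below. This is a sketch, not the script used for the PR: check_run and its arguments are hypothetical, the per-step values are assumed to have been logged as plain floats, and "matching to 0.0002" is read here as a maximum absolute difference.

```python
import math


def check_run(grad_norms, losses, ref_losses, tol=2e-4):
    """Compare a compiled run against an eager reference run.

    grad_norms, losses: per-step floats logged from the compiled run.
    ref_losses: per-step losses from the uncompiled reference run.
    tol: assumed reading of "matching to 0.0002" as max absolute difference.
    """
    if len(losses) != len(ref_losses):
        raise ValueError("compiled and reference runs must have the same number of steps")
    # Index (0-based) of the first step whose grad norm or loss is inf or NaN.
    first_bad = next(
        (i for i, (g, l) in enumerate(zip(grad_norms, losses))
         if not (math.isfinite(g) and math.isfinite(l))),
        None,
    )
    max_diff = max((abs(a - b) for a, b in zip(losses, ref_losses)), default=0.0)
    return {
        "first_nonfinite_step": first_bad,
        "max_loss_diff": max_diff,
        "ok": first_bad is None and max_diff <= tol,
    }


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    print(check_run([1.2, 0.9], [2.5000, 2.4000], [2.5001, 2.4001]))
```

Reporting the first non-finite step, rather than only a pass/fail flag, makes it easier to spot a failure pattern like the one described above, where grad_norm goes to inf one step before the weights turn NaN.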