Studio: exclude mlx-lm 0.31.3 (broke gemma4/qwen3_5 QK-norm load on Apple Silicon)#6803
danielhanchen wants to merge 5 commits into
Conversation
mlx-lm 0.31.3 regressed the QK-norm archs: its strict load_weights rejects the q_norm/k_norm tensors with "Received N parameters not in model", so gemma4 and qwen3_5 checkpoints fail to load. Studio installs the MLX stack unpinned at latest, which pulls 0.31.3. Verified on a real macos-14 runner: gemma4 fails to load on 0.31.3 but loads and generates coherently on 0.31.2 and on git-main (future 0.31.4). See mlx-lm #1242. Exclude just that release (!=0.31.3) in the installer and the self-heal floor so --upgrade still resolves to the newest good build, and treat an already-installed 0.31.3 as unsatisfied so the self-heal replaces it.
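As a rough illustration of why a single `!=` exclusion keeps `--upgrade` useful, the sketch below uses the `packaging` library (assumed to be installed). The candidate lists and the helper name `newest_allowed` are hypothetical, 0.31.4 stands in for the future fixed release mentioned above, and this is not a model of how uv resolves internally.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version


def newest_allowed(candidates, spec):
    """Return the highest candidate version string that satisfies spec."""
    allowed = list(SpecifierSet(spec).filter(candidates))
    return max(allowed, key=Version) if allowed else None


# Hypothetical index contents, before and after a fixed release ships.
today = ["0.31.1", "0.31.2", "0.31.3"]
later = today + ["0.31.4"]

print(newest_allowed(today, "!=0.31.3"))   # newest good build today
print(newest_allowed(later, "!=0.31.3"))   # picks up the fix automatically
print(newest_allowed(later, "==0.31.2"))   # a hard pin would stay behind
```

With the exclusion, the selection moves forward as soon as a newer good release exists, whereas a hard pin would need another change to the installer.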
Code Review
This pull request excludes the regressed mlx-lm version 0.31.3 (which broke QK-norm architectures) from installation and triggers a self-heal repair if it is already installed. The reviewer suggests using packaging.version.Version objects instead of raw string matching when checking for bad installed versions to prevent fragility with semantically identical version strings.
    installed = _dist_version(name)
    if Version(installed) < Version(minimum):
        return False
    # A known-broken build (e.g. mlx-lm 0.31.3, QK-norm load) counts as
    # unsatisfied so the self-heal reinstalls a good one.
    if installed in _MLX_BAD_VERSIONS.get(name, ()):
        return False
Comparing package versions using raw string matching (installed in _MLX_BAD_VERSIONS.get(name, ())) can be fragile. Version strings can have different representations (such as trailing zeros, local build tags, or post-releases) that are semantically identical but syntactically different. Since packaging.version.Version is already imported and used in this function, it is safer and more robust to parse the installed version once and perform the comparison using Version objects.
    - installed = _dist_version(name)
    - if Version(installed) < Version(minimum):
    -     return False
    - # A known-broken build (e.g. mlx-lm 0.31.3, QK-norm load) counts as
    - # unsatisfied so the self-heal reinstalls a good one.
    - if installed in _MLX_BAD_VERSIONS.get(name, ()):
    -     return False
    + installed = _dist_version(name)
    + installed_version = Version(installed)
    + if installed_version < Version(minimum):
    +     return False
    + # A known-broken build (e.g. mlx-lm 0.31.3, QK-norm load) counts as
    + # unsatisfied so the self-heal reinstalls a good one.
    + if any(installed_version == Version(bad) for bad in _MLX_BAD_VERSIONS.get(name, ())):
    +     return False
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c36248e9a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
      "mlx",
      "mlx-metal",
    - "mlx-lm",
    + f"mlx-lm{MLX_LM_BAD_VERSION_EXCLUSION}",
Cover the fresh install path too
On the normal install.sh Apple Silicon install path, install.sh first runs uv pip install --python "$_VENV_PY" -- "$PACKAGE_NAME" (lines 2899-2900) and then invokes setup.sh with SKIP_STUDIO_BASE=1 (lines 3010 and 3042), so this guarded MLX stack step is skipped entirely. That means the new mlx-lm!=0.31.3 constraint only protects standalone unsloth studio update / non-skip-base runs; fresh installs can still resolve the transitive MLX dependency to the known-bad 0.31.3 and hit the same QK-norm load failure until the best-effort runtime autorepair succeeds. Please add the same exclusion to the install.sh unsloth install path or ensure this MLX repair step runs when skip_base is set on Apple Silicon.
Address PR review:
- Fresh install.sh (Apple Silicon) runs the base 'uv pip install unsloth' with SKIP_STUDIO_BASE=1, skipping the guarded MLX-stack step, so transitive resolution could still pull mlx-lm 0.31.3. install.sh already exports UV_OVERRIDE -> overrides-darwin-arm64.txt before that install, so exclude mlx-lm 0.31.3 there too; this also strengthens the self-heal (same override).
- Match the known-bad version with parsed packaging.Version so 0.31.3 == 0.31.3.0 (trailing-zero normalization) instead of raw string equality.
@codex review
Thanks, both addressed in aaee1b2:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aaee1b249e
    # resolution too -- the fresh install.sh path pulls mlx-lm via unsloth-zoo before
    # the guarded MLX-stack install step runs. 0.31.2 and >=0.31.4 are fine; see
    # mlx-lm #1242.
    mlx-lm!=0.31.3
Cover fresh curl installs when excluding mlx-lm
This override does not protect the fresh Apple Silicon installer path described in the comment: install.sh only exports UV_OVERRIDE when the repo-relative overrides file already exists before the initial uv pip install -- "$PACKAGE_NAME", which is not true for the common curl-piped/downloaded installer, and the later install_python_stack.py MLX step is skipped because install.sh sets SKIP_STUDIO_BASE=1. In that environment the transitive mlx-lm resolution can still pick 0.31.3 and leave Gemma/Qwen MLX loads broken until autorepair succeeds, so the exclusion needs to be generated/applied directly in the fresh install.sh install path too.
The overrides file only applies via UV_OVERRIDE when it exists relative to the script, which is not true for a curl-piped install, and the guarded MLX step in install_python_stack.py is skipped there (SKIP_STUDIO_BASE=1). So the base install could still resolve the transitive mlx-lm to the broken 0.31.3. Append mlx-lm!=0.31.3 to the base install on Apple Silicon (empty elsewhere), so the fresh path pins away from 0.31.3 without waiting for the runtime self-heal.
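The platform-conditional argument described in this commit message could be sketched in Python as follows. This is an illustration only: the real change lives in the shell installer, and `base_install_args`, its parameters, and the platform check are assumptions made for the example.

```python
import platform

MLX_LM_EXCLUSION = "mlx-lm!=0.31.3"


def base_install_args(package_name, venv_python, system=None, machine=None):
    """Build a `uv pip install` argv, adding the exclusion on Apple Silicon only."""
    system = system or platform.system()
    machine = machine or platform.machine()
    args = [
        "uv", "pip", "install",
        "--python", venv_python,
        "--upgrade-package", "unsloth",
        "--", package_name,
    ]
    # The extra requirement constrains the transitive mlx-lm resolution;
    # on other platforms nothing is appended.
    if system == "Darwin" and machine == "arm64":
        args.append(MLX_LM_EXCLUSION)
    return args


print(base_install_args("unsloth", "/path/to/venv/bin/python", "Darwin", "arm64"))
print(base_install_args("unsloth", "/path/to/venv/bin/python", "Linux", "x86_64"))
```

Passing the exclusion as an extra requirement on the command line does not depend on any file existing next to the script, which is the property the curl-piped path needs.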
Covered the fresh curl-install path in ce773ed:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce773ed3ff
      else
          run_install_cmd_retry "install unsloth" uv pip install --python "$_VENV_PY" \
    -         --upgrade-package unsloth -- "$PACKAGE_NAME"
    +         --upgrade-package unsloth -- "$PACKAGE_NAME" ${_MLX_LM_EXCLUDE_ARG:-}
Apply the MLX exclusion on migrations too
This appends mlx-lm!=0.31.3 only on the fresh non-migrated install path. In the legacy migration path above, a healthy $STUDIO_HOME/.venv is moved into $VENV_DIR, _MIGRATED=true, and the installer runs the separate install unsloth (migrated) command without this extra requirement before invoking setup with SKIP_STUDIO_BASE=1; for curl-piped Apple Silicon migrations there is also no repo-relative overrides file to export. That path can still resolve a missing/transitive mlx-lm to the known-bad 0.31.3 and leave QK-norm MLX models broken until runtime autorepair succeeds.
    # resolution too -- the fresh install.sh path pulls mlx-lm via unsloth-zoo before
    # the guarded MLX-stack install step runs. 0.31.2 and >=0.31.4 are fine; see
    # mlx-lm #1242.
    mlx-lm!=0.31.3
Preserve the MLX floor in the override
UV overrides are absolute replacements rather than additive constraints, so this entry replaces any transitive mlx-lm>=... requirement with only mlx-lm!=0.31.3. On Apple Silicon paths that set UV_OVERRIDE from this file, an already-installed or resolver-backtracked mlx-lm<0.22.0 can now satisfy the overridden requirement even though mlx_repair.py still treats that as below the usable MLX stack floor, leaving Train/Export chat-only until runtime repair runs. Please include the same floor here, e.g. mlx-lm>=0.22.0,!=0.31.3, so the override only removes the bad release.
…2.0 floor
The with-deps migrated install did not append ${_MLX_LM_EXCLUDE_ARG:-}, so a
curl-piped Apple Silicon migration (no repo overrides file, UV_OVERRIDE unset)
could resolve mlx-lm 0.31.3 transitively. Append the exclusion there, matching
the fresh install path. The no-torch migration is left alone since --no-deps
never resolves mlx-lm (same as the fresh no-torch path).
Also restore the >=0.22.0 floor in overrides-darwin-arm64.txt: a uv override
replaces the transitive constraint, so a bare !=0.31.3 could let the resolver
drop below the supported minimum that mlx_repair.py enforces at runtime.
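The effect of dropping the floor can be checked with the `packaging` library (assumed to be installed). This sketch only evaluates the two specifier strings against a few sample versions; it does not model uv's override mechanism itself, and the sample versions are chosen for illustration.

```python
from packaging.specifiers import SpecifierSet

bare = SpecifierSet("!=0.31.3")               # exclusion only
floored = SpecifierSet(">=0.22.0,!=0.31.3")   # exclusion plus the floor

# Columns: version, allowed by the bare exclusion, allowed with the floor.
for version in ("0.21.0", "0.31.2", "0.31.3"):
    print(version, bare.contains(version), floored.contains(version))
```

The bare exclusion admits 0.21.0, which sits below the 0.22.0 minimum, while the floored form rejects it; both reject 0.31.3 and accept 0.31.2.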
@codex review
Codex Review: Didn't find any major issues. Bravo. Reviewed commit:
@codex review
Codex Review: Didn't find any major issues. Breezy! Reviewed commit:
@codex review
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
@codex review
Codex Review: Didn't find any major issues. 🚀 Reviewed commit:
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
@codex review
Codex Review: Didn't find any major issues. Keep them coming! Reviewed commit:
@codex review
Codex Review: Didn't find any major issues. More of your lovely PRs please. Reviewed commit:
@codex review
Codex Review: Didn't find any major issues. Nice work! Reviewed commit:
@codex review
Codex Review: Didn't find any major issues. More of your lovely PRs please. Reviewed commit:
Problem
MLX loads of gemma4/qwen3_5 checkpoints (e.g. unsloth/gemma-4-E2B-it-UD-MLX-4bit) fail with a "Received N parameters not in model" error. These are QK-norm architectures (q_norm/k_norm). mlx-lm 0.31.3 regressed their strict load_weights, and Studio installs the MLX stack unpinned at latest, so it pulls the broken release.
Root cause (reproduced on real Apple Silicon)
A macos-14 sweep across mlx-lm builds isolates the failure (parameters not in model) to exactly mlx-lm 0.31.3; 0.31.2 and git-main both load and generate coherently. This matches mlx-lm #1242.
Fix
Exclude only the known-bad release (!=0.31.3) so --upgrade still resolves to the newest good build (0.31.2 today, >=0.31.4 later):
- install_python_stack.py: the Apple Silicon MLX install now requests mlx-lm!=0.31.3.
- mlx_repair.py: the self-heal floor excludes 0.31.3, and an already-installed 0.31.3 is treated as unsatisfied so the self-heal replaces it.
Validation
- macos-14 pinned job proves 0.31.2 loads and generates coherently for gemma4, qwen3_5 and the llama control.
- macos-14 job installing with the shipped constraint (--upgrade mlx mlx-metal "mlx-lm!=0.31.3" mlx-vlm) confirms it resolves to a good build and all three models load.
- pytest tests/test_mlx_repair.py (21 passing, 2 new): the install spec excludes 0.31.3, and an installed 0.31.3 triggers the self-heal.
GGUF equivalents of the same checkpoints load and generate fine, so the models are healthy; this is purely an mlx-lm arch/version gap.