[staging] torch-index override (mirror of unslothai/unsloth#6692) by danielhanchen · Pull Request #227 · danielhanchen/unsloth-staging-2

danielhanchen · 2026-06-26T11:27:06Z

Staging mirror of unslothai#6692 to exercise the cross-OS install matrix on real Windows / macOS / Linux runners.

Changes under test: UNSLOTH_TORCH_INDEX_URL / UNSLOTH_TORCH_INDEX_FAMILY override wired through install.sh, install.ps1, studio/setup.ps1, studio/install_python_stack.py, plus pinned ROCm/CUDA edge cases. No override set means the default detection path is unchanged.

Goal: confirm normal Studio install + smoke still passes on all three OSes, and the override + CUDA-spoof unit suites stay green.

…tection

get_torch_index_url (and the studio-update mirror _detect_cuda_torch_index_url) chose the torch wheel family solely by probing the host GPU, with no override. In a headless / container / CI build the host driver is visible via the /proc/driver/nvidia/gpus fallback but nvidia-smi cannot report a CUDA version, so the function fell back to its cu126 default and installed the wrong wheels (e.g. a cu128 image got cu126 torch).

Add an explicit override checked before any probing, in both the shell installer and the Python studio-update path:

- UNSLOTH_TORCH_INDEX_URL: full index URL, used verbatim (wins)
- UNSLOTH_TORCH_INDEX_FAMILY: family (cpu, cu128, rocm6.4, ...) appended to the mirror base (UNSLOTH_PYTORCH_MIRROR still honoured)

This matches how the published GPU images select CUDA -- vLLM and SGLang take the CUDA version from an explicit build ARG rather than detecting it, and the Unsloth Docker base image already pins the cu128 index directly.

Desktop installs are unchanged: with no override set, detection runs exactly as before.

Adds test_get_torch_index_url.sh cases for the override (family, full URL, precedence, mirror base, trailing-slash strip, empty-ignored).
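The precedence described above can be sketched in Python as follows. This is only an illustration, not the code from the PR: the function name resolve_pinned_index is hypothetical, the default base is the public PyTorch wheel index, and stripping slashes from the family value is an assumption based on the test cases listed in the commit message.

```python
import os

# Public PyTorch wheel index base, used when UNSLOTH_PYTORCH_MIRROR is unset.
DEFAULT_MIRROR = "https://download.pytorch.org/whl"


def resolve_pinned_index(env=None):
    """Hypothetical helper: return the pinned index URL, or None if unpinned."""
    env = os.environ if env is None else env

    # A full URL wins over everything else and is returned as given.
    url = env.get("UNSLOTH_TORCH_INDEX_URL", "").strip()
    if url:
        return url

    # Assumption: stray slashes around the family are stripped before joining.
    family = env.get("UNSLOTH_TORCH_INDEX_FAMILY", "").strip().strip("/")
    if not family:
        return None  # empty override is ignored; caller falls back to detection

    base = env.get("UNSLOTH_PYTORCH_MIRROR", "").strip() or DEFAULT_MIRROR
    return f"{base.rstrip('/')}/{family}"
```

A caller would run GPU detection only when this returns None, which keeps the unpinned desktop path unchanged.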

Address review feedback on the override added in this PR so a pinned index is honoured everywhere, not just in get_torch_index_url:

- Skip the WSL ROCm bootstrap (root privilege + large downloads, probes /dev/dxg) when UNSLOTH_TORCH_INDEX_URL / _FAMILY is set; it previously ran before the override was consulted.
- Skip the Radeon/Strix rerouting (which re-probes the GPU and overwrites the resolved URL with repo.radeon.com / repo.amd.com) when the index is pinned, so an explicit ROCm override (e.g. UNSLOTH_TORCH_INDEX_FAMILY=rocm6.4) is kept.
- install_python_stack.py: derive _TORCH_BACKEND from the override when UNSLOTH_TORCH_BACKEND is unset (standalone studio update), so _ensure_rocm_torch / _ensure_cuda_torch repair to the requested family instead of re-detecting.
- Strip ALL leading/trailing slashes in the shell override to match the Python side (avoids 404s on strict pip proxies).

Adds test cases for double-slash and leading/trailing-slash overrides.
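Deriving a backend from the pinned index could look roughly like the sketch below. The function name backend_from_index and the return values "cpu" / "cuda" / "rocm" are assumptions for illustration; the actual values stored in _TORCH_BACKEND are not shown in this PR page.

```python
def backend_from_index(index_url):
    """Hypothetical helper: map a pinned index URL (or bare family) to a backend."""
    # Take the last path segment, ignoring any leading/trailing slashes.
    leaf = index_url.strip("/").rsplit("/", 1)[-1].lower()

    if leaf == "cpu":
        return "cpu"
    if leaf.startswith("cu") and leaf[2:].isdigit():
        return "cuda"  # e.g. cu126, cu128
    if leaf.startswith(("rocm", "gfx")):
        return "rocm"  # e.g. rocm6.4, gfx1151
    return None  # unrecognized leaf: leave the decision to detection
```

Returning None for an unrecognized leaf keeps a custom mirror path from being misread as a backend name.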

Follow-up to the override work in this PR: the get_torch_index_url / install.sh reroute already respect a pinned UNSLOTH_TORCH_INDEX_URL / _FAMILY, but the Python repair helpers in install_python_stack.py still re-probed the GPU and could overwrite the pinned family. Make the pin authoritative there too:

- _ensure_cuda_torch: an explicit cu* pin commits to CUDA wheels, so repair a ROCm-poisoned venv even when no NVIDIA GPU is visible here (headless / container / CI cross-install), instead of bailing on the GPU-presence gate.
- _ensure_rocm_torch: skip the AMD per-gfx (Strix) reroute when a ROCm index is pinned, and in the generic reinstall path install from the pinned URL verbatim rather than re-detecting the host ROCm version. gfx*/rocm7.2 indexes serve torch 2.11+, so select the 2.11 package specs for a gfx leaf.
- install.sh: raise the torch constraint to 2.11 for */gfx* indexes too, matching rocm7.2, so a pinned full-URL/family override that returns early keeps a valid constraint.

Add _explicit_torch_index_url / _explicit_rocm_torch_index_url helpers and tests covering the no-GPU CUDA pin repair and the explicit gfx index honored verbatim.
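The constraint selection described above might be sketched as follows. The function name torch_spec_for_leaf and the exact spec strings are assumptions; the commit messages only state that gfx* and rocm7.2 indexes need a 2.11 floor and that the generic range is torch>=2.4,<2.11.

```python
import re


def torch_spec_for_leaf(leaf):
    """Hypothetical helper: choose a torch version constraint for an index leaf."""
    leaf = leaf.lower()

    # Per-architecture AMD indexes (gfx*) serve torch 2.11+.
    if leaf.startswith("gfx"):
        return "torch>=2.11"

    # rocmX.Y at or above 7.2 also needs the 2.11 floor.
    match = re.fullmatch(r"rocm(\d+)\.(\d+)(?:\.\d+)?", leaf)
    if match and (int(match.group(1)), int(match.group(2))) >= (7, 2):
        return "torch>=2.11"

    # Everything else keeps the generic range.
    return "torch>=2.4,<2.11"
```

Comparing the version as an integer tuple avoids the string-comparison trap where "7.10" would sort below "7.2".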

for more information, see https://pre-commit.ci

…ride

The pinned-index work landed for install.sh and install_python_stack.py, but the Windows installers still picked the wheel index from GPU probing. Extend the same UNSLOTH_TORCH_INDEX_URL / _FAMILY contract so a pinned index wins on every platform:

- install.ps1: Get-TorchIndexUrl returns the pinned URL/family before nvidia-smi probing; the AMD ROCm reroute is skipped when the index is pinned, so an explicit cpu/cu* pin on an AMD host is not overwritten.
- studio/setup.ps1: add shared Get-PinnedTorchIndexUrl / Get-TorchIndexLeaf helpers; the stale-venv check, the install selection and the AMD reroute all honor the pin, and the CPU/CUDA install pulls from the resolved index URL.
- tests: parity test that all four installers read both override vars and the two Windows installers gate the AMD reroute on the pinned flag.


Follow-ups to the override work flagged in review:

- install.ps1: a pinned gfx*/rocm>=7.2 index previously skipped the AMD reroute that sets the torch>=2.11 floor, so the generic install used torch>=2.4,<2.11 and could resolve the known-bad _grouped_mm wheel. Route a pinned ROCm index through the ROCm install path with the 2.11 floor + companions, and guard the companion-spec lookup so a skipped reroute block cannot null-deref.
- studio/setup.ps1: the stale-venv check compared the installed flavor (cuXXX/cpu, with +rocm misread as cpu) against the raw pinned leaf (gfx1151 / rocm6.4), so a correct pinned ROCm venv was always marked stale. Classify +rocm wheels as the generic 'rocm' flavor and normalize a pinned rocm*/gfx* leaf to 'rocm' before comparing (cu* stays specific so cu126-vs-cu128 still rebuilds).
- install_python_stack.py: _ensure_cuda_torch now also reinstalls from a pinned CUDA index when the venv carries a CPU wheel (headless CPU-venv-to-CUDA cross-install via 'studio update'), not only when it finds a ROCm build.
- tests: parity assertions already cover all four installers honoring the override.
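The stale-venv comparison fixed in studio/setup.ps1 can be illustrated in Python. This is a sketch of the normalization idea only, not a port of the PowerShell code: the function names are hypothetical, and treating a version string with no local tag as 'cpu' is an assumption.

```python
def installed_flavor(torch_version):
    """Classify an installed torch version string such as '2.6.0+cu126'."""
    local = torch_version.partition("+")[2].lower()
    if local.startswith("rocm"):
        return "rocm"  # any ROCm build maps to the generic 'rocm' flavor
    if local.startswith("cu"):
        return local  # CUDA stays specific: cu126 and cu128 differ
    return "cpu"  # assumption: '+cpu' or no local tag means a CPU wheel


def pinned_flavor(leaf):
    """Normalize a pinned index leaf so it is comparable to installed_flavor."""
    leaf = leaf.lower()
    if leaf.startswith(("rocm", "gfx")):
        return "rocm"
    return leaf


def venv_is_stale(torch_version, pinned_leaf):
    """True when the installed wheel flavor does not match the pinned index."""
    return installed_flavor(torch_version) != pinned_flavor(pinned_leaf)
```

With both sides normalized, a ROCm wheel installed from a gfx1151 index no longer looks stale, while a cu126 wheel under a cu128 pin still triggers a rebuild.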

Follow-ups to the previous round:

- studio/setup.ps1: a pinned gfx*/rocm>=7.2 index now routes through the ROCm install path with the 2.11 floor + companions (it previously fell through to the CUDA branch with bare torch/torchvision/torchaudio against the ROCm index). The CPU/CUDA fallback index is forced to the CPU wheel index when a ROCm index is active, so a failed pinned-ROCm install does not retry the ROCm mirror.
- studio/setup.ps1: the stale-venv check no longer treats an unrecognized pinned URL leaf (e.g. a PEP 503 mirror ending in /simple) as a torch flavor tag, which was marking a correct venv stale; cu*/cpu/rocm/gfx leaves are still compared.
- install.ps1: the post-failure CPU fallback uses an explicit CPU index instead of , which for a pinned ROCm index was the ROCm mirror itself (so the 'fallback' just retried the failing index and aborted the installer).
- install_python_stack.py: _ensure_cuda_torch now also reinstalls when the venv's CUDA family differs from a pinned one (installed cu126 vs pinned cu128), not only CPU->CUDA; the probe reports the installed cuXXX tag for the comparison.
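Taken together, the conditions under which _ensure_cuda_torch repairs a venv against a pinned CUDA index (ROCm build, CPU wheel, or a different cuXXX family) could be expressed as one predicate. The sketch below is an assumption about the shape of that check: the name needs_cuda_reinstall and the probe tags 'rocm' / 'cpu' / 'cuXXX' are hypothetical.

```python
def needs_cuda_reinstall(installed_tag, pinned_leaf):
    """Hypothetical predicate: should a pinned CUDA index trigger a reinstall?

    installed_tag is what a probe of the venv reports: 'rocm', 'cpu' or 'cuXXX'.
    pinned_leaf is the last path segment of the pinned index, e.g. 'cu128'.
    """
    pinned = pinned_leaf.lower()

    # Only an explicit cu* pin commits the venv to CUDA wheels.
    if not (pinned.startswith("cu") and pinned[2:].isdigit()):
        return False

    # ROCm build, CPU wheel, or a different CUDA family all need a repair.
    return installed_tag.lower() != pinned
```

Note that the predicate does not consult GPU presence, which mirrors the point that a cu* pin is honored even when no NVIDIA GPU is visible.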

gemini-code-assist

Code Review

This pull request introduces support for explicit PyTorch wheel index overrides via the UNSLOTH_TORCH_INDEX_URL and UNSLOTH_TORCH_INDEX_FAMILY environment variables across all installation scripts (install.sh, install.ps1, setup.ps1, and install_python_stack.py). This allows headless, container, and CI environments to bypass automatic GPU probing and Radeon/Strix rerouting. Feedback on the changes suggests using .ToLowerInvariant() instead of .ToLower() in install.ps1 to prevent potential locale-specific string comparison issues and ensure consistency with other scripts.


gemini-code-assist · 2026-06-26T11:29:29Z

+    # bug). Route a pinned ROCm index through the ROCm install path with the same
+    # 2.11 floor/companions the unpinned reroute derives from the gfx arch.
+    if ($TorchIndexPinned -and -not $ROCmIndexUrl -and -not $SkipTorch) {
+        $_pinLeaf = ($TorchIndexUrl.TrimEnd('/') -split '/')[-1].ToLower()


Use .ToLowerInvariant() instead of .ToLower() to prevent potential locale-specific issues (such as the Turkish 'I' bug) when parsing the wheel index URL leaf. This also ensures consistency with the implementation of Get-TorchIndexLeaf in setup.ps1.

$_pinLeaf = ($TorchIndexUrl.TrimEnd('/') -split '/')[-1].ToLowerInvariant()

…r window

The pinned-ROCm CPU fallback computes an explicit CPU index, but the comment explaining why it cannot reuse $TorchIndexUrl pushed the actual Invoke-InstallCommandRetry / --force-reinstall call more than 600 chars past the "ROCm PyTorch install failed" message, so test_pr5940_followups's window check no longer saw the retry helper. Move the CPU-index computation and its comment above the failure substep so the retrying force-reinstall stays adjacent to the message.

No behavior change: same explicit CPU index, same retry, same --force-reinstall.

danielhanchen and others added 9 commits June 26, 2026 04:59

[pre-commit.ci] auto fixes from pre-commit.com hooks

0c0e2cb


Merge remote-tracking branch 'origin/main' into cuda-torch-index-over…

5d69193

…ride

[pre-commit.ci] auto fixes from pre-commit.com hooks

5a017ec


gemini-code-assist Bot reviewed Jun 26, 2026

danielhanchen wants to merge 10 commits into cuda-override-base from cuda-torch-index-override
