Skip to content

fix: clearer Studio setup error when GPU driver is too old for the installed CUDA toolkit#94

Open
danielhanchen wants to merge 10 commits into
mainfrom
pr-5993-head
Open

fix: clearer Studio setup error when GPU driver is too old for the installed CUDA toolkit#94
danielhanchen wants to merge 10 commits into
mainfrom
pr-5993-head

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging mirror of unslothai#5993

Original PR: unslothai#5993
Author: mvanhorn

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

When studio/setup.sh runs on a host whose installed CUDA toolkit (for example 13.3) is newer than the GPU driver can run, the installer now names the real blocker: the toolkit is supported, but the current driver is too old to run it. Both studio/setup.sh and studio/setup.ps1 emit the same guidance, pointing the user at a driver update or the prebuilt CUDA bundle.

Why this matters

Issue unslothai#5917 (with duplicate feedback in unslothai#5918) reported that a CUDA 13.3 toolkit paired with an older NVIDIA driver made the Studio installer behave as though 13.3 was unsupported, when Unsloth does support 13.3 and the actual gap was the driver. PR unslothai#5826 already taught setup.sh to cope with fresh toolkits, but the diagnostic still could not distinguish "toolkit too new for the project" from "driver too old for this toolkit." This compares the driver-reported CUDA capability from nvidia-smi against the installed toolkit version near the _nvcc_meets_llama_minimum check and surfaces the explanatory message only when the source build would otherwise abort, keeping the prebuilt-bundle path non-fatal.

Testing

Covered by the new cases in tests/studio/install/test_selection_logic.py: driver supports the toolkit (no message), driver too old (message names both versions), no nvidia-smi present (branch skipped), and unparseable nvidia-smi output (graceful fallback to the prior generic message). Full suite runs in CI.

Fixes unslothai#5917


This PR tracks the moving review branch (pr-5993-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.

Changed files:

  • .github/workflows/consolidated-tests-ci.yml
  • .github/workflows/lint-ci.yml
  • .github/workflows/lockfile-audit.yml
  • .github/workflows/mlx-ci.yml
  • .github/workflows/notebooks-ci.yml
  • .github/workflows/release-desktop.yml
  • .github/workflows/security-audit.yml
  • .github/workflows/stale.yml
  • .github/workflows/studio-api-smoke.yml
  • .github/workflows/studio-backend-ci.yml
  • .github/workflows/studio-frontend-ci.yml
  • .github/workflows/studio-inference-smoke.yml
  • .github/workflows/studio-load-orchestrator-ci.yml
  • .github/workflows/studio-mac-api-smoke.yml
  • .github/workflows/studio-mac-inference-smoke.yml
  • .github/workflows/studio-mac-install-matrix.yml
  • .github/workflows/studio-mac-ui-smoke.yml
  • .github/workflows/studio-mac-update-smoke.yml
  • .github/workflows/studio-tauri-smoke.yml
  • .github/workflows/studio-ui-smoke.yml
  • .github/workflows/studio-update-smoke.yml
  • .github/workflows/studio-windows-api-smoke.yml
  • .github/workflows/studio-windows-inference-smoke.yml
  • .github/workflows/studio-windows-ui-smoke.yml
  • .github/workflows/studio-windows-update-smoke.yml
  • .github/workflows/version-compat-ci.yml
  • .github/workflows/wheel-smoke.yml
  • studio/setup.ps1
  • studio/setup.sh
  • tests/studio/install/test_selection_logic.py

mvanhorn and others added 10 commits June 5, 2026 18:45
setup.ps1 now routes the too-new-toolkit case through
Write-CudaDriverToolkitMismatch instead of the old 'is installed but
INCOMPATIBLE' banner. Extract that helper alongside Resolve-CudaToolkit so the
child pwsh can run it, and assert the new driver-too-old guidance (and the
one-line source-build error) instead of the removed INCOMPATIBLE text.
…lkit/driver edge tests

setup.ps1: note that only a forced source build reaches the hard-exit branch
(the prebuilt path returned above), unlike setup.sh which degrades to CPU.
test_selection_logic.py: cover the CUDA UMD Version variant, the empty nvcc
version guard, and the too_old (< 12.4) short-circuit.
…ts before CPU fallback

The driver check now compares CUDA major versions only, per NVIDIA
minor-version compatibility, and when the selected nvcc is still too
new the setup iterates other installed toolkits and uses the newest
driver-compatible one before falling back to a CPU llama.cpp build.
Same rule mirrored in setup.ps1.
…llback

The major-only compatibility fix updated the side-by-side scan (Find-Nvcc
-MaxVersion) and the CUDA_PATH check, but the fallback that runs when
Find-Nvcc -MaxVersion returns null still recorded any plain Find-Nvcc result
as an incompatible toolkit without re-checking the major. A same-major
toolkit discoverable only via PATH, process CUDA_PATH, or a custom location
(e.g. toolkit 13.3 with a driver supporting CUDA 13.2) was therefore rejected
even though it is compatible.

Re-apply the same major-only rule in the fallback: use the toolkit when its
major is within the driver's, otherwise record it as too-new. Adds a
regression test covering the PATH-only same-major case.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13137021b8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@@ -586,7 +159,7 @@ jobs:
> Linux AppImage on Ubuntu 24.04+ may require: `sudo apt install libfuse2t64`
> First-run system dependency elevation is supported on Ubuntu/Debian. Other Linux distributions should install system packages manually.
releaseDraft: ${{ inputs.draft }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore desktop-latest publishing for updater releases

For non-draft desktop releases this workflow now only creates the versioned release; I searched .github/workflows and there is no remaining step that uploads desktop-latest/latest.json, while studio/src-tauri/tauri.conf.json still points the updater endpoint at .../releases/download/desktop-latest/latest.json. That means shipped desktop clients will not discover new releases even though the release workflow succeeds.

Useful? React with 👍 / 👎.

Comment on lines +85 to +86
const appVersion = match[1];
const response = await fetch(`https://pypi.org/pypi/unsloth/${appVersion}/json`);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Verify the backend version the desktop actually requires

This checks PyPI for the Tauri/Cargo app version, but the backend package version the desktop requires is MIN_DESKTOP_BACKEND_VERSION in studio/src-tauri/src/preflight/version.rs (currently 2026.5.3, while Cargo.toml is 2026.4.8). When those diverge, the release can pass after checking the wrong unsloth artifact and ship a desktop build whose required backend version was never published.

Useful? React with 👍 / 👎.

@@ -1,2289 +0,0 @@
# SPDX-License-Identifier: AGPL-3.0-only

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore PR and push CI coverage

This delete removes one of the PR/push CI workflows, and after checking the current .github/workflows directory only release-desktop.yml (workflow_dispatch) and stale.yml (schedule) remain, so future PRs and pushes no longer run lint, tests, smoke tests, lockfile audits, or security audits at all. That makes regressions in experiments and release-critical paths land without any automated validation.

Useful? React with 👍 / 👎.

Comment on lines 43 to 44

# ── Linux dependencies ──

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid persisting a write token into release build steps

With contents: write now granted at workflow scope, this checkout uses the default persisted credentials, so the repository token remains in the local git config while later npm install, Tauri, signing, and build steps execute dependency code. The previous workflow explicitly disabled credential persistence; keep that here so any compromised package script or build hook cannot read and exfiltrate a write-capable GitHub token from the release runner.

Useful? React with 👍 / 👎.

> First-run system dependency elevation is supported on Ubuntu/Debian. Other Linux distributions should install system packages manually.
releaseDraft: ${{ inputs.draft }}
prerelease: ${{ needs.prepare-version.outputs.prerelease }}
prerelease: false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve prerelease status for beta desktop builds

This hard-codes every desktop release as non-prerelease, whereas the removed preparation step derived this from versions like v0.1.39-beta. If the Cargo/Tauri version contains a prerelease suffix for a beta build, the workflow will still publish the GitHub release as a stable release, making beta artifacts indistinguishable from normal desktop releases.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the CUDA compatibility logic in both setup.ps1 and setup.sh to only enforce that the CUDA Toolkit's major version is less than or equal to the driver's supported major version, rather than checking minor versions. It also introduces robust fallback logic to search for alternative compatible toolkits and adds comprehensive unit tests. The review comments are highly constructive and should all be addressed: they suggest skipping the Bash-dependent tests on Windows to prevent failures, using the standard-compliant head -n 1 instead of head -1 in the shell script, and resolving bash dynamically from the system path instead of hardcoding /bin/bash to improve portability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

# ===========================================================================


class TestCudaDriverToolkitMismatchMessage:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since setup.sh and these test cases depend on bash, they will fail with FileNotFoundError when run on Windows systems. It is recommended to skip this test class on Windows using pytest.mark.skipif.

Suggested change
class TestCudaDriverToolkitMismatchMessage:
@pytest.mark.skipif(sys.platform == "win32", reason="Bash tests are not supported on Windows")
class TestCudaDriverToolkitMismatchMessage:

Comment thread studio/setup.sh
command -v nvidia-smi >/dev/null 2>&1 || return 0
nvidia-smi 2>/dev/null \
| sed -nE 's/.*CUDA( UMD)? Version:[[:space:]]*([0-9]+)\.([0-9]+).*/\2.\3/p' \
| head -1 || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Using head -1 is deprecated in POSIX and may not be supported on all platforms or could trigger warnings. It is more portable and standard-compliant to use head -n 1 instead.

Suggested change
| head -1 || true
| head -n 1 || true


def _run_bash(self, script, *, env = None):
proc = subprocess.run(
["/bin/bash", "-c", script],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Hardcoding "/bin/bash" can cause test failures on systems where bash is located elsewhere (e.g., /usr/local/bin/bash on BSD systems, or in the PATH on Windows/macOS). Using "bash" is more portable as it resolves the executable from the system PATH.

Suggested change
["/bin/bash", "-c", script],
["bash", "-c", script],

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the CUDA toolkit compatibility check in both the PowerShell (setup.ps1) and Bash (setup.sh) setup scripts to allow toolkits with a compatible major version (toolkit major <= driver major) rather than strictly enforcing minor version limits. It also introduces helper functions for mismatch diagnostics, fallback logic to find alternative compatible toolkits, and comprehensive unit tests. The review feedback suggests optimizing the Bash script by replacing subshell invocations of sed with pure Bash parameter expansion when splitting multi-line command outputs, which avoids process creation overhead inside loops.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread studio/setup.sh
Comment on lines +231 to +232
_status="$(printf '%s\n' "$_check" | sed -n '1p')"
_version="$(printf '%s\n' "$_check" | sed -n '2p')"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Spawning subshells and invoking external sed processes twice for every candidate in the loop is inefficient. Since _check is guaranteed to contain two lines separated by a newline, you can use pure Bash parameter expansion to split the status and version. This is significantly faster and avoids any process creation overhead.

Suggested change
_status="$(printf '%s\n' "$_check" | sed -n '1p')"
_version="$(printf '%s\n' "$_check" | sed -n '2p')"
_status="${_check%%$'\\n'*}"\n _version="${_check#*$'\\n'}"

Comment thread studio/setup.sh
Comment on lines +1273 to +1274
NVCC_PATH="$(printf '%s\n' "$_ALT_NVCC_CHECK" | sed -n '1p')"
_NVCC_VER="$(printf '%s\n' "$_ALT_NVCC_CHECK" | sed -n '2p')"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similarly, you can avoid spawning subshells and invoking sed twice here by using pure Bash parameter expansion to split the path and version from _ALT_NVCC_CHECK.

Suggested change
NVCC_PATH="$(printf '%s\n' "$_ALT_NVCC_CHECK" | sed -n '1p')"
_NVCC_VER="$(printf '%s\n' "$_ALT_NVCC_CHECK" | sed -n '2p')"
NVCC_PATH="${_ALT_NVCC_CHECK%%$'\\n'*}"\n _NVCC_VER="${_ALT_NVCC_CHECK#*$'\\n'}"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Unsloth Studio installer rejects CUDA 13.3 on Windows despite supported GPU/driver

2 participants