Add Unsloth Docker images (base + Studio) for any NVIDIA GPU host, Ampere through Blackwell #5748
danielhanchen wants to merge 101 commits into
Adds a multi-stage Dockerfile producing an image that works on Ampere through Blackwell (sm_80 through sm_120: A100, RTX 30/40, H100, B100/B200, RTX 50-series, RTX PRO 6000 Blackwell). The build itself requires no GPU at all and runs on a free GitHub-hosted ubuntu-latest runner.
How the GPU-less build works:
1. cu128 PyTorch wheels are fat binaries. torch._C._cuda_getArchFlags() returns 'sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120' regardless of which host builds the image, because the wheels are cross-compiled upstream by the PyTorch team.
2. All deps resolve in a single uv pip install pass with explicit pins (torch==2.10.0, --extra-index-url cu128, no --torch-backend=auto, no install.sh). This prevents the silent CUDA-version cascade where bitsandbytes' transitive cuda-toolkit==13 dep upgrades torch to 2.12+cu130 in a later resolver pass, leaving xformers and other cu128 wheels stranded.
3. Build-time verification uses package metadata (importlib.metadata.version) and the raw torch._C._cuda_getArchFlags() accessor. We deliberately avoid import unsloth at build time because unsloth.__init__ calls torch.cuda.get_device_properties(0), which requires an actual CUDA device and is not bypassable. Import-time correctness is exercised at deploy time by smoke_test.py with --gpus all.
4. UNSLOTH_COMPILE_DISABLE=1 and CUDA_VISIBLE_DEVICES="" during the build stage prevent any code path from JIT-compiling kernels for the build host's compute capability and baking the resulting cache into the image. The deploy GPU produces its own cache on first use.
Other notes:
- --index-strategy unsafe-best-match is needed because the PyTorch wheel index serves an old requests==2.28.1 that conflicts with datasets>=2.32.2, which the default first-index-wins strategy rejects.
- Extra is cu128-ampere-torch2100 (ampere precedes the torch version in the pyproject ordering).
- No flash-attn in the base image. FA3 is hard-refused on Blackwell upstream and unsloth gracefully falls back to xformers + SDPA. Users on Ampere / Ada / Hopper who want FA2 can pip install flash-attn on top.
- Two stages: nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 for the build, -cudnn-runtime for the deploy image. No nvcc in the published image.
- A lockfile is emitted at /opt/unsloth-venv/requirements.lock.txt inside the image and can be extracted with docker/freeze.sh for byte-identical rebuilds even after PyPI moves on.
CI workflow .github/workflows/docker-publish.yml:
- Builds on ubuntu-latest on every push to main, every tag, weekly via cron, and manually via workflow_dispatch. Pushes to docker.io/unsloth/unsloth with cache via type=gha.
- Optional smoke-test job runs on a self-hosted GPU runner if vars.HAS_GPU_RUNNER is set; skipped otherwise. End-to-end verification on sm_120 hardware is a nice-to-have, not a publish blocker.
Validation:
- Install path validated on a B200 host with CUDA_VISIBLE_DEVICES="" set (simulating the GPU-less CI runner): torch 2.10.0+cu128 holds, xformers 0.0.34, bitsandbytes 0.49.2, triton 3.6.0, transformers 5.5.0, trl 0.24.0, peft 0.19.1, accelerate 1.13.0. Arch flags include sm_100 and sm_120.
- Runtime path validated end-to-end on B200: smoke_test.py imports unsloth, loads Llama-3.2-1B-Instruct-bnb-4bit in 4-bit, completes 5 LoRA steps with loss decreasing 4.11 -> 3.75. xformers fallback active as designed.
Files:
- docker/Dockerfile                      multi-stage cu128 build
- docker/build.sh                        local build wrapper
- docker/freeze.sh                       extract lockfile from a built image
- docker/smoke_test.py                   runtime verification, run with --gpus all
- docker/.dockerignore
- .github/workflows/docker-publish.yml
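The build-time verification in step 3 could look roughly like the sketch below. It is an illustration only, not the PR's actual check: the helper names (pin_mismatches, missing_arches, verify_build) and the two example pins are assumptions, and only importlib.metadata.version and torch._C._cuda_getArchFlags() come from the description above.

```python
"""Sketch of a GPU-less build check: package metadata plus arch flags."""
from importlib import metadata

# Example pins copied from the PR description; the constant names are hypothetical.
EXPECTED_PINS = {"torch": "2.10.0", "xformers": "0.0.34"}
REQUIRED_ARCHES = ("sm_100", "sm_120")


def pin_mismatches(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every pin that does not hold."""
    bad = {}
    for name, want in expected.items():
        try:
            have = get_version(name)
        except metadata.PackageNotFoundError:
            bad[name] = None
            continue
        # Drop the local version suffix, e.g. "2.10.0+cu128" -> "2.10.0".
        if have.split("+", 1)[0] != want:
            bad[name] = have
    return bad


def missing_arches(arch_flags, required=REQUIRED_ARCHES):
    """Return the required sm_XX entries absent from a space-separated flag string."""
    present = set(arch_flags.split())
    return [arch for arch in required if arch not in present]


def verify_build():
    """Run both checks; imports torch lazily and never imports unsloth."""
    import torch

    problems = pin_mismatches(EXPECTED_PINS)
    absent = missing_arches(torch._C._cuda_getArchFlags())
    if problems or absent:
        raise SystemExit(f"pin mismatches: {problems}; missing arches: {absent}")
```

Calling verify_build() in a Dockerfile RUN step would fail the build as soon as a later resolver pass moves torch off 2.10.0 or the wheel stops carrying Blackwell SASS.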
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6d92160f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    docker pull ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
    docker run --rm --gpus all \
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest \
Smoke-test the image built in this run, not latest
The smoke-test job always pulls :latest, but this workflow also runs on tag pushes where latest is not guaranteed to be among the tags produced by metadata-action (it is only enabled on the default branch in this workflow). In that case, the smoke test can validate an older image and miss regressions in the freshly built tag from the current run.
    UNSLOTH_REF=${{ github.event.inputs.unsloth_ref || 'main' }}
    UNSLOTH_ZOO_REF=${{ github.event.inputs.unsloth_zoo_ref || 'main' }}
Build from triggering ref instead of hardcoding main
For non-workflow_dispatch events (including tag pushes), github.event.inputs.* is unset, so these build args always resolve to main. That means images produced for v* tags can contain unsloth and unsloth-zoo code from main rather than the release ref that triggered the run, which breaks release correctness and reproducibility.
Code Review
This pull request introduces a Dockerized environment for Unsloth and unsloth-zoo, specifically optimized for NVIDIA Blackwell GPUs (sm_100 and sm_120). The changes include a multi-stage Dockerfile, build and freeze scripts, and a comprehensive smoke test to verify GPU compatibility and training functionality. Review feedback suggests optimizing the Dockerfile by removing a redundant installation of the uv tool, correcting a version mismatch for torchaudio to ensure consistency with the PyTorch stack, and relocating cache directories outside of the workspace to prevent issues when mounting host volumes at runtime.
    RUN ${VENV}/bin/pip install uv \
     && ${VENV}/bin/uv pip install \
        --python ${VENV}/bin/python \
The uv tool is already installed at the system level in line 63. Installing it again inside the virtual environment at line 99 is redundant. Using the system-wide uv binary to install packages into the venv is more efficient and avoids unnecessary layers.
Suggested change:
    RUN uv pip install \
        --python ${VENV}/bin/python \
        --python ${VENV}/bin/python \
        --index-strategy unsafe-best-match \
        --extra-index-url https://download.pytorch.org/whl/cu128 \
        "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.11.0" \
There appears to be a version mismatch for torchaudio. While torch is pinned to 2.10.0 and torchvision to 0.25.0 (which correctly follows the standard 0.(Y+15) mapping for Torch 2.10), torchaudio is set to 2.11.0. Typically, PyTorch and Torchaudio versions are released in sync (e.g., Torch 2.6.0 with Torchaudio 2.6.0). Using 2.10.0 ensures consistency across the stack.
Suggested change:
        "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.10.0" \
        HF_HOME=/workspace/.cache/huggingface \
        TRITON_CACHE_DIR=/workspace/.cache/triton \
Setting HF_HOME and TRITON_CACHE_DIR to subdirectories of /workspace (the WORKDIR) can lead to issues when users mount a host directory to /workspace at runtime. The mount will obscure the directories created during the build, forcing the application to recreate them at runtime, which can cause permission issues or redundant downloads. Moving these caches to a location outside of the workspace, such as /opt/cache, avoids these issues.
Suggested change:
        HF_HOME=/opt/cache/huggingface \
        TRITON_CACHE_DIR=/opt/cache/triton \
    COPY --from=builder /opt/unsloth-venv /opt/unsloth-venv

    WORKDIR /workspace
    RUN mkdir -p ${HF_HOME} ${TRITON_CACHE_DIR}
When someone launches the unsloth container, the common failure modes are not unsloth bugs -- they're Docker / nvidia-container-toolkit / driver issues that surface as cryptic CUDA errors deep in torch. The entrypoint catches the three that cover ~95% of "it doesn't work" reports up front:
1. nvidia-smi inside the container sees no GPU
   -> user forgot --gpus all, or host is missing nvidia-container-toolkit
   -> entrypoint prints the exact docker run flag and the toolkit install URL
2. nvidia-smi works but torch.cuda.is_available() is False
   -> host driver is older than CUDA 12.8 supports
   -> entrypoint prints the minimum driver version per architecture
3. compute capability < sm_80
   -> entrypoint prints the supported architecture table and exits
Each check fails with a clear, actionable message rather than a stack trace. Set UNSLOTH_SKIP_GPU_CHECK=1 to bypass (for docs builds, offline tooling, CI).
run.sh wraps `docker run` with the flags people most often forget:
  --gpus all            (without it, the new entrypoint refuses to start)
  --ipc=host            (DataLoader workers need >64MB shm)
  --ulimit memlock=-1   (NCCL + CUDA pinned host buffers)
  --ulimit stack=64MB   (some torch kernels OOM the default 8MB stack)
Plus it mounts the host HF cache + Triton JIT cache so model downloads and compiled kernels persist across container runs, and forwards HF_TOKEN / WANDB_API_KEY / UNSLOTH_LICENSE only when they are set on the host.
Usage:
  bash docker/run.sh                                   # interactive python REPL
  bash docker/run.sh bash                              # shell in container
  bash docker/run.sh python /workspace/smoke_test.py
  bash docker/run.sh python /workspace/host/train.py   # $PWD mounted at /workspace/host
Verified locally:
- No GPU visible: entrypoint refuses with driver-version message, exit 1
- B200 sm_100 visible: entrypoint prints GPU banner, exits cleanly into the user command (rc=0)
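The three entrypoint checks form a simple decision ladder. The sketch below shows that ordering in Python; the function names and message wording are hypothetical, and the real entrypoint is a shell script whose exact output is not reproduced here.

```python
def should_skip(env):
    """UNSLOTH_SKIP_GPU_CHECK=1 bypasses every check (docs builds, offline tooling, CI)."""
    return env.get("UNSLOTH_SKIP_GPU_CHECK") == "1"


def diagnose(gpu_visible, cuda_available, capability, min_capability=(8, 0)):
    """Return (ok, message) for the three pre-flight cases, checked in order.

    capability is a (major, minor) tuple such as (10, 0) for sm_100.
    """
    if not gpu_visible:
        return False, "No GPU visible: pass --gpus all and install nvidia-container-toolkit on the host."
    if not cuda_available:
        return False, "GPU visible but CUDA unusable: the host driver is likely too old for CUDA 12.8."
    major, minor = capability
    if capability < min_capability:
        return False, f"sm_{major}{minor} is below the supported minimum."
    return True, f"GPU OK (sm_{major}{minor})"
```

Ordering matters: a missing GPU also makes CUDA unavailable, so reporting the first failing rung keeps the message specific to the actual cause.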
💡 Codex Review
Reviewed commit: 58693c4c73
    [[ -n "${HF_TOKEN:-}" ]] && ENV_FORWARD+=(-e "HF_TOKEN=${HF_TOKEN}")
    [[ -n "${WANDB_API_KEY:-}" ]] && ENV_FORWARD+=(-e "WANDB_API_KEY=${WANDB_API_KEY}")
    [[ -n "${UNSLOTH_LICENSE:-}" ]] && ENV_FORWARD+=(-e "UNSLOTH_LICENSE=${UNSLOTH_LICENSE}")

    set -x
Remove xtrace before invoking docker run with secrets
This script conditionally forwards HF_TOKEN, WANDB_API_KEY, and UNSLOTH_LICENSE, then enables set -x right before docker run, which prints the fully expanded command line. In any environment where those variables are set (local terminals with history/log capture or CI logs), their raw values are exposed in plaintext, creating an avoidable credential leak.
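One way to keep a readable trace of the command without leaking credentials is to log a redacted copy of the argument vector. The sketch below illustrates the idea in Python; env_forward_args and redacted are hypothetical helpers, not part of run.sh.

```python
import shlex

# The three variables run.sh forwards, per the review comment above.
FORWARDED = ("HF_TOKEN", "WANDB_API_KEY", "UNSLOTH_LICENSE")


def env_forward_args(env):
    """Build `-e NAME=value` pairs only for variables that are set and non-empty."""
    args = []
    for name in FORWARDED:
        value = env.get(name, "")
        if value:
            args += ["-e", f"{name}={value}"]
    return args


def redacted(argv):
    """Render argv for logging with forwarded secret values replaced by ***."""
    shown = []
    for arg in argv:
        name, sep, _ = arg.partition("=")
        shown.append(f"{name}=***" if sep and name in FORWARDED else arg)
    return shlex.join(shown)
```

A simpler alternative is to pass `-e HF_TOKEN` with no value, in which case docker run copies the variable from the calling environment and the secret never appears on the command line at all.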
    import transformers

    print(f"transformers {transformers.__version__}")
Import unsloth before transformers in smoke test
The default execution path runs training (check_tiny_train) but check_imports imports transformers first, which contradicts the later requirement that Unsloth be imported first for patched training behavior. This means the smoke test is not validating the intended Unsloth training path and can produce misleading pass/fail results (e.g., patches not applied or warnings/fallback paths during the LoRA step).
Single bash script that runs the full validation flow against the image:
1. Host pre-flight: docker version, nvidia-smi, nvidia-container-toolkit
runtime registered with docker.
2. Build the image (auto-detects the build context -- current dir,
docker/ subdir, or clones the docker-blackwell-build branch into
/tmp/unsloth-pr/).
3a. Smoke test: 5-step LoRA on Llama-3.2-1B-Instruct-bnb-4bit.
3b. Real workload: gpt-oss-20B fine-tuning notebook from
unslothai/notebooks, patched to max_steps=10, with the three
pre-train demo generations dropped for brevity. Auto-installs
triton_kernels at the SHA the upstream notebook pins for MXFP4.
All output is teed to /tmp/unsloth-docker-test/ (or --log-dir).
Usage:
bash docker/test_locally.sh # full run, ~15 min
bash docker/test_locally.sh --skip-notebook # blocks 1-3a only, ~3 min
bash docker/test_locally.sh --skip-build # reuse existing TAG
TAG=my:tag HF_TOKEN=hf_xxx bash docker/test_locally.sh
Each block fails fast with the exact log path to paste back.
💡 Codex Review
Reviewed commit: acbb16c8a1
        --python ${VENV}/bin/python \
        --index-strategy unsafe-best-match \
        --extra-index-url https://download.pytorch.org/whl/cu128 \
        "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.11.0" \
Align torchaudio pin with torch 2.10 stack
The install set mixes torch==2.10.0 with torchaudio==2.11.0, but TorchAudio wheels are release-coupled to specific Torch versions; this can make dependency resolution fail or force an unexpected Torch change, which breaks the Docker build’s stated guarantee that Torch stays on 2.10.0. This is especially risky here because the same layer also installs many transitive deps from multiple indexes, so one incompatible pin can fail the image build on CI.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
      err "nvidia-smi not found inside the container."
      err "The CUDA runtime in this image is broken. Re-pull the image."
      exit 1
Avoid hard-failing when nvidia-smi binary is unavailable
Startup currently exits before any CUDA check if nvidia-smi is missing, but there are valid GPU runtimes (for example compute-only capability profiles) where CUDA is usable while nvidia-smi/NVML tools are not mounted. In those environments the container will refuse to start even though torch.cuda.is_available() could succeed, causing false negatives in production deployments that intentionally limit driver capabilities.
The Dockerfile uses BuildKit-only features (the # syntax=docker/dockerfile:1.7 parser directive and RUN ... <<'PY' heredocs added in dockerfile 1.3+). The legacy builder rejects the --progress flag at the CLI level and would fail later at the heredocs anyway.
Detect docker buildx and use it when available (preserves --progress=plain output). Otherwise fall back to plain `docker build` with DOCKER_BUILDKIT=1 exported, which gets the BuildKit features without buildx's nicer formatting.
Reproduces the failure path seen on Docker 28.2.2 without buildx installed:
  unknown flag: --progress
  ERROR docker build exited 125
Docker 28 removed the legacy image builder entirely. Setting
DOCKER_BUILDKIT=1 no longer falls back to a builtin builder -- it
delegates to buildx, which then errors out if buildx isn't installed:
ERROR: BuildKit is enabled but the buildx component is missing
or broken.
The Ubuntu docker.io package omits buildx by default, so users on
that path hit this immediately. Detect missing buildx up front and
print exact install commands for apt / dnf / manual binary instead
of attempting a fallback that cannot work.
If the user is not in the 'docker' group, every docker command after the pre-flight returns "permission denied while trying to connect to the Docker daemon socket at /var/run/docker.sock". This used to surface as a confusing buildx failure mid-Block-2, but the actual problem is a host permissions issue that's settable up front.
Detect by running 'docker info' and checking its exit code (not just grep on its output -- a permission failure prints to stderr and returns non-zero, so the old grep-based check was a silent skip).
Also clarify the nvidia-runtime WARN: on Docker 28+ with CDI mode this is a false positive most of the time. The real GPU-attach test is the smoke run in Block 3a, where the container entrypoint catches missing GPUs with an actionable message.
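The exit-code approach described above can be sketched as follows. This is an illustrative Python version of the pre-flight; the function name is hypothetical and the actual check in test_locally.sh is written in bash.

```python
import subprocess


def docker_daemon_reachable(run=subprocess.run):
    """Return (ok, reason). Trusts the exit code of `docker info`, not its stdout.

    A permission failure writes to stderr and exits non-zero, so grepping
    stdout alone would silently pass over it.
    """
    try:
        result = run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
        )
    except FileNotFoundError:
        return False, "docker CLI not found on PATH"
    if result.returncode != 0:
        reason = (result.stderr or "").strip()
        return False, reason or f"docker info exited {result.returncode}"
    return True, ""
```

The run parameter exists only so the check can be exercised without a Docker daemon; in normal use the default subprocess.run is called.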
💡 Codex Review
Reviewed commit: 23a5b43180
    cd /workspace/host

    echo "=== install triton_kernels (MXFP4 support for unsloth/gpt-oss-20b) ==="
    pip install -q 'git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels' 2>&1 | tail -5
Propagate pip install failures through the tail pipeline
The generated run_notebook.sh script runs with set -e but not pipefail, and this line pipes pip install into tail -5; in Bash that means the pipeline exits with tail's status, so a failed pip install can be treated as success and the script continues until later steps fail with misleading errors (for example missing triton_kernels imports). This causes false-positive notebook validation and makes debugging CI/local failures much harder whenever installation fails (network hiccups, dependency conflicts, or bad commit hash).
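The behaviour the reviewer asks for (show only the tail of the output, but still fail when the producer fails) can be illustrated outside bash. The sketch below is a hypothetical Python helper, not code from the PR.

```python
import subprocess
import sys


def run_and_tail(cmd, lines=5):
    """Run cmd, print only its last `lines` output lines, and raise if cmd failed.

    Unlike `cmd | tail -5` without pipefail, the exit status checked here is
    the command's own, not the status of the process doing the trimming.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    tail = proc.stdout.splitlines()[-lines:]
    print("\n".join(tail))
    proc.check_returncode()  # raises subprocess.CalledProcessError on non-zero exit
    return tail


def demo():
    """Example: a command that prints eight lines and succeeds."""
    return run_and_tail([sys.executable, "-c", "for i in range(8): print(i)"])
```

In the generated bash script itself, the equivalent fix is `set -o pipefail` alongside `set -e`.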
Ubuntu 24.04 (noble) marks the system Python interpreter as externally-managed per PEP 668, so:
  curl get-pip.py | python
  python -m pip install -U pip uv
fails inside the builder image with:
  error: externally-managed-environment
  This environment is externally managed
The system-level pip and uv were never used: the very next RUN creates the venv at /opt/unsloth-venv, which bootstraps its own pip via the ensurepip module (provided by the python3.12-venv apt package). uv is then installed INTO the venv with the venv's pip, and used from there.
Drop the two system-pip bootstrap lines. The venv path is unchanged.
Reproduces on any Docker build of the unsloth-blackwell image against a noble base image (which our nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 is).
… / peft
unsloth_zoo/__init__.py guards against being imported standalone:
if "UNSLOTH_IS_PRESENT" not in os.environ:
raise ImportError("Please install Unsloth via `pip install unsloth`!")
The env var is set by unsloth/__init__.py at import time, so importing
unsloth must happen first. The old check_imports() imported xformers,
bnb, transformers, trl, peft, then unsloth_zoo -- which fired the guard
because unsloth had not been imported yet.
Reorder check_imports() to import unsloth (and unsloth_zoo) first, then
the rest. check_unsloth_import() becomes a thin re-import to keep the
"FastLanguageModel reachable" banner in the output.
Same fix the unsloth README has been recommending for years: "import
unsloth at the top of your file, before transformers/trl/peft."
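The ordering constraint can be shown with a small simulation. The guard below mirrors the two quoted lines from unsloth_zoo/__init__.py; the function names and the reordering helper are hypothetical and do not reproduce smoke_test.py.

```python
def zoo_guard(env):
    """Mimic the quoted guard: refuse to load unless the marker variable is set."""
    if "UNSLOTH_IS_PRESENT" not in env:
        raise ImportError("Please install Unsloth via `pip install unsloth`!")


def unsloth_first(modules):
    """Reorder module names so unsloth and unsloth_zoo precede everything else.

    sorted() is stable, so the remaining modules keep their relative order.
    """
    priority = {"unsloth": 0, "unsloth_zoo": 1}
    return sorted(modules, key=lambda name: priority.get(name, 2))
```

Feeding the old import order through unsloth_first yields unsloth, unsloth_zoo, then xformers, bitsandbytes, transformers, trl and peft, which is the order the fixed check_imports() is described as using.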
Triton's nvidia backend lazily JIT-compiles a small C extension
(CudaUtils, in triton/backends/nvidia/driver.py) on first GPU access.
Without a C compiler and Python headers in the runtime image, the
very first forward pass of any Unsloth model dies with:
RuntimeError: Failed to find C compiler.
Please specify via CC environment variable.
The builder stage has build-essential and python3.12-dev so this
worked during the build's verification step (no GPU = no Triton kernel
call = no C extension build). But the runtime stage stripped those
out for size, so the failure only surfaces when a real user runs
training inside the container.
Add gcc + g++ + python3.12-dev to the runtime stage. Increases the
runtime image by ~250MB, which is the cost of letting Triton JIT
correctly. Pre-compiling CudaUtils at build time would need a real
CUDA device (the constructor calls cuda runtime functions), so
shipping the toolchain is the right trade-off.
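A runtime image could detect the missing toolchain before the first forward pass rather than during it. The sketch below is an approximation under stated assumptions: find_c_compiler is a hypothetical helper, and the candidate list (CC, then gcc, clang, cc) is a guess at a reasonable lookup order, not Triton's exact logic.

```python
import os
import shutil


def find_c_compiler(env=None, which=shutil.which):
    """Return a C compiler path or name, or None if nothing usable is found.

    Honors the CC environment variable first, then searches PATH.
    """
    env = os.environ if env is None else env
    if env.get("CC"):
        return env["CC"]
    for candidate in ("gcc", "clang", "cc"):
        path = which(candidate)
        if path:
            return path
    return None
```

An entrypoint could call this once at startup and print an actionable hint if it returns None, instead of letting the first Triton kernel launch raise the RuntimeError quoted above.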
Smoke-test validation on a fresh deploy host (AWS B200, not the build host)
End-to-end validated
What was validated
Unsloth's own banner inside the container:
Loss progression
Bit-for-bit identical to the internal validation on a GCP B200 (different host, same image):
Bugs caught and fixed during validation
Each was a real defect surfaced only by running the image on a fresh host:
Full gpt-oss-20B fine-tuning notebook run still pending; will post follow-up.
…ia.com/cuda/gpus
TORCH_CUDA_ARCH_LIST now covers the full set of compute capabilities NVIDIA publishes on https://developer.nvidia.com/cuda/gpus for x86_64 hardware, from Turing onward:
  sm_75   Turing         T4, RTX 20-series, Quadro RTX
  sm_80   Ampere DC      A100, A30
  sm_86   Ampere         A40, RTX A6000, RTX 30-series
  sm_89   Ada            L4, L40, L40S, RTX 40-series
  sm_90   Hopper         H100, H200, GH200
  sm_100  Blackwell DC   B100, B200, GB200
  sm_103  Blackwell DC   B300, GB300
  sm_120  Blackwell      RTX 50-series, RTX PRO 6000 Blackwell
  sm_121  Blackwell      GB10 (DGX Spark)
with +PTX on the highest entry so future arch revisions can JIT.
Setting TORCH_CUDA_ARCH_LIST only affects nvcc invocations for any source build the user adds on top of this image (e.g. flash-attn, a custom CUDA op). The prebuilt cu128 wheels already include SASS for sm_70/75/80/86/90/100/120 (verified at build time via torch._C._cuda_getArchFlags()). Ada (sm_89), B300 (sm_103) and DGX Spark (sm_121) GPUs run via JIT-PTX from the nearest available arch.
Jetson archs (sm_87 Orin, sm_110 Thor) are intentionally NOT included -- they require aarch64 wheels and this image is linux/amd64 only.
Also lower the entrypoint's compute-capability gate from sm_80 to sm_75. Turing GPUs work, with the caveat that bfloat16 is unavailable; the entrypoint prints a NOTE in that case so Unsloth's fp16 fallback isn't a surprise.
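The arch-list string and the lowered gate can be derived from the same capability table. The sketch below is illustrative only: arch_list and gate are hypothetical helpers, and the NOTE wording is an assumption rather than the entrypoint's actual text.

```python
def arch_list(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value.

    Entries are deduplicated and sorted, and +PTX is appended to the highest
    one so newer architectures can JIT from its PTX.
    """
    ordered = sorted(set(capabilities))
    if not ordered:
        raise ValueError("at least one compute capability is required")
    parts = [f"{major}.{minor}" for major, minor in ordered]
    parts[-1] += "+PTX"
    return ";".join(parts)


def gate(capability, minimum=(7, 5)):
    """Return (allowed, note) for a detected (major, minor) capability."""
    if capability < minimum:
        return False, "unsupported: below sm_75"
    if capability < (8, 0):
        return True, "NOTE: bfloat16 is unavailable on Turing; expect the fp16 fallback."
    return True, ""
```

Generating both the environment variable and the gate from one list is one way to keep them from drifting apart, for example by the list starting at 8.0 while the gate admits sm_75.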
💡 Codex Review
Reviewed commit: dde5170e7a
        ("sm_120", "Blackwell", "RTX 50-series, RTX PRO 6000 Blackwell"),
        ("sm_121", "Blackwell", "GB10 (DGX Spark)"),
    )
    if major < 7 or (major == 7 and minor < 5):
Reject pre-Ampere GPUs in entrypoint check
The entrypoint claims to fail fast on GPUs older than Ampere, but this condition only rejects devices below sm_75, so Turing (sm_75) is allowed through. That contradicts both the script’s own header (Unsloth requires sm_80+) and smoke_test.py, which exits on any cap[0] < 8; on T4/RTX20 hosts the container starts successfully and then fails later during validation/workloads instead of giving the intended immediate, actionable error.
HF Hub does not act as a generic OCI registry for arbitrary Docker images -- the registry.hf.space endpoint only serves images that Spaces have built, not images pushed by `docker push`. So we cannot do `docker push huggingface.co/user/repo:tag` for an Unsloth image.
For cross-host testing where we want one canonical place to pull from (and Docker Hub credentials are not yet configured), wrap the manual flow into push/pull-shaped commands:
  hf_push.sh: docker save | pigz | huggingface-cli upload
  hf_pull.sh: huggingface-cli download | gunzip | docker load
This is an approximation, not real OCI semantics -- every push uploads the full ~4 GB blob, no layer dedup, no manifest negotiation. Good for testing across A100 / H100 / RTX 6000 boxes; the real release should go through .github/workflows/docker-publish.yml to Docker Hub, which gets layer dedup, multi-arch manifest support, and standard `docker pull` UX for users.
Usage:
  bash docker/hf_push.sh unsloth-blackwell:test danielhanchen/unsloth-blackwell-docker
  bash docker/hf_pull.sh danielhanchen/unsloth-blackwell-docker unsloth-blackwell-test.tar.gz unsloth-blackwell:test
💡 Codex Review
Reviewed commit: 4bfb4b891a
    with:
      images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
      tags: |
        type=raw,value=latest,enable={{is_default_branch}}
Prevent workflow_dispatch builds from overwriting latest
This tag rule publishes latest whenever the run is on the default branch, but workflow_dispatch also allows overriding UNSLOTH_REF/UNSLOTH_ZOO_REF to arbitrary refs. A manual test run from main can therefore push a non-release image as latest, which makes downstream users pull an unintended build. Gate latest to trusted release flows (or only when both refs are the defaults) to avoid accidental retagging.
    set -euo pipefail

    REPO="${1:?usage: hf_pull.sh <hf_repo> [<blob>] [<verify_tag>]}"
    BLOB="${2:-unsloth-blackwell.tar.gz}"
Match hf_pull default blob name to push naming
hf_push.sh uploads archives as <image-name>-<tag>.tar.gz, but hf_pull.sh defaults to unsloth-blackwell.tar.gz. If users run the documented short form bash docker/hf_pull.sh <hf_repo> after a normal push, the download will target a filename that was never uploaded and fail. Keeping both scripts on the same default naming convention avoids this broken default path.
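A single naming function shared by both scripts would remove the mismatch. The sketch below shows the <image-name>-<tag>.tar.gz convention in Python; default_blob_name is a hypothetical helper, and dropping the registry/namespace prefix and defaulting to the tag latest are assumptions, not documented behaviour of hf_push.sh.

```python
def default_blob_name(image_ref):
    """Map an image reference such as 'name:tag' to '<image-name>-<tag>.tar.gz'."""
    # Assumption: only the final path component names the archive, so a
    # registry host (possibly with a port) or namespace prefix is dropped.
    last = image_ref.rsplit("/", 1)[-1]
    name, sep, tag = last.partition(":")
    # Assumption: an untagged reference is treated as 'latest'.
    return f"{name}-{tag if sep else 'latest'}.tar.gz"
```

With this convention, unsloth-blackwell:test maps to unsloth-blackwell-test.tar.gz, which matches the blob name used in the documented hf_pull.sh invocation.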
…face-cli`
In huggingface_hub >= 0.27 the `huggingface-cli` binary is deprecated and prints a "Use hf instead" notice then exits without doing the operation. The previous wrappers ran `huggingface-cli upload/download` silently, treated the deprecation exit as success, and uploaded nothing.
Detect the new `hf` binary first and use that. If only the legacy `huggingface-cli` is on PATH (older installs), fall back with a WARN so users know the failure mode if anything goes sideways.
Also: hf_pull.sh now asserts the downloaded file is non-empty (`test -s`) so we catch silent download failures before the `docker load` step.
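The detection order and the non-empty check can be sketched as below. Both function names are hypothetical, the warning text is an assumption, and the real wrappers are bash scripts.

```python
import os
import shutil


def pick_hf_cli(which=shutil.which):
    """Return (binary, warning): prefer `hf`, fall back to `huggingface-cli`."""
    if which("hf"):
        return "hf", None
    if which("huggingface-cli"):
        return "huggingface-cli", "WARN: using legacy huggingface-cli; it may exit without doing the operation"
    raise RuntimeError("neither hf nor huggingface-cli found on PATH")


def assert_non_empty(path):
    """Rough equivalent of `test -s path` for a regular file: must exist with size > 0."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise RuntimeError(f"{path} is missing or empty")
```

Checking the artifact itself, rather than trusting the exit code of the CLI, is what catches the case where the tool exits 0 without transferring anything.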
…t call
`jupyter nbconvert --to script nb.ipynb --output nb 2>/dev/null` was
silently exiting 0 without producing the output file in some
environments (likely because jupyter/jupyter_core wasn't on PATH or
nbconvert's --output handling differed across versions). The 2>/dev/null
hid the underlying error, and `set -e` did not catch the missing-output
case because nbconvert itself returned 0.
Switch to a direct nbformat-based conversion:
pip install -q nbformat
python -c "import nbformat; nb=nbformat.read('nb.ipynb', as_version=4);
code='\n\n'.join(c.source for c in nb.cells if c.cell_type == 'code')
open('nb.py','w').write(code + '\n')"
Smaller dep set, no shell-out to a jupyter wrapper script, and an
explicit `test -s nb.py` afterwards catches any silent failure
before downstream steps try to read the file.
Reproduces the failure on RTX PRO 6000 Blackwell (sm_120, docker
29.2.1, ubuntu 24.04) where nbconvert's CLI silently no-op'd.
💡 Codex Review
Reviewed commit: 8344fa0a56
        TRITON_CACHE_DIR=/workspace/.cache/triton \
        # Keep the arch list visible at runtime in case the user source-builds anything
        # extra inside the container (e.g. a custom CUDA op).
        TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;10.0;12.0+PTX"
Include sm_75 in runtime arch list
The runtime image advertises support from Turing onward (see entrypoint.sh supported list), but TORCH_CUDA_ARCH_LIST here starts at 8.0. Any CUDA extension compiled inside the running container (the exact use case this env var comment describes) will be built without sm_75, so on T4/RTX20 hosts those kernels can fail at runtime with “no kernel image is available” despite the container claiming that architecture is supported.
The previous nbformat-based conversion dumped raw cell.source for every code cell. The gpt-oss-20B notebook's first cell uses Jupyter !shell magic to install dependencies:
  !pip install --upgrade -qqq uv
  !uv pip install -qqq ... \
      git+https://github.com/triton-lang/triton.git@0add68...
  ...
Dumped verbatim, the `@0add68...` token tripped the Python parser with "SyntaxError: invalid decimal literal" before training could even start.
The container already has unsloth, triton, transformers, etc. baked in, so we don't need the notebook's install cell. Skip any cell whose source contains pip/install markers, and comment out stray !cmd / %magic lines in any other cells. Then assert nb.py parses with ast.parse() before trying to run it -- catches conversion failures up front instead of at training time.
Reproduces on RTX PRO 6000 Blackwell (sm_120, fresh Docker 29.2.1 host) where the previous conversion produced an invalid nb.py.
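The filtering described above can be sketched on plain cell-source strings. This is a minimal illustration, not the script from the PR: cells_to_script and the marker tuple are assumptions, and reading the notebook with nbformat is left out so the example stays self-contained.

```python
import ast

# Hypothetical markers for "this cell installs dependencies".
INSTALL_MARKERS = ("pip install", "uv pip")


def cells_to_script(code_cells):
    """Join code-cell sources into one script that is guaranteed to parse.

    Cells containing an install marker are dropped entirely. In the remaining
    cells, lines starting with ! or % are replaced by `pass  # <original>` so
    that an indented block never ends up empty.
    """
    kept = []
    for source in code_cells:
        if any(marker in source for marker in INSTALL_MARKERS):
            continue  # dependencies are already baked into the image
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("!", "%")):
                indent = line[: len(line) - len(stripped)]
                line = f"{indent}pass  # {stripped}"
            lines.append(line)
        kept.append("\n".join(lines))
    script = "\n\n".join(kept) + "\n"
    ast.parse(script)  # raises SyntaxError now rather than at training time
    return script
```

One caveat of this simple prefix test: a Python continuation line that happens to begin with % would also be rewritten, so a production version may want a stricter pattern.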
💡 Codex Review
Reviewed commit: 391532c031
    git clone --depth 1 -b docker-blackwell-build \
      https://github.com/unslothai/unsloth.git /tmp/unsloth-pr 2>&1 | tail -3
Clone stable ref in fallback build-context path
When this script is run outside the repo tree, the fallback path clones a hardcoded docker-blackwell-build branch. That branch name is PR-specific and can disappear after merge, so the fallback clone will fail and Block 2 cannot build at all. This breaks the script’s advertised “clone if needed” flow for users validating from a clean host; use a stable default ref (or a configurable ref input) instead of a transient PR branch.
Cross-host validation #2: sm_120 (RTX PRO 6000 Blackwell, 96GB), full gpt-oss-20B fine-tuning
End-to-end validated on a fresh GCP RTX PRO 6000 Blackwell Server Edition host. Image pulled from HF (huggingface.co/danielhanchen/unsloth-blackwell-docker), loaded into Docker, and exercised with both the smoke test AND the full gpt-oss-20B fine-tuning notebook.
Setup
Smoke test (5-step LoRA on Llama-3.2-1B-bnb-4bit)
Loss is ~0.4% off the B200 (sm_100) run (4.1108 → 3.7511); expected because sm_100 vs sm_120 Triton kernels produce slightly different bf16 rounding paths, deterministically.
Full gpt-oss-20B fine-tuning (10 LoRA steps, MXFP4 + MoE)
10 SFT steps on `HuggingFaceH4/Multilingual-Thinking` with the gpt-oss-20B MXFP4 model. Real workload, real MoE expert LoRA, real Harmony format inference at multiple reasoning_effort levels:
```
Unsloth: Detected MoE model with num_experts = 32 and target_modules = [...].
Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)
step 1: loss=1.071 grad_norm=2.805
train_runtime = 141.7s (2.36 minutes)
```
Post-train inference at `reasoning_effort="medium"` and `"high"` produced coherent French reasoning output via the Harmony format -- confirming MXFP4 weights are loading correctly, MoE expert routing works, and the Triton kernels JIT-compile for sm_120 at first use.
What this validates
Additional bugs caught and fixed during this validation
Make the docker image multi-arch so DGX Spark (GB10, sm_121, aarch64) and
the Grace-Hopper / Grace-Blackwell SoCs (GH200 arm64, GB200 arm64) pull a
natively-built arm64 child from the same manifest. Runtime emulation is
NOT involved -- QEMU is used only for the cross-compile step on x86_64
CI runners; consumers on aarch64 hosts get a normal arm64 image and CUDA
works as on any other host.
Dockerfile:
* ARG TARGETARCH; switch unsloth extras between cu128-ampere-torch2100
(amd64, with xformers) and huggingface (arm64, no xformers -- there
is no cu128 aarch64 xformers wheel as of 0.0.34, so we fall back to
Unsloth's native SDPA path; ~5-10% slowdown but functionally complete).
* Build-time torch._C._cuda_getArchFlags() assertion: amd64 still
requires sm_120, arm64 accepts sm_120 or sm_121.
* Same TORCH_CUDA_ARCH_LIST on both arches; nvcc emits whatever's listed.
docker/setup_qemu.sh (new):
One-time host setup -- registers binfmt_misc handlers via
tonistiigi/binfmt and creates a 'unsloth-multiarch' docker-container
buildx builder. Required only on x86_64 build hosts targeting arm64.
docker/test_locally.sh:
--platform amd64|arm64 flag. Cross-builds verify QEMU is registered,
then build through the in-image arch-flags assertion. Smoke + notebook
blocks auto-skip when image arch != host arch (CUDA cannot run under
user-space QEMU + nvidia-container-toolkit cannot bridge a QEMU guest
to a real GPU).
.github/workflows/docker-publish.yml:
platforms: linux/amd64,linux/arm64 (single manifest, two children).
Timeout bumped 60 -> 150 min for the slower arm64-under-QEMU leg.
docker/setup-qemu-action@v3 with platforms: arm64 (was implicit before).
DGX Spark / linux/arm64 support added via QEMU at build time (e7cfcea). The image is now multi-arch: one Docker manifest with linux/amd64 + linux/arm64 children. What changed:
Arm64 GPU SoCs supported by the new variant:
End-to-end arm64 validation on actual hardware is pending -- the build-time assertion exercises the wheel resolution and the cu128 aarch64 fat binary, but the final proof is |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7cfceadab
| import unsloth_zoo | ||
|
|
||
| print(f"unsloth_zoo {unsloth_zoo.__version__}") | ||
| import xformers |
Gate xformers import by platform in smoke test
The arm64 image path intentionally omits xformers (UNSLOTH_EXTRA="huggingface" in docker/Dockerfile) because no cu128 aarch64 wheel is expected, but check_imports() unconditionally does import xformers here. On arm64 runs this makes /workspace/smoke_test.py fail before training, so the published arm64 variant cannot pass the repository’s own runtime validation despite being a supported target.
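One way the gating suggested here could look is sketched below. It assumes the smoke test can decide from the machine architecture alone whether xformers is expected; the helper names `xformers_expected` and `check_optional_xformers` are hypothetical and are not taken from smoke_test.py.

```python
import importlib.util
import platform
from typing import Optional


def xformers_expected(machine: Optional[str] = None) -> bool:
    """Assumption: only the amd64 image ships xformers; arm64 uses the SDPA path."""
    machine = (machine or platform.machine()).lower()
    return machine in ("x86_64", "amd64")


def check_optional_xformers(machine: Optional[str] = None) -> str:
    """Report xformers status without failing on architectures that omit it."""
    if not xformers_expected(machine):
        return "xformers skipped (not shipped on this architecture)"
    if importlib.util.find_spec("xformers") is None:
        raise ImportError("xformers is expected on amd64 but is not installed")
    import xformers  # imported only where the image is supposed to ship it

    return f"xformers {xformers.__version__}"
```

On an arm64 host the check returns a status string instead of raising, so the rest of the smoke test can continue to the training steps.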
| print(f"cuda build {torch.version.cuda}") | ||
| print(f"arches {arches}") | ||
| assert "sm_100" in arches, f"sm_100 missing: {arches}" | ||
| assert "sm_120" in arches, f"sm_120 missing: {arches}" |
Accept sm_121 in smoke-test arch validation
This assertion hard-requires sm_120, but the same commit’s Dockerfile build validation explicitly treats arm64 Blackwell as valid when either sm_120 or sm_121 is present. As written, a valid arm64 build that reports only sm_121 will fail the smoke test with a false negative, even though the image is intended to support GB10/DGX Spark.
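A platform-aware version of the arch check might look like the sketch below. The function `validate_arches` is hypothetical; it takes the space-separated string that `torch._C._cuda_getArchFlags()` is described as returning, and it mirrors the Dockerfile rule quoted in this thread (amd64 requires sm_120, arm64 accepts sm_120 or sm_121). Keeping the sm_100 requirement on both architectures is an assumption carried over from the existing assertions.

```python
def validate_arches(arch_flags: str, machine: str) -> None:
    """Raise AssertionError when the wheel lacks the expected Blackwell targets."""
    arches = set(arch_flags.split())
    if "sm_100" not in arches:
        raise AssertionError(f"sm_100 missing: {sorted(arches)}")
    if machine.lower() in ("x86_64", "amd64"):
        accepted = {"sm_120"}
    else:
        accepted = {"sm_120", "sm_121"}  # arm64: either target is acceptable
    if not arches & accepted:
        raise AssertionError(f"none of {sorted(accepted)} present: {sorted(arches)}")
```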
…st tag, arm64 decord
- pip shim: count editable/local/url/vcs targets (-e ., ., git+https, wheel URLs) as install targets, not just canonical package names, so they are no longer silently skipped inside notebooks
- notebook sync: never overwrite a pre-existing user notebook on first boot (match the refresh path's ownership rule); skip .unsloth_sync_state.tmp when recording state so it is not tracked as a managed file
- docker-publish: set flavor latest=false on the base image metadata so a v* tag push cannot publish :latest from the base image (the Studio image owns it)
- notebook deps: pin to tested versions and install decord on its own, hard on amd64 and fail-soft on arm64 (no aarch64 wheel) so the arm64 base build works
The base build-args never passed LLAMA_PREBUILT_TAG, so the Dockerfile fell back to latest and each matrix leg resolved whatever unslothai/llama.cpp release was current at its own build time. If latest moved between the amd64 and arm64 legs, one published manifest could carry different GGUF binaries per arch. Resolve the release once in a new prepare job (explicit llama_prebuilt_tag dispatch input for a frozen build, else follow the /releases/latest redirect to a concrete tag, mirroring docker/build.sh) and pass that single tag to both legs.
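The "resolve once, pass to both legs" rule can be illustrated with a small sketch. It relies on GitHub redirecting `/releases/latest` to a `/releases/tag/<tag>` URL; the HTTP request itself is left out, and the function names and the example tag `b1234` are hypothetical rather than taken from the workflow or docker/build.sh.

```python
from urllib.parse import urlparse


def tag_from_release_url(url: str) -> str:
    """Extract the concrete tag from a .../releases/tag/<tag> redirect target."""
    parts = urlparse(url).path.rstrip("/").split("/")
    if len(parts) < 2 or parts[-2] != "tag" or not parts[-1]:
        raise ValueError(f"not a release tag URL: {url}")
    return parts[-1]


def resolve_prebuilt_tag(explicit_tag: str, latest_redirect_url: str) -> str:
    """Prefer an explicit (frozen) tag, else use the tag behind 'latest'."""
    return explicit_tag or tag_from_release_url(latest_redirect_url)
```

Because the result is a plain string computed once, every matrix leg that receives it builds against the same release even if `latest` moves while the build is running.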
Thanks, all four addressed.
Gate decord on amd64 builds (8f693c6): decord now installs on its own line, hard on amd64 and fail-soft on arm64 (no aarch64 wheel, no sdist), so the arm64 base build no longer dies in that layer.
Invoke pip for requirement-file installs (8f693c6): the final guard now counts any non-flag token as an install target (
Disable auto latest for tag refs (8f693c6):
Pin llama.cpp prebuilt in release builds (8402dce): the base build-args never passed
@codex review
Went through the latest review pass (the 5-reviewer run plus the Codex review) and fixed the genuinely-open items on this branch:
Also re-merged current Most of the older comments in this PR were already resolved by the branch's history (build-args prose moved out of the One follow-up that is intentionally out of scope here: the reviewers flagged asymmetric CPU-mode CUDA-probe guards in |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8402dcebdd
| fi | ||
| cp -a "$TEMPLATE/$rel" "$DEST/$rel" 2>/dev/null || true | ||
| done | ||
| record_state |
Preserve pre-existing notebooks outside the sync state
When /workspace/unsloth-notebooks already contains a user-edited file on first boot, the branch above skips copying the template, but this record_state call records every file under DEST as if the sync wrote it. On the subsequent GitHub refresh, that file's current hash matches the recorded hash, so it is treated as pristine and can be overwritten by the upstream notebook despite being pre-existing user data.
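The ownership rule at stake can be sketched in Python as follows. The real sync is a shell script (unsloth_sync_notebooks.sh), so this is an illustration of the idea only; `first_boot_sync` and `file_hash` are hypothetical names, and recording byte-identical pre-existing files as managed is an assumption.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """SHA-256 of a file's bytes, used as the recorded sync state."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def first_boot_sync(template: Path, dest: Path) -> dict:
    """Copy template notebooks into dest and return the managed-file state.

    Only files this function wrote (or that already match the template
    byte for byte) are recorded; a differing pre-existing file is kept
    and left out of the state, so a later refresh sees it as user-owned.
    """
    state = {}
    for src in sorted(p for p in template.rglob("*") if p.is_file()):
        rel = src.relative_to(template).as_posix()
        dst = dest / rel
        if dst.exists():
            if file_hash(dst) == file_hash(src):
                state[rel] = file_hash(dst)
            continue  # never overwrite or record a user's own file
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(src.read_bytes())
        state[rel] = file_hash(dst)
    return state
```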
| # not just tokens with a canonical pkg name: editable / local / url / vcs | ||
| # installs (`-e .`, `.`, `git+https://...`, a wheel URL) carry no canonical | ||
| # name but must still run, and a `-r`/`-c` file pulls in real requirements. | ||
| has_install_target = any(not t.startswith("-") for t in keep_args) |
Ignore option values when deciding whether pip has targets
For notebook install cells that only request baked packages plus an index/link flag, e.g. pip install --extra-index-url https://download.pytorch.org/whl/cu128 torch transformers==5.3.0, the shim drops the package specs but leaves the flag and its URL in keep_args. This test then treats the URL as an install target and execs pip install --extra-index-url <url> with no requirement, causing the cell to fail instead of no-oping after preserving the baked stack.
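A target check that skips option values could look roughly like this. It is a sketch, not the shim's code: `VALUE_FLAGS`, `REQ_FILE_FLAGS` and `has_install_target` are illustrative names, and the flag sets only cover a handful of pip options that take a separate value.

```python
# pip options whose next token is a value, not something to install
VALUE_FLAGS = {
    "-i", "--index-url", "--extra-index-url",
    "-f", "--find-links", "-c", "--constraint", "-t", "--target",
}
# options whose value is itself a source of requirements
REQ_FILE_FLAGS = {"-r", "--requirement"}


def has_install_target(args: list) -> bool:
    """Return True when the remaining pip arguments still name something to install."""
    pending_flag = None
    for tok in args:
        if pending_flag is not None:
            flag, pending_flag = pending_flag, None
            if flag in REQ_FILE_FLAGS:
                return True  # a requirements file pulls in real requirements
            continue  # the value of an index/link/constraint flag is not a target
        if tok in VALUE_FLAGS or tok in REQ_FILE_FLAGS:
            pending_flag = tok
            continue
        if not tok.startswith("-"):
            return True  # positional: package spec, path, URL or VCS reference
    return False
```

With this check, the example from the review (`--extra-index-url <url>` left over after the package specs are dropped) reports no target, so the cell can no-op instead of invoking pip without a requirement.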
| # The full Studio image owns the unprefixed namespace, headed by | ||
| # :latest. Same :latest gating rationale as the base job. | ||
| type=raw,value=latest,enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) && github.event.inputs.unsloth_ref == '' }} | ||
| type=ref,event=tag |
Disable implicit latest tags for Studio tag builds
This Studio metadata block omits flavor: latest=false, so type=ref,event=tag can still emit an implicit :latest tag under metadata-action's default latest=auto behavior (docs), bypassing the explicit branch-only gate above on v* tag pushes. The mirrored smoke-test metadata block has the same config, so it can also pull :latest instead of the tag that was just published.
| # (or any branch/tag/sha); otherwise take the latest PyPI release. | ||
| if [ -n "$REF" ]; then | ||
| SPECS="git+https://github.com/unslothai/unsloth.git@${REF}#egg=unsloth" | ||
| SPECS="$SPECS git+https://github.com/unslothai/unsloth-zoo.git@${REF}#egg=unsloth_zoo" |
Resolve unsloth-zoo separately for ref updates
When --ref is an Unsloth release tag or commit SHA, this installs unsloth-zoo from the same ref even though that repo is not guaranteed to have matching tags or SHAs; the publish workflow already has separate zoo-ref resolution for this reason. In those cases the advertised unsloth-studio-update --ref <tag|sha> path fails before updating Studio, so the script should resolve the zoo ref independently or leave it on a known default.
| # omegaconf TTS families + both NeMo-Gym RL notebooks' config objects | ||
| # einx TTS codec tensor-rearrange (Llasa / Oute / Spark TTS) | ||
| # librosa Whisper audio feature extraction (pairs with soundfile + torchcodec) | ||
| # ftfy Oute TTS text normalisation |
Remove unsupported sm_103 from the CUDA 12.8 arch list
This runtime arch list is consumed by PyTorch/CUDA extension builds inside the container, but the image only installs CUDA 12.8 nvcc on amd64 while NVIDIA documents compiler target support for sm_103 as added in CUDA 12.9 (CUDA features archive). Any pip install/JIT path that honors TORCH_CUDA_ARCH_LIST will pass an unsupported compute_103 target to nvcc 12.8 and fail, even on non-B300 hosts; drop 10.3 here or ship a 12.9+/13 compiler wherever it is advertised.
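To make the constraint concrete, here is a small sketch that drops arch-list entries the bundled compiler cannot target. The only version fact encoded is the one stated in this review (sm_103 support arriving in CUDA 12.9); the table `MIN_CUDA_FOR_ARCH` and the function `filter_arch_list` are hypothetical, and the PR itself simply removes 10.3 from the list instead of filtering at runtime.

```python
# Minimum CUDA toolkit version per arch; only the entry discussed above is listed.
MIN_CUDA_FOR_ARCH = {"10.3": (12, 9)}


def filter_arch_list(arch_list: str, cuda_version: tuple) -> str:
    """Drop TORCH_CUDA_ARCH_LIST entries that the given nvcc version cannot compile."""
    kept = []
    for entry in arch_list.replace(" ", ";").split(";"):
        if not entry:
            continue
        base = entry.split("+", 1)[0]  # strip a "+PTX" suffix before the lookup
        if cuda_version >= MIN_CUDA_FOR_ARCH.get(base, (0, 0)):
            kept.append(entry)
    return ";".join(kept)
```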
…gs, zoo ref, arch list)
- pip shim: do not treat the value of an index-url / find-links / constraint flag as an install target. A cell like 'pip install --extra-index-url <url> torch' now no-ops after keeping the baked stack instead of exec'ing a bare 'pip install --extra-index-url <url>' that fails. Positional . / url / vcs and -r/--requirement files still count as targets.
- notebook sync: on first boot, record only files we actually wrote (or that are byte-identical to the template), never a kept pre-existing user file; and on the GitHub refresh, treat a file present in DEST but absent from the sync state as user-owned and keep it. Previously a bind-mounted notebook was recorded as managed and then overwritten by upstream.
- docker-publish: add flavor latest=false to the Studio metadata steps too, so a v* tag push cannot emit an implicit :latest via metadata-action's latest=auto; :latest stays default-branch-only, and the smoke test pulls the published tag.
- unsloth-studio-update: resolve the unsloth-zoo ref independently of --ref (new --zoo-ref, else use the ref only when the zoo repo has it, else fall back to main) so 'update --ref <unsloth-tag/sha>' does not fail on a missing zoo ref.
- Dockerfile: drop 10.3 (compute_103) from TORCH_CUDA_ARCH_LIST in both the builder and runtime stages. B300 runs sm_100 SASS, and the bundled CUDA 12.8 nvcc cannot compile compute_103 (added in 12.9), which broke arch-list-honoring source / JIT builds.
All five are valid; fixed:
Validation: |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0ebbdbb9cc
| if prev_flag in _REQ_FILE_FLAGS: | ||
| has_target = True |
Filter requirement files before invoking pip
When a notebook install cell uses pip install -r requirements.txt, this branch marks the requirements file as a real target and later execs pip with the file unchanged. Any protected packages inside that file, such as torch, transformers, vLLM, or nvidia wheels, bypass _KEEP and the Transformers sidecar marker logic, so a requirements file with common training deps can overwrite the baked cu128 stack or install Transformers into the base venv. Parse/filter requirement files or reject files that mention protected packages before passing them through.
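A line-level filter of the kind requested here might be sketched as below. The protected set is a placeholder (the shim's `_KEEP` contents are not shown in this thread), `canonical_name` and `filter_requirement_lines` are hypothetical helpers, and option lines are kept untouched in this simplified version.

```python
import re

# Placeholder for the shim's protected set; the real _KEEP list is not shown here.
PROTECTED = {"torch", "transformers", "vllm"}


def canonical_name(spec: str):
    """Return the normalised distribution name at the start of a requirement line."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
    if match is None:
        return None
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def filter_requirement_lines(lines: list):
    """Split requirement lines into (kept, dropped) by the protected set."""
    kept, dropped = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "-")):
            kept.append(line)  # blank, comment or option line: keep as written
            continue
        if canonical_name(stripped) in PROTECTED:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

The kept lines would then be written to a temporary file and passed to pip in place of the original `-r` argument.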
| # out of build-args -- forwarded lines must be KEY=VALUE only.) | ||
| build-args: | | ||
| BASE_IMAGE=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.merge.outputs.digest }} | ||
| UNSLOTH_STUDIO_REF=${{ github.event.inputs.unsloth_ref || (startsWith(github.ref, 'refs/tags/') && github.ref_name) || github.sha || 'main' }} |
Forward the resolved zoo ref into Studio builds
For workflow dispatches that set unsloth_zoo_ref (or future tag pushes where the zoo resolver returns something other than main), the base image bakes steps.zoo_ref.outputs.ref but the Studio job only forwards the Unsloth ref. Dockerfile.studio runs install.sh --local, and that local-install path overlays unsloth-zoo from git main, so the published full image can run a Studio backend with a different zoo than the base image and the operator-requested ref; pass the resolved zoo ref through this build and install that ref in the Studio venv.
| # see unsloth_sync_notebooks.sh + unsloth_nb_content_sig.py. Inherited as-is by | ||
| # the studio image (FROM base). | ||
| RUN set -eux \ | ||
| && git clone --depth 1 https://github.com/unslothai/notebooks /opt/unsloth-notebooks \ |
Pin baked notebooks to one resolved commit
Each architecture leg runs this clone independently against unslothai/notebooks HEAD, so if that repo advances between the amd64 and arm64 builds (or between release reruns), the same Docker tag seeds different baked notebook templates and .unsloth_template_commit state depending on the pulled platform. Resolve the notebooks commit once in the workflow/build script and clone that ref, like the llama.cpp prebuilt tag, so the multi-arch image contents stay consistent.
…inned notebooks commit)
- unsloth_pip_shim.py: filter protected packages out of a notebook `pip install -r requirements.txt`. The -r value was passed to the real pip unchanged, so torch / transformers / vLLM / nvidia pins inside the file could overwrite the baked cu128 stack or push transformers into the base venv. _filter_requirements_file() applies the same _KEEP / transformers-sidecar rules per line, writes the survivors to a temp file, keeps comments, option lines, nested includes and urls verbatim, and records a pinned transformers version for the sidecar.
- install.sh + Dockerfile.studio + docker-publish.yml: forward the resolved unsloth-zoo ref into the Studio build. install.sh --local overlaid unsloth-zoo from git main regardless of the operator-requested or base-image ref, so the full image could run a different zoo than the base. install.sh now honors UNSLOTH_ZOO_REF across all four --local overlays, Dockerfile.studio passes UNSLOTH_STUDIO_ZOO_REF through to it, and the workflow resolves one zoo ref in the prepare job and shares it with both the base and Studio builds.
- Dockerfile + docker-publish.yml: pin unslothai/notebooks to one resolved commit. Each arch leg cloned HEAD independently, so the same tag could seed different baked templates and .unsloth_template_commit depending on the pulled platform. The prepare job freezes notebooks to one sha (like the llama.cpp prebuilt tag) and the Dockerfile fetches that single ref at depth 1.
All three were valid. Fixed in
Validation: |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d476c7764b
| if re.match(r"^[a-z]+\+", token) or "://" in token or token.startswith((".", "/")): | ||
| return None # vcs / url / local path -> let it pass through |
Keep protected direct-reference installs out
When a notebook uses a quoted PEP 508 direct reference for a protected package, e.g. pip install "torch @ https://.../torch.whl" or "unsloth @ git+https://...", this early URL check returns None before extracting the distribution name. The token is then kept and treated as a real target, so the shim can reinstall torch/Unsloth into the base venv even though _KEEP is supposed to preserve the baked CUDA stack.
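Extracting the distribution name from a `name [extras] @ url` token before the URL guard runs could be done along these lines. This is a regex-based sketch under the assumption that a simple pattern is sufficient for the shim's purposes; `direct_reference_name` is a hypothetical helper and is not the `_canon` function from the PR.

```python
import re

# name, optional [extras], then "@" and a non-empty reference
_DIRECT_REF = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*@\s*\S+"
)


def direct_reference_name(token: str):
    """Return the normalised name of a PEP 508 direct reference, else None."""
    match = _DIRECT_REF.match(token)
    if match is None:
        return None
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()
```

A bare VCS URL such as `git+https://...@main` does not match, because the pattern requires a distribution name immediately before the `@`, so those tokens still fall through to the existing URL handling.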
| if tok in _VALUE_FLAGS: | ||
| keep_args.append(tok) | ||
| skip_next = True | ||
| prev_flag = tok | ||
| continue |
Handle equals-form requirement files
For a valid pip invocation like pip install --requirement=requirements.txt (the same --requirement <file> option also accepts the standard --option=value form), this exact-token check does not recognize the requirements file. The argument starts with -, so it is kept as an option but has_target remains false; cells whose only install target is that file return as a no-op and skip all of its dependencies.
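Handling the equals form amounts to splitting `--flag=value` before the flag lookup. A minimal sketch, with hypothetical helper names and covering only the requirement-file flags:

```python
REQ_FILE_FLAGS = {"-r", "--requirement"}


def split_inline_value(token: str):
    """Split '--flag=value' into (flag, value); other tokens return (token, None)."""
    if token.startswith("--") and "=" in token:
        flag, value = token.split("=", 1)
        return flag, value
    return token, None


def requirement_files(args: list) -> list:
    """Collect requirement files given as '-r FILE', '--requirement FILE'
    or '--requirement=FILE'."""
    files = []
    expecting_file = False
    for tok in args:
        if expecting_file:
            files.append(tok)
            expecting_file = False
            continue
        flag, value = split_inline_value(tok)
        if flag in REQ_FILE_FLAGS:
            if value is not None:
                files.append(value)
            else:
                expecting_file = True
    return files
```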
…value req files
Two more notebook-shim gaps from review:
- A quoted PEP 508 direct reference for a protected package, e.g. `pip install "torch @ https://.../torch.whl"` or `"unsloth @ git+https://..."`, bypassed _KEEP: _canon hit the url guard and returned None before pulling the distribution name, so the token was treated as a real target and reinstalled into the base venv. _canon now extracts the name from the `name [extras] @ url` form first, so a protected package pinned through a URL/VCS is still dropped; a non-protected direct reference returns its name and is kept exactly as before.
- The `--requirement=reqs.txt` equals-form (pip accepts `--option=value` for any value-taking flag) was not recognized: the token starts with `-`, so it was kept as an opaque option, the file was never filtered, and has_target stayed false -- a cell whose only target was that file silently no-op'd. The scan now splits `--flag=value`, filters the requirements file for `-r`/`--requirement`, and counts it as a target; other inline-value options stay options.
Both valid. Fixed in
Validation: |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ee7f4b644
| && git checkout -q FETCH_HEAD \ | ||
| && UNSLOTH_STUDIO_HOME="${UNSLOTH_STUDIO_HOME}" \ | ||
| UNSLOTH_TORCH_INDEX_FAMILY="${TORCH_FAMILY}" \ | ||
| UNSLOTH_ZOO_REF="${UNSLOTH_STUDIO_ZOO_REF}" \ |
Use the resolved zoo ref during Studio setup
When a Studio build passes UNSLOTH_STUDIO_ZOO_REF other than main, this only pins the first install.sh overlay; the same --local flow then runs studio/setup.sh, whose install_python_stack.py local branch force-reinstalls unsloth-zoo @ git+https://github.com/unslothai/unsloth-zoo with no @ref, so the Studio venv ends up on zoo main while the base venv used the resolved ref. This still affects workflow_dispatch/tag publishes that intentionally resolve a non-main zoo ref; thread this env through setup/install_python_stack before the final overlay.
| if not stripped or stripped.startswith(("#", "-")): | ||
| out.append(line) # comment / blank / option / nested include -> keep |
Filter nested requirement includes too
For a notebook cell like pip install -r reqs.txt where reqs.txt contains -r torch-reqs.txt, this branch copies the nested include verbatim; pip recursively processes those requirement includes, so protected specs in the nested file never pass through _KEEP or the Transformers marker logic and can still reinstall torch/transformers into the baked venv. The fresh issue in the current helper is the one-level filter: nested -r/--requirement lines are explicitly preserved instead of being filtered or rejected.
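One way to cover nested includes is to walk them recursively and report any protected spec found at any depth, so the caller can filter or reject the file. The sketch below assumes nested paths are resolved relative to the including file; `protected_in_requirements` and the `PROTECTED` set are hypothetical and stand in for the shim's own rules.

```python
import re
from pathlib import Path

# Placeholder for the shim's protected set.
PROTECTED = {"torch", "transformers"}

_INCLUDE = re.compile(r"^(?:-r|--requirement)(?:\s+|=)(\S+)$")
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def protected_in_requirements(path: Path, seen=None) -> list:
    """Return protected requirement lines found in path or any nested include."""
    seen = set() if seen is None else seen
    path = path.resolve()
    if path in seen:
        return []  # guard against include cycles
    seen.add(path)
    found = []
    for raw in path.read_text().splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing inline comments
        if not line or line.startswith("#"):
            continue
        include = _INCLUDE.match(line)
        if include:
            found += protected_in_requirements(path.parent / include.group(1), seen)
            continue
        if line.startswith("-"):
            continue  # other option lines carry no requirement
        name = _NAME.match(line)
        if name and re.sub(r"[-_.]+", "-", name.group(0)).lower() in PROTECTED:
            found.append(line)
    return found
```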
| fi | ||
| fi | ||
| mkdir -p "$(dirname "$dst")" 2>/dev/null || true | ||
| if cp -a "$f" "$dst" 2>/dev/null; then |
Honor deleted notebooks during refresh
When UNSLOTH_KEEP_DELETED_NOTEBOOKS=1 is set, the offline restore block is skipped, but the GitHub refresh still falls through here for any tracked notebook the user deleted and copies it back from the fresh clone whenever upstream advances. That makes the documented deletion opt-out only work until the next remote update; skip paths that are recorded in the state but absent from DEST when this option is enabled.
reviewed on windows 11 / WSL2 (ubuntu 24.04, docker 29.6.1 + nvidia-container-toolkit). built the image came out with ran it with |
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ee7f4b644
| ok_studio=0; ok_jupyter=0 | ||
| for i in $(seq 1 60); do | ||
| if curl -fsS http://localhost:18000/api/health >/dev/null 2>&1; then ok_studio=1; fi | ||
| if curl -fsS http://localhost:18888/api >/dev/null 2>&1; then ok_jupyter=1; fi |
Probe a public Jupyter endpoint
In the full-image smoke test this request hits Jupyter's /api without any token or login cookie. The launcher always configures Jupyter with a password hash when booting the image, so unauthenticated API calls return 403; because this uses curl -f, ok_jupyter never flips and any HAS_GPU_RUNNER publish run reports the full image unhealthy even when JupyterLab is up. The same /api probe is duplicated in docker/docker_confirm.sh, so use a public endpoint such as /login or authenticate the request.
| # setup is not silently ignored. | ||
| if [[ "${UNSLOTH_ALLOW_CPU:-0}" == "1" ]]; then | ||
| if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi -L 2>/dev/null | grep -q '^GPU'; then | ||
| warn "UNSLOTH_ALLOW_CPU=1 and no GPU visible -- continuing on CPU." |
Gate CPU mode before Studio model loads
When the image is started on a CPU-only host with UNSLOTH_ALLOW_CPU=1, this branch lets the full Studio image continue and the comments/docs advertise Studio chat as usable. However UNSLOTH_ALLOW_CPU makes Unsloth report DEVICE_TYPE == "cuda", and Studio inference then calls FastLanguageModel.from_pretrained(...), whose CUDA path still unconditionally executes torch.cuda.get_device_properties(0) (see unsloth/models/llama.py:2310 and unsloth/models/vision.py:747). On Docker Desktop/macOS or Windows+AMD, loading a chat model therefore raises instead of falling back to CPU; either keep CPU mode to tooling/Jupyter or guard those CUDA probes.
| absolute path so there is no recursion. `python -m pip` / `%pip` bypass PATH and | ||
| are not intercepted -- the driven `unsloth-run` handles those by parsing the |
Intercept %pip before it mutates the baked stack
For notebooks that use %pip or python -m pip, this explicitly bypasses the PATH shim, but neither the IPython startup hook nor unsloth-run rewrites those cells or installs a pip module wrapper. In that scenario a cell like %pip install transformers==... or %pip install torch... runs the real pip inside /opt/unsloth-venv and can overwrite the cu128 torch/transformers stack that the shim is meant to protect, so the safe notebook execution path is only safe for !pip/!uv shell commands.
…XDEV, %pip shim)
- docker-publish smoke + docker_confirm.sh probe Jupyter /login, not /api: the launcher always configures a password hash so /api returns 403 and curl -f would never flip the health flag (false build failure).
- entrypoint.sh CPU messaging: CPU mode covers Jupyter, GGUF tooling and llama.cpp (GGUF) Studio chat; training AND loading an Unsloth model (FastLanguageModel) still need a GPU, since from_pretrained runs CUDA probes.
- install_llama_prebuilt.py: rollback/activation moves used bare os.replace, which fails with EXDEV across overlayfs in a Docker build and fell back to a broken source build (no nvcc). Add is_cross_device_error + move_install_dir_aside (os.replace fast path, copy+remove on EXDEV; busy errors still re-raise).
- notebooks: %pip / %uv line magics and the `!python -m pip` form bypassed the PATH pip/uv shim and could overwrite the baked cu128 torch/vLLM stack. Add unsloth_nb_pip_magic.py to re-point them at the shim, wired via the IPython startup hook and installed into the venv site-packages.
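The EXDEV handling described for install_llama_prebuilt.py follows a common pattern: try the atomic rename first, and copy then remove only when the error is a cross-device one. The sketch below shows that pattern with the standard library; `move_dir_with_fallback` is a hypothetical name, and neither function is the implementation from the PR.

```python
import errno
import os
import shutil


def is_cross_device_error(exc: OSError) -> bool:
    """True when a rename failed because source and target are on different filesystems."""
    return exc.errno == errno.EXDEV


def move_dir_with_fallback(src: str, dst: str) -> None:
    """Move a directory, falling back to copy + remove across filesystems."""
    try:
        os.replace(src, dst)  # fast path: atomic on the same filesystem
    except OSError as exc:
        if not is_cross_device_error(exc):
            raise  # busy / permission errors are real failures
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
```

The fallback is not atomic, so a caller that needs rollback semantics has to treat a partially copied destination as a failure and clean it up itself.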
💡 Codex Review
https://github.com/unslothai/unsloth/blob/f1525695e55fe5c85d3f33efb585d4bff3dcadb9/docker/unsloth_nb_content_sig.py#L256-L257
Do not drop captured body cells from signatures
For any real tutorial cell that starts with %%capture or %%bash to hide noisy output or run shell preprocessing, this helper excludes the entire cell from the middle digest even if it does not install packages. If upstream later changes that cell, middle_unchanged reports SAME, the refresh treats the notebook as only header/footer churn, and users keep stale executable content; only classify these magics as boilerplate when the cell is actually an install/setup cell.
| $okStudio = $false; $okJupyter = $false | ||
| foreach ($i in 1..60) { | ||
| if (-not $okStudio) { try { Invoke-WebRequest -UseBasicParsing -Uri "http://localhost:$PORT_STUDIO/api/health" -TimeoutSec 4 | Out-Null; $okStudio = $true } catch {} } | ||
| if (-not $okJupyter) { try { Invoke-WebRequest -UseBasicParsing -Uri "http://localhost:$PORT_JUPYTER/api" -TimeoutSec 4 | Out-Null; $okJupyter = $true } catch {} } |
Probe Jupyter login in the Windows confirmer
When the Windows confirmation script reaches this full-image check, studio_launch.sh has already configured Jupyter with a hashed password, so an unauthenticated request to /api returns 403 even when JupyterLab is healthy. The Linux confirmer and workflow use /login for this reason; leaving the PowerShell path on /api makes Windows users see a false Jupyter failure unless they authenticate the request or probe /login.
| if re.match(r"^[a-z]+\+", token) or "://" in token or token.startswith((".", "/")): | ||
| return None # vcs / url / local path -> let it pass through |
Block VCS egg installs for protected packages
Fresh evidence beyond the earlier fixed PEP 508 case: legacy VCS requirements such as pip install git+https://github.com/unslothai/unsloth.git#egg=unsloth still take this URL/VCS passthrough branch, so _canon() returns None and the shim later executes the token as a real install target. In notebooks using that valid pip form, protected packages can still be reinstalled into the baked venv and bypass _KEEP; parse #egg=/editable values before treating VCS URLs as passthrough.
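Reading the `#egg=` fragment before treating a VCS URL as passthrough could be sketched like this. It uses only the standard library URL parser; `egg_name` is a hypothetical helper, and how the shim would combine it with its protected list is not shown.

```python
import re
from urllib.parse import parse_qs, urlparse


def egg_name(token: str):
    """Return the normalised name from a '#egg=<name>' fragment, else None."""
    fragment = urlparse(token).fragment
    if not fragment:
        return None
    eggs = parse_qs(fragment).get("egg")
    if not eggs:
        return None
    name = eggs[0].split("[", 1)[0]  # drop any extras such as [colab]
    return re.sub(r"[-_.]+", "-", name).lower()
```

The fragment is parsed as a query string because it may carry more than one key, for example `#egg=name&subdirectory=pkg`.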
| # Of those value-flags, the ones whose VALUE is itself an install target: a | ||
| # requirements file pulls real requirements. An index-url / find-links / | ||
| # constraint / target value is an option, not something to install. | ||
| _REQ_FILE_FLAGS = {"-r", "--requirement"} |
Filter constraint files before pip sees them
When a notebook runs something like pip install -c constraints.txt peft and that constraints file pins transformers or torch, this branch keeps -c verbatim because only requirement files are inspected. Pip applies constraints to dependency resolution, so installing a kept package can still downgrade/reinstall protected packages from the constraint file without _KEEP or the sidecar marker ever seeing those specs; filter or reject protected entries in constraint files too.
| # Persistence: the update is written to the container's writable layer, so it | ||
| # survives `docker restart`. To keep it across a full `docker rm` + `docker run` | ||
| # (and to keep your chats/users/models), run Studio with its home on a named | ||
| # volume: -v unsloth_studio_home:/opt/unsloth-studio |
Do not mount over the Studio install home
If users follow this persistence instruction with a fresh named volume, Docker masks the baked /opt/unsloth-studio tree, including bin/unsloth and the Studio venv that supervisord.conf starts. The full image then boots with an empty home and the Studio service cannot exec; persist only a data subdirectory or seed the volume before recommending this mount.
Two parts here: the WSL2 report and the Codex review on the same commit.
@LeoBorcherding thanks for the thorough Windows 11 / WSL2 pass. The llama.cpp EXDEV failure is fixed in 2c31686. install_llama_prebuilt.py no longer uses a bare os.replace for the rollback and activation moves:
On the cu128 dependency: correct, the studio build needs install.sh's
Codex review 4583085472 (same commit) is also addressed in 2c31686:
End to end validation of the published Docker image
I ran the published Blackwell Docker image end to end on the free GitHub-hosted Linux x64 runners. The image is
Feature probe (identical PASS on all three runners):
Live session:
Boot path:
A Jupyter password hash is always configured, so /api returns 403; the Windows confirmation reported a healthy full image as a hard failure. Matches the fix already in docker_confirm.sh and docker-publish.yml.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fc483ec62
| echo = False, | ||
| name = "vLLM STDERR", | ||
| ready_regex = None, | ||
| # vLLM >= 0.19 emits "Starting vLLM API server ... on ..." (and |
Preserve explicit infinite vLLM startup waits
When callers pass timeout=None, the previous Event.wait(timeout=None) waited indefinitely, which is a useful escape hatch for large models or slow first-time downloads. This expression now converts None (and 0) back to 1200 seconds, so those runs are killed after 20 minutes even though the caller explicitly disabled the timeout; handle None as an unbounded deadline instead of falling back to the default.
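The distinction the comment asks for can be sketched with a sentinel default, so that "argument omitted" and "caller passed None" stop collapsing into the same value. All names here (`DEFAULT_TIMEOUT`, `_UNSET`, `resolve_timeout`, `wait_ready`) are hypothetical; only the 1200-second default and the `Event.wait` behavior come from the comment above.

```python
import threading

DEFAULT_TIMEOUT = 1200.0  # the 20-minute default mentioned in the review
_UNSET = object()         # sentinel: distinguishes "not passed" from None


def resolve_timeout(timeout=_UNSET):
    """Map the caller's argument to what Event.wait() should receive.

    omitted -> DEFAULT_TIMEOUT
    None    -> None (wait indefinitely)
    number  -> that number, including 0 (poll once)
    """
    if timeout is _UNSET:
        return DEFAULT_TIMEOUT
    return timeout


def wait_ready(event, timeout=_UNSET):
    """Block until the readiness event is set or the resolved timeout expires."""
    return event.wait(timeout=resolve_timeout(timeout))
```

The point of the sentinel is that a truthiness fallback such as `timeout or 1200` cannot tell None and 0 apart from "use the default", which is the regression described above.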
| v = requested_version() | ||
| if v and "transformers" not in sys.modules: | ||
| activate(v) |
Scope transformer pins to the current kernel
This hook reads one shared /tmp/unsloth_nb/requested_transformers marker for every Jupyter kernel in the container. If two notebooks run concurrently with different install-cell pins, whichever cell writes the marker last controls the other kernel's next pre-run hook, so that notebook can activate the wrong transformers sidecar before its model cell; make the marker per-kernel/notebook rather than global.
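A rough sketch of a per-kernel marker layout, assuming some stable kernel identifier is available to both the install-cell shim and the pre-run hook (how that identifier is obtained and shared is not shown in this thread and is left open). The helper names and the `kernels/<id>/` subdirectory are hypothetical; only the `unsloth_nb` directory and the `requested_transformers` file name come from the comment.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# On Linux this resolves to /tmp/unsloth_nb, the directory named in the review.
MARKER_ROOT = Path(tempfile.gettempdir()) / "unsloth_nb"


def marker_path(kernel_id: str, root: Path = MARKER_ROOT) -> Path:
    """Return a marker path private to one kernel (hypothetical layout)."""
    # Keep only characters that are safe in a single path component.
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", kernel_id) or "unknown"
    return Path(root) / "kernels" / safe / "requested_transformers"


def write_request(version: str, kernel_id: str, root: Path = MARKER_ROOT) -> Path:
    """Record the transformers pin requested by this kernel's install cell."""
    path = marker_path(kernel_id, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(version.strip() + "\n", encoding="utf-8")
    return path


def read_request(kernel_id: str, root: Path = MARKER_ROOT) -> Optional[str]:
    """Return this kernel's requested pin, or None if it never wrote one."""
    try:
        return marker_path(kernel_id, root).read_text(encoding="utf-8").strip() or None
    except FileNotFoundError:
        return None
```

With this layout two notebooks that pin different versions write to different files, so neither pre-run hook can pick up the other kernel's request.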
| if not changed: | ||
| return path, None, [] | ||
| try: | ||
| fd, tmp = tempfile.mkstemp(prefix = "unsloth-nb-req-", suffix = ".txt") |
Keep filtered requirement files beside the original
When a requirements file is changed because a protected package was dropped, the filtered copy is written under the default temp directory. For a valid file that also contains a relative nested include such as -r extras.txt, pip resolves that include relative to the requirements file it is currently reading, so after this rewrite it looks in /tmp instead of the notebook/project directory and the install fails; create the temporary file next to path or rewrite relative include paths.
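The first option (temporary file next to `path`) is a small change, since `tempfile.mkstemp` accepts a `dir` argument. A minimal sketch, with `write_filtered_beside` as a hypothetical helper name and the prefix/suffix copied from the quoted line:

```python
import os
import tempfile


def write_filtered_beside(path, filtered_lines):
    """Write the filtered requirements next to the original file.

    Keeping the copy in the same directory means relative includes such as
    "-r extras.txt" still resolve against the notebook/project directory.
    Returns the path of the temporary file; the caller removes it afterwards.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(
        prefix="unsloth-nb-req-", suffix=".txt", dir=directory
    )
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(filtered_lines) + "\n")
    return tmp
```

If the original directory is read-only (for example a read-only bind mount), `mkstemp` raises `OSError`; falling back to the default temp directory with rewritten include paths would cover that case.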
| && mkdir -p /root/.ipython/profile_default/startup \ | ||
| && cp /opt/unsloth-nb/unsloth_ipython_startup.py /root/.ipython/profile_default/startup/00-unsloth-nb.py \ |
Install the notebook startup hook outside root home
When users start the base image with --user (common with mounted workspaces to avoid root-owned files), IPython uses that user's home rather than /root, so this startup file is never loaded. In that context UNSLOTH_NB_SHIM is not set and the PATH shim deliberately execs the real pip, letting notebook !pip/%pip cells mutate the baked torch/transformers stack; install the hook in a system-wide IPython/Jupyter startup location or otherwise enable it per kernel.
| first = t.lstrip().split("\n", 1)[0].strip().lower() | ||
| return first.startswith("%%capture") or first.startswith("%%bash") |
Hash substantive captured or bash notebook cells
This treats every %%capture or %%bash cell as boilerplate, even when the cell is real tutorial logic such as data prep, launches, or captured training code. For an untouched notebook where upstream changes one of those cells, both content signatures drop the changed cell and middle_unchanged can return SAME, so the boot refresh skips a substantive upstream fix; only exclude these magics after confirming they are the generated install/setup cell.
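One possible tightening, sketched under an assumption: treat a `%%capture` / `%%bash` cell as boilerplate only when its body actually contains a pip install line. That heuristic is a stand-in for "the generated install/setup cell"; the real signature of the generated cell is not shown in this thread, and `is_install_boilerplate` is a hypothetical name.

```python
import re

# A line that runs pip install, optionally via "!" / "%" and optionally via uv.
_INSTALL_LINE = re.compile(r"^\s*[!%]?\s*(?:uv\s+)?pip\s+install\b")


def is_install_boilerplate(cell_source: str) -> bool:
    """True only for %%capture / %%bash cells whose body installs packages."""
    lines = cell_source.lstrip().split("\n")
    first = lines[0].strip().lower()
    if not (first.startswith("%%capture") or first.startswith("%%bash")):
        return False
    # Substantive captured cells (data prep, training) have no install line
    # and therefore stay in the content signature.
    return any(_INSTALL_LINE.match(line) for line in lines[1:])
```

Cells that fail this check keep contributing to the content signature, so an upstream fix to a captured training cell is no longer reported as SAME.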
| # | ||
| # Persistence: the swap lands in the container's writable layer (survives | ||
| # docker restart). To keep it across a full recreate, mount the prebuilt dir on | ||
| # a named volume: -v unsloth_llama:/opt/unsloth/llama.cpp |
Do not mount over the baked llama.cpp bundle
If users follow this persistence example with a fresh named volume, Docker masks /opt/unsloth/llama.cpp, including the baked binaries and converter that GGUF export and the Studio symlink rely on. The image then boots with an empty llama.cpp install and GGUF tooling fails until the update command seeds it; recommend seeding the volume first or mounting a parent/data path instead.


Summary
Adds a Docker setup for Unsloth that runs on any NVIDIA GPU host from Ampere through Blackwell (sm_80 through sm_120: A100, RTX 30/40, H100, B100/B200, RTX 50-series, RTX 6000 Pro Blackwell) and natively on aarch64 (GB10 / Grace, DGX Spark). Two images are published to docker.io/unsloth/unsloth:
- Base image (docker/Dockerfile, tag :base): the full training stack -- torch 2.10.0+cu128, Unsloth + unsloth_zoo, TRL / PEFT / accelerate, bitsandbytes, triton, xformers (amd64), vLLM -- plus llama.cpp for GGUF tooling, JupyterLab, and the baked Unsloth notebooks. Run it headless for training, unsloth-run <notebook|url>, or jupyter lab.
- Full image (docker/Dockerfile.studio, tag :latest): layers Unsloth Studio on top of the base image and runs the production service trio under supervisord -- Studio on 8000, JupyterLab on 8888, key-only sshd on 22 -- plus an optional Cloudflare tunnel for JupyterLab.
The build itself requires no GPU at all: it runs on free GitHub-hosted runners, on a developer laptop without an NVIDIA card, or on any datacenter GPU. All produce byte-identical images.
Multi-arch. amd64 and arm64 are built in parallel on native GitHub runners (ubuntu-latest + ubuntu-24.04-arm, both free on public repos since Aug 2025), pushed by digest, then merged into a single multi-platform manifest, so docker pull selects the right child automatically. Native arm64 is ~3x faster than QEMU and runs on DGX Spark / Grace with CUDA working as normal (no runtime emulation).
Why the build does not need a GPU
There are four places where a naive Docker build silently couples to the build-host GPU. The Dockerfile breaks each one:
1. Wheel selection. The README install line uv pip install unsloth --torch-backend=auto introspects the build host's driver. This Dockerfile pins torch==2.10.0 against --extra-index-url https://download.pytorch.org/whl/cu128 explicitly. No --torch-backend=auto, no install.sh.
2. Dep resolution order. Splitting installs into multiple pip install calls lets bitsandbytes 0.49.x's transitive cuda-toolkit==13.0.2 dep silently upgrade torch 2.10.0+cu128 -> 2.12.0+cu130 in a later pass, leaving cu128 xformers stranded. This Dockerfile collapses everything into a single uv pip install with --index-strategy unsafe-best-match so the resolver sees all constraints at once.
3. Build-time verification. torch.cuda.get_arch_list() returns [] when no GPU is visible. The Dockerfile uses the raw C++ accessor torch._C._cuda_getArchFlags(), which reads compiled wheel metadata directly ('sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120' on amd64; 'sm_80 sm_90 sm_100 sm_120' on aarch64). Required packages are checked via importlib.metadata.version() instead of importing them, because import unsloth triggers torch.cuda.get_device_properties(0), which can't be satisfied on a GPU-less host. Import-time correctness is exercised at deploy time by smoke_test.py.
4. Compiled-kernel cache. If anything imports unsloth during the build, Triton JITs kernels keyed to the build host's compute capability and bakes them into unsloth_compiled_cache/. UNSLOTH_COMPILE_DISABLE=1 and UNSLOTH_COMPILE_OVERWRITE=0 prevent this. The deploy GPU produces its own cache on first use.
The underlying reason all of this works is that cu128 PyTorch wheels are already fat binaries, cross-compiled upstream to include SASS for every architecture from sm_70 through sm_120 (and sm_80/90/100/120 on aarch64). The build host's GPU was never needed for the binary content, only by code that pretended to need it.
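The build-time verification in point 3 can be sketched roughly as follows. This is not the check the Dockerfile actually runs; the helper names and the `REQUIRED_ARCHS` set (the architectures listed for both amd64 and aarch64 above) are assumptions for illustration. Only `importlib.metadata.version()` and `torch._C._cuda_getArchFlags()` come from the description.

```python
from importlib import metadata

# Architectures the description lists for both amd64 and aarch64 wheels.
REQUIRED_ARCHS = {"sm_80", "sm_90", "sm_100", "sm_120"}


def missing_packages(names):
    """Return the distributions that are not installed, without importing them."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # reads dist-info only, no package import
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def missing_archs(arch_flags, required=REQUIRED_ARCHS):
    """Compare a space-separated arch-flag string against the required set."""
    return sorted(set(required) - set(arch_flags.split()))


def arch_flags_from_torch():
    """Read the compiled arch flags from the installed wheel.

    Importing torch itself does not need a GPU; the accessor below is the
    one named in the description and returns a space-separated string.
    """
    import torch

    return torch._C._cuda_getArchFlags()
```

In a build step, a non-empty result from either `missing_packages([...])` or `missing_archs(arch_flags_from_torch())` would be the signal to fail the build.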
What is in the images
Base image (:base)
- torch 2.10.0+cu128 (held against the cu cascade), with TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.9;9.0;10.0;10.3;12.0+PTX" so any source build inside the container honours the full Ampere-through-Blackwell range. The +PTX suffix on the top arch gives forward-compat JIT-PTX for future consumer Blackwell SKUs.
- unslothai/notebooks baked in as a read-only template and synced to /workspace/unsloth-notebooks on boot (edit-preserving refresh from GitHub when reachable).
- !pip/!uv shim that keeps the core GPU stack pinned while letting a notebook install its own extras, per-notebook transformers-version sidecars, and unsloth-run for headless execution of any notebook or URL.
- [huggingface] extra (no cu128 aarch64 xformers/vLLM wheel) and installs cuda-nvrtc/nvcc from NVIDIA's sbsa repo.
Full image (:latest)
arm64 / aarch64
A native arm64 child is published in the same manifest. CI covers it two ways: docker-build-arm64-native.yml builds + smoke-checks on the free ubuntu-24.04-arm runner (the path DGX Spark / Grace users take), and docker-build-arm64-qemu.yml cross-builds under QEMU as a fallback signal and to validate the documented setup_qemu.sh recipe.
Validation
Base image validated end-to-end on a B200 (sm_100) host. With CUDA_VISIBLE_DEVICES="" (simulating the GPU-less CI runner), the resolved stack is torch 2.10.0+cu128, triton, xformers (cu128 wheel from the PyTorch index), bitsandbytes, unsloth + unsloth_zoo, transformers, trl, peft, accelerate, with arch flags ['sm_70','sm_75','sm_80','sm_86','sm_90','sm_100','sm_120']. Runtime path on B200 GPU 0: smoke_test.py imports unsloth, loads Llama-3.2-1B-Instruct-bnb-4bit in 4-bit, and completes LoRA steps with loss decreasing. The full image was launched on the same host and Studio, JupyterLab, and sshd all came up under supervisord.
Files
- docker/Dockerfile
- docker/Dockerfile.studio
- docker/entrypoint.sh, docker/supervisord.conf, docker/studio_launch.sh
- docker/unsloth_sync_notebooks.sh, docker/unsloth_nb_compat.py, docker/unsloth_nb_content_sig.py
- docker/unsloth_pip_shim.py, docker/unsloth_run.py, docker/unsloth_ipython_startup.py: !pip/!uv shim, headless runner, IPython startup hook
- docker/unsloth_studio_update.sh, docker/unsloth_llama_update.sh, docker/unsloth_jupyter_tunnel.sh, docker/fetch_llama_prebuilt.py
- docker/run.sh, docker/build.sh, docker/test_locally.sh, docker/docker_confirm.sh, docker/docker_confirm.ps1, docker/setup_qemu.sh
- docker/freeze.sh, docker/hf_pull.sh, docker/hf_push.sh, docker/smoke_test.py, docker/.dockerignore
- .github/workflows/docker-publish.yml
Design notes
- nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 for the build stage, -cudnn-runtime-ubuntu24.04 for deploy. No nvcc in the published amd64 image.
- /opt/unsloth-venv/requirements.lock.txt inside the image and can be extracted with docker/freeze.sh for a fully-pinned rebuild later.
Test plan
- DOCKERHUB_USERNAME and DOCKERHUB_TOKEN
- unsloth/unsloth:latest on an RTX 50-series host and confirm Studio (8000) + JupyterLab (8888) come up
- vars.HAS_GPU_RUNNER=true once a self-hosted GPU runner is registered so the post-publish smoke test exercises real sm_120 paths