Add Unsloth Docker images (base + Studio) for any NVIDIA GPU host, Ampere through Blackwell by danielhanchen · Pull Request #5748 · unslothai/unsloth

danielhanchen · 2026-05-24T06:54:11Z

Summary

Adds a Docker setup for Unsloth that runs on any NVIDIA GPU host from Ampere through Blackwell (sm_80 through sm_120: A100, RTX 30/40, H100, B100/B200, RTX 50-series, RTX 6000 Pro Blackwell) and natively on aarch64 (GB10 / Grace, DGX Spark). Two images are published to docker.io/unsloth/unsloth:

Lean base image (docker/Dockerfile, tag :base): the full training stack -- torch 2.10.0+cu128, Unsloth + unsloth_zoo, TRL / PEFT / accelerate, bitsandbytes, triton, xformers (amd64), vLLM -- plus llama.cpp for GGUF tooling, JupyterLab, and the baked Unsloth notebooks. Run it headless for training, unsloth-run <notebook|url>, or jupyter lab.
Full image (docker/Dockerfile.studio, tag :latest): layers Unsloth Studio on top of the base image and runs the production service trio under supervisord -- Studio on 8000, JupyterLab on 8888, key-only sshd on 22 -- plus an optional Cloudflare tunnel for JupyterLab.

The build itself requires no GPU at all: it runs on free GitHub-hosted runners, on a developer laptop without an NVIDIA card, or on any datacenter GPU host. All three produce byte-identical images.

Multi-arch. amd64 and arm64 are built in parallel on native GitHub runners (ubuntu-latest + ubuntu-24.04-arm, both free on public repos since Aug 2025), pushed by digest, then merged into a single multi-platform manifest, so docker pull selects the right child automatically. Native arm64 is ~3x faster than QEMU and runs on DGX Spark / Grace with CUDA working as normal (no runtime emulation).

Why the build does not need a GPU

There are four places where a naive Docker build silently couples to the build-host GPU. The Dockerfile breaks each one:

1. Wheel selection. The README install line uv pip install unsloth --torch-backend=auto introspects the build host's driver. This Dockerfile pins torch==2.10.0 against --extra-index-url https://download.pytorch.org/whl/cu128 explicitly. No --torch-backend=auto, no install.sh.

2. Dep resolution order. Splitting installs into multiple pip install calls lets bitsandbytes 0.49.x's transitive cuda-toolkit==13.0.2 dep silently upgrade torch 2.10.0+cu128 -> 2.12.0+cu130 in a later pass, leaving cu128 xformers stranded. This Dockerfile collapses everything into a single uv pip install with --index-strategy unsafe-best-match so the resolver sees all constraints at once.

3. Build-time verification. torch.cuda.get_arch_list() returns [] when no GPU is visible. The Dockerfile uses the raw C++ accessor torch._C._cuda_getArchFlags(), which reads compiled wheel metadata directly ('sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120' on amd64; 'sm_80 sm_90 sm_100 sm_120' on aarch64). Required packages are checked via importlib.metadata.version() instead of importing them, because import unsloth triggers torch.cuda.get_device_properties(0), which can't be satisfied on a GPU-less host. Import-time correctness is exercised at deploy time by smoke_test.py.

4. Compiled-kernel cache. If anything imports unsloth during the build, Triton JITs kernels keyed to the build host's compute capability and bakes them into unsloth_compiled_cache/. UNSLOTH_COMPILE_DISABLE=1 and UNSLOTH_COMPILE_OVERWRITE=0 prevent this. The deploy GPU produces its own cache on first use.

The underlying reason all of this works is that cu128 PyTorch wheels are already fat binaries, cross-compiled upstream to include SASS for every architecture from sm_70 through sm_120 (and sm_80/90/100/120 on aarch64). The build host's GPU was never needed for the binary content, only by code that pretended to need it.
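The metadata-only verification from step 3 can be sketched roughly as follows. This is a minimal illustration, not the Dockerfile's actual check: the helper names `missing_packages`, `parse_arch_flags`, and `require_arches` are hypothetical, and the arch string is passed in as an argument so the sketch runs without torch installed (in the real build it would come from `torch._C._cuda_getArchFlags()`).

```python
from importlib import metadata


def missing_packages(names):
    """Return the distributions that are not installed, without importing them."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # reads dist-info only; no module import happens
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def parse_arch_flags(flags):
    """Split a space-separated arch string such as 'sm_80 sm_90' into a list."""
    return flags.split()


def require_arches(flags, required):
    """Return the required arches that are absent from the compiled arch string."""
    present = set(parse_arch_flags(flags))
    return [arch for arch in required if arch not in present]
```

A build step could fail when either helper returns a non-empty list, which keeps the check independent of whether a GPU is visible.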

What is in the images

Base image (:base)

Training stack pinned at torch 2.10.0+cu128 (held against the cu cascade), with TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.9;9.0;10.0;10.3;12.0+PTX" so any source build inside the container honours the full Ampere-through-Blackwell range. The +PTX suffix on the top arch gives forward-compat JIT-PTX for future consumer Blackwell SKUs.
vLLM and llama.cpp (prebuilt) for inference and GGUF tooling.
The curated notebook extras (JupyterLab, ipywidgets, audio/TTS codecs, etc.) pinned to tested versions for reproducible rebuilds.
unslothai/notebooks baked in as a read-only template and synced to /workspace/unsloth-notebooks on boot (edit-preserving refresh from GitHub when reachable).
Notebook ergonomics: a !pip/!uv shim that keeps the core GPU stack pinned while letting a notebook install its own extras, per-notebook transformers-version sidecars, and unsloth-run for headless execution of any notebook or URL.
arm64 swaps to the [huggingface] extra (no cu128 aarch64 xformers/vLLM wheel) and installs cuda-nvrtc/nvcc from NVIDIA's sbsa repo.
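To make the TORCH_CUDA_ARCH_LIST value above easier to read, here is a small sketch that splits it into sm names and marks which entry carries the +PTX suffix. The function name `parse_arch_list` is hypothetical and this is not how nvcc or PyTorch parse the variable internally; it only illustrates the format.

```python
def parse_arch_list(value):
    """Parse a TORCH_CUDA_ARCH_LIST string into (sm_name, wants_ptx) pairs."""
    entries = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        wants_ptx = item.endswith("+PTX")
        version = item[: -len("+PTX")] if wants_ptx else item
        major, minor = version.split(".")
        entries.append((f"sm_{major}{minor}", wants_ptx))
    return entries


# Value quoted in the base-image description above.
ARCH_LIST = "7.5;8.0;8.6;8.9;9.0;10.0;10.3;12.0+PTX"
```

Only the last entry (12.0) requests PTX, which matches the note that the top arch provides the forward-compatible JIT path.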

Full image (:latest)

Everything in the base image, plus Unsloth Studio.
supervisord runs Studio (8000), JupyterLab (8888), an optional Cloudflare tunnel for JupyterLab, and key-only sshd (22). The studio venv dedups its CUDA libraries against the base venv to keep the image size down.

arm64 / aarch64

A native arm64 child is published in the same manifest. CI covers it two ways: docker-build-arm64-native.yml builds + smoke-checks on the free ubuntu-24.04-arm runner (the path DGX Spark / Grace users take), and docker-build-arm64-qemu.yml cross-builds under QEMU as a fallback signal and to validate the documented setup_qemu.sh recipe.

Validation

Base image validated end-to-end on a B200 (sm_100) host. With CUDA_VISIBLE_DEVICES="" (simulating the GPU-less CI runner), the resolved stack is torch 2.10.0+cu128, triton, xformers (cu128 wheel from the PyTorch index), bitsandbytes, unsloth + unsloth_zoo, transformers, trl, peft, accelerate, with arch flags ['sm_70','sm_75','sm_80','sm_86','sm_90','sm_100','sm_120']. Runtime path on B200 GPU 0: smoke_test.py imports unsloth, loads Llama-3.2-1B-Instruct-bnb-4bit in 4-bit, and completes LoRA steps with loss decreasing. The full image was launched on the same host and Studio, JupyterLab, and sshd all came up under supervisord.

Files

Path	Purpose
`docker/Dockerfile`	multi-stage cu128 base image, no GPU required (amd64 + arm64)
`docker/Dockerfile.studio`	full image: base + Studio + JupyterLab + sshd under supervisord
`docker/entrypoint.sh`, `docker/supervisord.conf`, `docker/studio_launch.sh`	boot + service orchestration
`docker/unsloth_sync_notebooks.sh`, `docker/unsloth_nb_compat.py`, `docker/unsloth_nb_content_sig.py`	baked-notebook sync + per-notebook transformers sidecars
`docker/unsloth_pip_shim.py`, `docker/unsloth_run.py`, `docker/unsloth_ipython_startup.py`	`!pip`/`!uv` shim, headless runner, IPython startup hook
`docker/unsloth_studio_update.sh`, `docker/unsloth_llama_update.sh`, `docker/unsloth_jupyter_tunnel.sh`, `docker/fetch_llama_prebuilt.py`	in-image updaters + llama.cpp fetch + Cloudflare tunnel
`docker/run.sh`, `docker/build.sh`, `docker/test_locally.sh`, `docker/docker_confirm.sh`, `docker/docker_confirm.ps1`, `docker/setup_qemu.sh`	local build / run / cross-OS confirm helpers
`docker/freeze.sh`, `docker/hf_pull.sh`, `docker/hf_push.sh`, `docker/smoke_test.py`, `docker/.dockerignore`	lockfile extract, HF helpers, runtime smoke test
`.github/workflows/docker-publish.yml`	multi-arch build + push of both images to Docker Hub

Design notes

Base images: nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 for the build stage, -cudnn-runtime-ubuntu24.04 for deploy. No nvcc in the published amd64 image.
A lockfile is emitted at /opt/unsloth-venv/requirements.lock.txt inside the image and can be extracted with docker/freeze.sh for a fully-pinned rebuild later.

Test plan

Set repo secrets DOCKERHUB_USERNAME and DOCKERHUB_TOKEN
Trigger the publish workflow (or merge to main) and confirm both images build on the native amd64 + arm64 runners in under ~90 min and merge into one manifest
Pull unsloth/unsloth:latest on an RTX 50-series host and confirm Studio (8000) + JupyterLab (8888) come up
Pull the same tag on a B200 host and confirm the image works with no rebuild
Pull on a DGX Spark / Grace (arm64) host and confirm the arm64 child runs natively
Optional: set vars.HAS_GPU_RUNNER=true once a self-hosted GPU runner is registered so the post-publish smoke test exercises real sm_120 paths

Adds a multi-stage Dockerfile producing an image that works on Ampere through Blackwell (sm_80 through sm_120: A100, RTX 30/40, H100, B100/B200, RTX 50-series, RTX 6000 Pro Blackwell). The build itself requires no GPU at all and runs on a free GitHub-hosted ubuntu-latest runner.

How the GPU-less build works:

1. cu128 PyTorch wheels are fat binaries. torch._C._cuda_getArchFlags() returns 'sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120' regardless of which GPU compiled the image, because the wheels are cross-compiled upstream by the PyTorch team.
2. All deps resolve in a single uv pip install pass with explicit pins (torch==2.10.0, --extra-index-url cu128, no --torch-backend=auto, no install.sh). This prevents the silent cu cascade where bitsandbytes' transitive cuda-toolkit==13 dep upgrades torch to 2.12+cu130 in a later resolver pass, leaving xformers and other cu128 wheels stranded.
3. Build-time verification uses package metadata (importlib.metadata.version) and the raw torch._C._cuda_getArchFlags() accessor. We deliberately avoid import unsloth at build time because unsloth.__init__ calls torch.cuda.get_device_properties(0), which requires an actual CUDA device and is not bypassable. Import-time correctness is exercised at deploy time by smoke_test.py with --gpus all.
4. UNSLOTH_COMPILE_DISABLE=1 and CUDA_VISIBLE_DEVICES="" during the build stage prevent any code path from JIT-compiling kernels for the build host's compute capability and baking the resulting cache into the image. The deploy GPU produces its own cache on first use.

Other notes:

- --index-strategy unsafe-best-match is needed because the PyTorch wheel index serves an old requests==2.28.1 that conflicts with the requests>=2.32.2 required by datasets, which the default first-index-wins strategy rejects.
- Extra is cu128-ampere-torch2100 (ampere precedes the torch version in the pyproject ordering).
- No flash-attn in the base image. FA3 is hard-refused on Blackwell upstream and unsloth gracefully falls back to xformers + SDPA. Users on Ampere / Ada / Hopper who want FA2 can pip install flash-attn on top.
- Two stages: nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 for the build, -cudnn-runtime for the deploy image. No nvcc in the published image.
- A lockfile is emitted at /opt/unsloth-venv/requirements.lock.txt inside the image and can be extracted with docker/freeze.sh for byte-identical rebuilds even after PyPI moves on.

CI workflow .github/workflows/docker-publish.yml:

- Builds on ubuntu-latest on every push to main, every tag, weekly via cron, and manually via workflow_dispatch. Pushes to docker.io/unsloth/unsloth with cache via type=gha.
- Optional smoke-test job runs on a self-hosted GPU runner if vars.HAS_GPU_RUNNER is set; skipped otherwise. End-to-end verification on sm_120 hardware is a nice-to-have, not a publish blocker.

Validation:

- Install path validated on a B200 host with CUDA_VISIBLE_DEVICES="" set (simulating the GPU-less CI runner): torch 2.10.0+cu128 holds, xformers 0.0.34, bitsandbytes 0.49.2, triton 3.6.0, transformers 5.5.0, trl 0.24.0, peft 0.19.1, accelerate 1.13.0. Arch flags include sm_100 and sm_120.
- Runtime path validated end-to-end on B200: smoke_test.py imports unsloth, loads Llama-3.2-1B-Instruct-bnb-4bit in 4-bit, completes 5 LoRA steps with loss decreasing 4.11 -> 3.75. xformers fallback active as designed.

Files:

- docker/Dockerfile: multi-stage cu128 build
- docker/build.sh: local build wrapper
- docker/freeze.sh: extract lockfile from a built image
- docker/smoke_test.py: runtime verification, run with --gpus all
- docker/.dockerignore
- .github/workflows/docker-publish.yml


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6d92160f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T06:56:05Z

+          docker pull ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
+          docker run --rm --gpus all \
+            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest \


Smoke-test the image built in this run, not latest

The smoke-test job always pulls :latest, but this workflow also runs on tag pushes where latest is not guaranteed to be among the tags produced by metadata-action (it is only enabled on the default branch in this workflow). In that case, the smoke test can validate an older image and miss regressions in the freshly built tag from the current run.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-24T06:56:05Z

+            UNSLOTH_REF=${{ github.event.inputs.unsloth_ref || 'main' }}
+            UNSLOTH_ZOO_REF=${{ github.event.inputs.unsloth_zoo_ref || 'main' }}


Build from triggering ref instead of hardcoding main

For non-workflow_dispatch events (including tag pushes), github.event.inputs.* is unset, so these build args always resolve to main. That means images produced for v* tags can contain unsloth and unsloth-zoo code from main rather than the release ref that triggered the run, which breaks release correctness and reproducibility.


gemini-code-assist

Code Review

This pull request introduces a Dockerized environment for Unsloth and unsloth-zoo, specifically optimized for NVIDIA Blackwell GPUs (sm_100 and sm_120). The changes include a multi-stage Dockerfile, build and freeze scripts, and a comprehensive smoke test to verify GPU compatibility and training functionality. Review feedback suggests optimizing the Dockerfile by removing a redundant installation of the uv tool, correcting a version mismatch for torchaudio to ensure consistency with the PyTorch stack, and relocating cache directories outside of the workspace to prevent issues when mounting host volumes at runtime.

gemini-code-assist · 2026-05-24T06:56:24Z

+RUN ${VENV}/bin/pip install uv \
+ && ${VENV}/bin/uv pip install \
+        --python ${VENV}/bin/python \


The uv tool is already installed at the system level in line 63. Installing it again inside the virtual environment at line 99 is redundant. Using the system-wide uv binary to install packages into the venv is more efficient and avoids unnecessary layers.

RUN uv pip install \
        --python ${VENV}/bin/python \

gemini-code-assist · 2026-05-24T06:56:24Z

+        --python ${VENV}/bin/python \
+        --index-strategy unsafe-best-match \
+        --extra-index-url https://download.pytorch.org/whl/cu128 \
+        "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.11.0" \


There appears to be a version mismatch for torchaudio. While torch is pinned to 2.10.0 and torchvision to 0.25.0 (which correctly follows the standard 0.(Y+15) mapping for Torch 2.10), torchaudio is set to 2.11.0. Typically, PyTorch and Torchaudio versions are released in sync (e.g., Torch 2.6.0 with Torchaudio 2.6.0). Using 2.10.0 ensures consistency across the stack.

"torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.10.0" \

gemini-code-assist · 2026-05-24T06:56:24Z

+    HF_HOME=/workspace/.cache/huggingface \
+    TRITON_CACHE_DIR=/workspace/.cache/triton \


Setting HF_HOME and TRITON_CACHE_DIR to subdirectories of /workspace (the WORKDIR) can lead to issues when users mount a host directory to /workspace at runtime. The mount will obscure the directories created during the build, forcing the application to recreate them at runtime, which can cause permission issues or redundant downloads. Moving these caches to a location outside of the workspace, such as /opt/cache, avoids these issues.

    HF_HOME=/opt/cache/huggingface \
    TRITON_CACHE_DIR=/opt/cache/triton \

gemini-code-assist · 2026-05-24T06:56:24Z

+COPY --from=builder /opt/unsloth-venv /opt/unsloth-venv
+
+WORKDIR /workspace
+RUN mkdir -p ${HF_HOME} ${TRITON_CACHE_DIR}


When moving the cache directories outside of /workspace, creating them with broad permissions (e.g., 777) ensures that non-root users can write to the cache at runtime without encountering permission errors.

RUN mkdir -p ${HF_HOME} ${TRITON_CACHE_DIR} && chmod -R 777 /opt/cache

When someone launches the unsloth container, the common failure modes are not unsloth bugs -- they're Docker / nvidia-container-toolkit / driver issues that surface as cryptic CUDA errors deep in torch. The entrypoint catches the three that cover ~95% of "it doesn't work" reports up front:

1. nvidia-smi inside the container sees no GPU -> user forgot --gpus all, or host is missing nvidia-container-toolkit -> entrypoint prints the exact docker run flag and the toolkit install URL
2. nvidia-smi works but torch.cuda.is_available() is False -> host driver is older than CUDA 12.8 supports -> entrypoint prints the minimum driver version per architecture
3. compute capability < sm_80 -> entrypoint prints the supported architecture table and exits

Each check fails with a clear, actionable message rather than a stack trace. Set UNSLOTH_SKIP_GPU_CHECK=1 to bypass (for docs builds, offline tooling, CI).

run.sh wraps `docker run` with the flags people most often forget:

    --gpus all            (without it, the new entrypoint refuses to start)
    --ipc=host            (DataLoader workers need >64MB shm)
    --ulimit memlock=-1   (NCCL + CUDA pinned host buffers)
    --ulimit stack=64MB   (some torch kernels OOM the default 8MB stack)

Plus it mounts the host HF cache + Triton JIT cache so model downloads and compiled kernels persist across container runs, and forwards HF_TOKEN / WANDB_API_KEY / UNSLOTH_LICENSE only when they are set on the host.

Usage:

    bash docker/run.sh                                   # interactive python REPL
    bash docker/run.sh bash                              # shell in container
    bash docker/run.sh python /workspace/smoke_test.py
    bash docker/run.sh python /workspace/host/train.py   # $PWD mounted at /workspace/host

Verified locally:

- No GPU visible: entrypoint refuses with driver-version message, exit 1
- B200 sm_100 visible: entrypoint prints GPU banner, exits cleanly into the user command (rc=0)
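The "forward only when set" behaviour described for run.sh can be sketched in Python as below. This is an illustration of the idea rather than the script itself (run.sh is bash), and the function name `env_forward_args` is hypothetical.

```python
def env_forward_args(environ, names=("HF_TOKEN", "WANDB_API_KEY", "UNSLOTH_LICENSE")):
    """Build `-e NAME=value` arguments only for variables that are set and non-empty."""
    args = []
    for name in names:
        value = environ.get(name, "")
        if value:  # unset and empty variables are both skipped
            args += ["-e", f"{name}={value}"]
    return args
```

Passing `os.environ` as `environ` yields the argument list for the current shell; an unset or empty variable contributes nothing, so the container never sees a blank token.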

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 58693c4c73

chatgpt-codex-connector · 2026-05-24T07:09:22Z

+[[ -n "${HF_TOKEN:-}"        ]] && ENV_FORWARD+=(-e "HF_TOKEN=${HF_TOKEN}")
+[[ -n "${WANDB_API_KEY:-}"   ]] && ENV_FORWARD+=(-e "WANDB_API_KEY=${WANDB_API_KEY}")
+[[ -n "${UNSLOTH_LICENSE:-}" ]] && ENV_FORWARD+=(-e "UNSLOTH_LICENSE=${UNSLOTH_LICENSE}")
+
+set -x


Remove xtrace before invoking docker run with secrets

This script conditionally forwards HF_TOKEN, WANDB_API_KEY, and UNSLOTH_LICENSE, then enables set -x right before docker run, which prints the fully expanded command line. In any environment where those variables are set (local terminals with history/log capture or CI logs), their raw values are exposed in plaintext, creating an avoidable credential leak.
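One way to keep secret values out of any traced or logged command line is to pass only the variable name to `docker run -e`, which makes Docker copy the value from the caller's environment. The sketch below shows that idea in Python; it is one possible mitigation rather than the fix adopted in this PR, and the function name `env_passthrough_args` is hypothetical.

```python
SECRET_NAMES = ("HF_TOKEN", "WANDB_API_KEY", "UNSLOTH_LICENSE")


def env_passthrough_args(environ, names=SECRET_NAMES):
    """Emit `-e NAME` (no value) for each secret that is set on the host.

    With no `=value`, docker propagates the variable from the calling
    environment, so the secret never appears in the argument list.
    """
    args = []
    for name in names:
        if environ.get(name):
            args += ["-e", name]
    return args
```

Because the resulting argument list contains names only, printing it (for example under shell tracing) does not expose the token values.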


chatgpt-codex-connector · 2026-05-24T07:09:22Z

+    import transformers
+
+    print(f"transformers {transformers.__version__}")


Import unsloth before transformers in smoke test

The default execution path runs training (check_tiny_train) but check_imports imports transformers first, which contradicts the later requirement that Unsloth be imported first for patched training behavior. This means the smoke test is not validating the intended Unsloth training path and can produce misleading pass/fail results (e.g., patches not applied or warnings/fallback paths during the LoRA step).


Single bash script that runs the full validation flow against the image:

1. Host pre-flight: docker version, nvidia-smi, nvidia-container-toolkit runtime registered with docker.
2. Build the image (auto-detects the build context -- current dir, docker/ subdir, or clones the docker-blackwell-build branch into /tmp/unsloth-pr/).
3a. Smoke test: 5-step LoRA on Llama-3.2-1B-Instruct-bnb-4bit.
3b. Real workload: gpt-oss-20B fine-tuning notebook from unslothai/notebooks, patched to max_steps=10, with the three pre-train demo generations dropped for brevity. Auto-installs triton_kernels at the SHA the upstream notebook pins for MXFP4.

All output is teed to /tmp/unsloth-docker-test/ (or --log-dir).

Usage:

    bash docker/test_locally.sh                   # full run, ~15 min
    bash docker/test_locally.sh --skip-notebook   # blocks 1-3a only, ~3 min
    bash docker/test_locally.sh --skip-build      # reuse existing TAG
    TAG=my:tag HF_TOKEN=hf_xxx bash docker/test_locally.sh

Each block fails fast with the exact log path to paste back.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: acbb16c8a1

chatgpt-codex-connector · 2026-05-24T07:19:51Z

+        --python ${VENV}/bin/python \
+        --index-strategy unsafe-best-match \
+        --extra-index-url https://download.pytorch.org/whl/cu128 \
+        "torch==2.10.0" "torchvision==0.25.0" "torchaudio==2.11.0" \


Align torchaudio pin with torch 2.10 stack

The install set mixes torch==2.10.0 with torchaudio==2.11.0, but TorchAudio wheels are release-coupled to specific Torch versions; this can make dependency resolution fail or force an unexpected Torch change, which breaks the Docker build’s stated guarantee that Torch stays on 2.10.0. This is especially risky here because the same layer also installs many transitive deps from multiple indexes, so one incompatible pin can fail the image build on CI.
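The version pairing that the review comments rely on (torchvision at 0.(Y+15) and torchaudio matching torch for a torch 2.Y release) can be written down as a small helper. This is a rule-of-thumb sketch under that assumption, not an official PyTorch API, and the function name `expected_companions` is hypothetical; the authoritative source is the PyTorch release compatibility table.

```python
def expected_companions(torch_version):
    """Return the torchvision/torchaudio versions conventionally paired with torch 2.Y.Z.

    Assumes the convention cited in the review: torchvision is 0.(Y+15).Z and
    torchaudio shares torch's version number.
    """
    major, minor, patch = (int(part) for part in torch_version.split("."))
    if major != 2:
        raise ValueError("the Y+15 rule is only assumed for torch 2.x")
    return {
        "torchvision": f"0.{minor + 15}.{patch}",
        "torchaudio": f"{major}.{minor}.{patch}",
    }
```

For the pins under review, the helper returns torchvision 0.25.0 and torchaudio 2.10.0 for torch 2.10.0, which is the combination both reviewers suggest.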


chatgpt-codex-connector · 2026-05-24T07:19:51Z

+if ! command -v nvidia-smi >/dev/null 2>&1; then
+    err "nvidia-smi not found inside the container."
+    err "The CUDA runtime in this image is broken. Re-pull the image."
+    exit 1


Avoid hard-failing when nvidia-smi binary is unavailable

Startup currently exits before any CUDA check if nvidia-smi is missing, but there are valid GPU runtimes (for example compute-only capability profiles) where CUDA is usable while nvidia-smi/NVML tools are not mounted. In those environments the container will refuse to start even though torch.cuda.is_available() could succeed, causing false negatives in production deployments that intentionally limit driver capabilities.
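The behaviour this comment asks for, treating a missing nvidia-smi as a warning when CUDA itself is usable, can be sketched as a small decision function. This is an illustrative sketch, not the entrypoint's code (which is a shell script); the name `preflight` and the message wording are hypothetical.

```python
def preflight(has_nvidia_smi, cuda_available):
    """Return (ok, message); a missing nvidia-smi alone is not fatal."""
    if cuda_available:
        if not has_nvidia_smi:
            return True, "WARN: nvidia-smi not found, but CUDA is usable; continuing."
        return True, "OK: GPU visible."
    if not has_nvidia_smi:
        return False, "ERROR: no nvidia-smi and CUDA unavailable; was --gpus all passed?"
    return False, "ERROR: nvidia-smi works but CUDA is unavailable; host driver may be too old."
```

In a real entrypoint, `cuda_available` would come from `torch.cuda.is_available()` and `has_nvidia_smi` from a PATH lookup, so compute-only runtimes pass with a warning instead of exiting.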


The Dockerfile uses BuildKit-only features (the # syntax=docker/dockerfile:1.7 parser directive and RUN ... <<'PY' heredocs added in dockerfile 1.3+). The legacy builder rejects the --progress flag at the CLI level and would fail later at the heredocs anyway.

Detect docker buildx and use it when available (preserves --progress=plain output). Otherwise fall back to plain `docker build` with DOCKER_BUILDKIT=1 exported, which gets the BuildKit features without buildx's nicer formatting.

Reproduces the failure path seen on Docker 28.2.2 without buildx installed:

    unknown flag: --progress
    ERROR docker build exited 125

Docker 28 removed the legacy image builder entirely. Setting DOCKER_BUILDKIT=1 no longer falls back to a builtin builder -- it delegates to buildx, which then errors out if buildx isn't installed:

    ERROR: BuildKit is enabled but the buildx component is missing or broken.

The Ubuntu docker.io package omits buildx by default, so users on that path hit this immediately. Detect missing buildx up front and print exact install commands for apt / dnf / manual binary instead of attempting a fallback that cannot work.

If the user is not in the 'docker' group, every docker command after the pre-flight returns "permission denied while trying to connect to the Docker daemon socket at /var/run/docker.sock". This used to surface as a confusing buildx failure mid-Block-2, but the actual problem is a host permissions issue that's settable up front. Detect by running 'docker info' and checking its exit code (not just grep on its output -- a permission failure prints to stderr and returns non-zero, so the old grep-based check was a silent skip). Also clarify the nvidia-runtime WARN: on Docker 28+ with CDI mode this is a false positive most of the time. The real GPU-attach test is the smoke run in Block 3a, where the container entrypoint catches missing GPUs with an actionable message.
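The exit-code-based check described above can be sketched like this. It is a Python illustration of the logic, not the bash in test_locally.sh; the names `run_docker_info` and `classify_docker_info` and the returned labels are hypothetical.

```python
import subprocess


def run_docker_info():
    """Run `docker info` and return (returncode, stderr); 127 if docker is absent."""
    try:
        proc = subprocess.run(["docker", "info"], capture_output=True, text=True)
    except FileNotFoundError:
        return 127, "docker: command not found"
    return proc.returncode, proc.stderr


def classify_docker_info(returncode, stderr):
    """Decide from the exit code first, then use stderr only to refine the message."""
    if returncode == 0:
        return "ok"
    if "permission denied" in stderr.lower():
        return "permission"  # e.g. user is not in the docker group
    return "unreachable"
```

Checking the return code first is the point of the fix: a permission failure writes to stderr and exits non-zero, so a grep over stdout alone would miss it.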

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 23a5b43180

chatgpt-codex-connector · 2026-05-24T07:55:00Z

+cd /workspace/host
+
+echo "=== install triton_kernels (MXFP4 support for unsloth/gpt-oss-20b) ==="
+pip install -q 'git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels' 2>&1 | tail -5


Propagate pip install failures through the tail pipeline

The generated run_notebook.sh script runs with set -e but not pipefail, and this line pipes pip install into tail -5; in Bash that means the pipeline exits with tail's status, so a failed pip install can be treated as success and the script continues until later steps fail with misleading errors (for example missing triton_kernels imports). This causes false-positive notebook validation and makes debugging CI/local failures much harder whenever installation fails (network hiccups, dependency conflicts, or bad commit hash).
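The exit-status behaviour described in this comment can be reproduced with a short experiment, where `false` stands in for a failing `pip install`. This is a demonstration sketch that assumes bash is on PATH; the function name `pipeline_exit_code` is hypothetical.

```python
import shutil
import subprocess


def pipeline_exit_code(pipefail):
    """Run `false | tail -n 5` in bash and return the pipeline's exit status.

    Returns None when bash is not available on this machine.
    """
    if shutil.which("bash") is None:
        return None
    prefix = "set -o pipefail; " if pipefail else ""
    proc = subprocess.run(["bash", "-c", prefix + "false | tail -n 5"], capture_output=True)
    return proc.returncode
```

Without pipefail the pipeline reports the status of `tail`, so `set -e` never sees the failure; with pipefail the failing left-hand command determines the status.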


Ubuntu 24.04 (noble) marks the system Python interpreter as externally-managed per PEP 668, so:

    curl get-pip.py | python
    python -m pip install -U pip uv

fails inside the builder image with:

    error: externally-managed-environment
    This environment is externally managed

The system-level pip and uv were never used: the very next RUN creates the venv at /opt/unsloth-venv, which bootstraps its own pip via the ensurepip module (provided by the python3.12-venv apt package). uv is then installed INTO the venv with the venv's pip, and used from there.

Drop the two system-pip bootstrap lines. The venv path is unchanged.

Reproduces on any Docker build of the unsloth-blackwell image against a noble base image (which our nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04 is).

… / peft

unsloth_zoo/__init__.py guards against being imported standalone:

    if "UNSLOTH_IS_PRESENT" not in os.environ:
        raise ImportError("Please install Unsloth via `pip install unsloth`!")

The env var is set by unsloth/__init__.py at import time, so importing unsloth must happen first. The old check_imports() imported xformers, bnb, transformers, trl, peft, then unsloth_zoo -- which fired the guard because unsloth had not been imported yet.

Reorder check_imports() to import unsloth (and unsloth_zoo) first, then the rest. check_unsloth_import() becomes a thin re-import to keep the "FastLanguageModel reachable" banner in the output.

Same fix the unsloth README has been recommending for years: "import unsloth at the top of your file, before transformers/trl/peft."
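The ordering constraint can be simulated without installing either package. The sketch below replays a list of module names against a fake environment dictionary; `zoo_import_guard` and `simulate_import_order` are hypothetical names, and the value "1" stored for UNSLOTH_IS_PRESENT is an assumption (the guard only checks that the key exists).

```python
def zoo_import_guard(environ):
    """Mimic the standalone-import guard quoted from unsloth_zoo/__init__.py."""
    if "UNSLOTH_IS_PRESENT" not in environ:
        raise ImportError("Please install Unsloth via `pip install unsloth`!")


def simulate_import_order(order):
    """Replay module names in order; raise ImportError if unsloth_zoo comes first."""
    environ = {}  # stand-in for os.environ, so the real environment is untouched
    for name in order:
        if name == "unsloth":
            environ["UNSLOTH_IS_PRESENT"] = "1"  # assumed value
        elif name == "unsloth_zoo":
            zoo_import_guard(environ)
    return True
```

Replaying the old order (transformers, trl, peft, then unsloth_zoo) raises ImportError, while the reordered list starting with unsloth passes.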

Triton's nvidia backend lazily JIT-compiles a small C extension (CudaUtils, in triton/backends/nvidia/driver.py) on first GPU access. Without a C compiler and Python headers in the runtime image, the very first forward pass of any Unsloth model dies with:

    RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

The builder stage has build-essential and python3.12-dev so this worked during the build's verification step (no GPU = no Triton kernel call = no C extension build). But the runtime stage stripped those out for size, so the failure only surfaces when a real user runs training inside the container.

Add gcc + g++ + python3.12-dev to the runtime stage. Increases the runtime image by ~250MB, which is the cost of letting Triton JIT correctly. Pre-compiling CudaUtils at build time would need a real CUDA device (the constructor calls cuda runtime functions), so shipping the toolchain is the right trade-off.
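A runtime image could check for a usable C compiler before the first forward pass. The sketch below is a simplified preflight, not Triton's own lookup code: the function name `find_c_compiler` and the candidate order (CC, then gcc, clang, cc) are assumptions made for illustration.

```python
import os
import shutil


def find_c_compiler(environ=None, which=shutil.which):
    """Return a compiler path or name if one is discoverable, else None.

    The CC environment variable takes precedence; otherwise common compiler
    names are looked up on PATH. `which` is injectable for testing.
    """
    environ = os.environ if environ is None else environ
    cc = environ.get("CC")
    if cc:
        return cc
    for candidate in ("gcc", "clang", "cc"):
        path = which(candidate)
        if path:
            return path
    return None
```

If this returns None inside the container, the "Failed to find C compiler" error quoted above is the likely outcome on first GPU use.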

danielhanchen · 2026-05-24T08:27:04Z

Smoke-test validation on a fresh deploy host (AWS B200, not the build host)

End-to-end validated docker/test_locally.sh --skip-notebook on an AWS EC2 instance with 8x B200, driver 590.48.01, Docker 28.2.2 (Ubuntu docker.io + docker-buildx 0.30.1). The build host was a separate GCP B200 — so this confirms cross-host reproducibility.

What was validated

Build: builder + runtime stages completed cleanly
Built-in arch check: arches: ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
Entrypoint pre-flight: Unsloth container: 8 GPU(s). Primary: NVIDIA B200 sm_100 bf16=True
Imports: unsloth → unsloth_zoo → xformers → bnb → transformers → trl → peft (no order violations)
Model load: unsloth/Llama-3.2-1B-Instruct-bnb-4bit loaded in 4-bit, 16 QKV + 16 O LoRA layers patched
Training: 5 LoRA steps completed, loss decreased monotonically

Unsloth's own banner inside the container:

NVIDIA B200. Num GPUs = 8. Max memory: 178.353 GB. Platform: Linux.
Torch: 2.10.0+cu128. CUDA: 10.0. CUDA Toolkit: 12.8. Triton: 3.6.0
Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]

Loss progression

Bit-for-bit identical to the internal validation on a GCP B200 (different host, same image):

step 0  loss=4.1108
step 1  loss=4.0617
step 2  loss=3.9786
step 3  loss=3.8695
step 4  loss=3.7511
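The "loss decreased monotonically" claim can be checked mechanically against the five values listed above. This is a small sketch for illustration; the helper name `strictly_decreasing` is hypothetical and is not part of smoke_test.py.

```python
# Loss values copied from the progression listed above.
LOSSES = [4.1108, 4.0617, 3.9786, 3.8695, 3.7511]


def strictly_decreasing(values):
    """True when every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))
```

A smoke test could assert `strictly_decreasing(losses)` after the LoRA steps, although on other hardware or seeds a short run may not decrease at every step.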

Bugs caught and fixed during validation

Each was a real defect surfaced only by running the image on a fresh host:

`docker build --progress` rejected by the legacy builder → require buildx (c6d9216, 56d2701)
Pre-flight missed `permission denied` on docker socket → check `docker info` exit code, print `usermod -aG docker` recipe (23a5b43)
PEP 668 rejected the system-wide `pip install -U pip uv` bootstrap → dropped, the venv self-bootstraps via ensurepip (fd55ed0)
`smoke_test.py` imported `unsloth_zoo` before `unsloth` → reorder (00cbc82)
Triton's lazily-built CudaUtils C extension needed gcc + python3-dev at runtime → add to runtime stage (1cdc5f1)

Full gpt-oss-20B fine-tuning notebook run still pending; will post follow-up.

…ia.com/cuda/gpus

TORCH_CUDA_ARCH_LIST now covers the full set of compute capabilities NVIDIA publishes on https://developer.nvidia.com/cuda/gpus for x86_64 hardware, from Turing onward:

    sm_75    Turing         T4, RTX 20-series, Quadro RTX
    sm_80    Ampere DC      A100, A30
    sm_86    Ampere         A40, RTX A6000, RTX 30-series
    sm_89    Ada            L4, L40, L40S, RTX 40-series
    sm_90    Hopper         H100, H200, GH200
    sm_100   Blackwell DC   B100, B200, GB200
    sm_103   Blackwell DC   B300, GB300
    sm_120   Blackwell      RTX 50-series, RTX PRO 6000 Blackwell
    sm_121   Blackwell      GB10 (DGX Spark)

with +PTX on the highest entry so future arch revisions can JIT.

Setting TORCH_CUDA_ARCH_LIST only affects nvcc invocations for any source build the user adds on top of this image (e.g. flash-attn, a custom CUDA op). The prebuilt cu128 wheels already include SASS for sm_70/75/80/86/90/100/120 (verified at build time via torch._C._cuda_getArchFlags()). Ada (sm_89), B300 (sm_103) and DGX Spark (sm_121) GPUs run via JIT-PTX from the nearest available arch.

Jetson archs (sm_87 Orin, sm_110 Thor) are intentionally NOT included -- they require aarch64 wheels and this image is linux/amd64 only.

Also lower the entrypoint's compute-capability gate from sm_80 to sm_75. Turing GPUs work, with the caveat that bfloat16 is unavailable; the entrypoint prints a NOTE in that case so Unsloth's fp16 fallback isn't a surprise.
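The lowered compute-capability gate can be sketched as a pure function. This is an illustration of the rule described in this commit message (reject below sm_75, allow Turing with a bfloat16 note), not the entrypoint's actual code; the name `capability_gate` and the message wording are hypothetical.

```python
def capability_gate(major, minor):
    """Return (allowed, note) for a GPU's compute capability."""
    if (major, minor) < (7, 5):
        return False, f"sm_{major}{minor} is below the sm_75 minimum"
    if major < 8:
        # Turing: supported, but without bfloat16
        return True, "NOTE: bfloat16 is unavailable on this GPU; expect the fp16 fallback"
    return True, ""
```

The tuple comparison is equivalent to `major < 7 or (major == 7 and minor < 5)`, so a V100 (7, 0) is rejected, a T4 (7, 5) is allowed with the note, and a B200 (10, 0) passes silently.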

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dde5170e7a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T08:34:37Z

+    ("sm_120", "Blackwell",    "RTX 50-series, RTX PRO 6000 Blackwell"),
+    ("sm_121", "Blackwell",    "GB10 (DGX Spark)"),
+)
+if major < 7 or (major == 7 and minor < 5):


Reject pre-Ampere GPUs in entrypoint check

The entrypoint claims to fail fast on GPUs older than Ampere, but this condition only rejects devices below sm_75, so Turing (sm_75) is allowed through. That contradicts both the script’s own header (Unsloth requires sm_80+) and smoke_test.py, which exits on any cap[0] < 8; on T4/RTX20 hosts the container starts successfully and then fails later during validation/workloads instead of giving the intended immediate, actionable error.

HF Hub does not act as a generic OCI registry for arbitrary Docker images -- the registry.hf.space endpoint only serves images that Spaces have built, not images pushed by `docker push`. So we cannot do `docker push huggingface.co/user/repo:tag` for an Unsloth image.

For cross-host testing where we want one canonical place to pull from (and Docker Hub credentials are not yet configured), wrap the manual flow into push/pull-shaped commands:

hf_push.sh: docker save | pigz | huggingface-cli upload
hf_pull.sh: huggingface-cli download | gunzip | docker load

This is an approximation, not real OCI semantics -- every push uploads the full ~4 GB blob, with no layer dedup and no manifest negotiation. Good for testing across A100 / H100 / RTX 6000 boxes; the real release should go through .github/workflows/docker-publish.yml to Docker Hub, which gets layer dedup, multi-arch manifest support, and standard `docker pull` UX for users.

Usage:

bash docker/hf_push.sh unsloth-blackwell:test danielhanchen/unsloth-blackwell-docker
bash docker/hf_pull.sh danielhanchen/unsloth-blackwell-docker unsloth-blackwell-test.tar.gz unsloth-blackwell:test

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bfb4b891a

chatgpt-codex-connector · 2026-05-24T09:28:43Z

+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
+          tags: |
+            type=raw,value=latest,enable={{is_default_branch}}


Prevent workflow_dispatch builds from overwriting latest

This tag rule publishes latest whenever the run is on the default branch, but workflow_dispatch also allows overriding UNSLOTH_REF/UNSLOTH_ZOO_REF to arbitrary refs. A manual test run from main can therefore push a non-release image as latest, which makes downstream users pull an unintended build. Gate latest to trusted release flows (or only when both refs are the defaults) to avoid accidental retagging.
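
The gating this comment asks for can be modelled as a small predicate. This is a sketch only: `should_tag_latest` and its parameters are hypothetical names, and the rule shown (default branch and no overridden refs) is one reading of the suggestion, not the workflow's actual expression.

```python
def should_tag_latest(ref: str, default_branch: str,
                      unsloth_ref: str, unsloth_zoo_ref: str) -> bool:
    """Return True only for default-branch runs that use the default refs."""
    on_default_branch = ref == f"refs/heads/{default_branch}"
    # An empty string stands for "dispatch input not set".
    uses_default_refs = unsloth_ref == "" and unsloth_zoo_ref == ""
    return on_default_branch and uses_default_refs


print(should_tag_latest("refs/heads/main", "main", "", ""))
print(should_tag_latest("refs/heads/main", "main", "my-test-branch", ""))
```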

chatgpt-codex-connector · 2026-05-24T09:28:43Z

+set -euo pipefail
+
+REPO="${1:?usage: hf_pull.sh <hf_repo> [<blob>] [<verify_tag>]}"
+BLOB="${2:-unsloth-blackwell.tar.gz}"


Match hf_pull default blob name to push naming

hf_push.sh uploads archives as <image-name>-<tag>.tar.gz, but hf_pull.sh defaults to unsloth-blackwell.tar.gz. If users run the documented short form bash docker/hf_pull.sh <hf_repo> after a normal push, the download will target a filename that was never uploaded and fail. Keeping both scripts on the same default naming convention avoids this broken default path.
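
A shared naming helper would keep the two defaults aligned. The sketch below derives `<image-name>-<tag>.tar.gz` from an image reference; `archive_name` is a hypothetical function, and stripping the repository prefix and defaulting the tag to `latest` are assumptions rather than documented behaviour of the scripts.

```python
def archive_name(image_ref: str) -> str:
    """Map an image reference such as 'name:tag' to '<image-name>-<tag>.tar.gz'."""
    name, sep, tag = image_ref.rpartition(":")
    # A colon followed by a slash is a registry port ('localhost:5000/img'), not a tag.
    if not sep or "/" in tag:
        name, tag = image_ref, "latest"
    # Keep only the final path component: 'org/img' -> 'img'.
    return f"{name.rsplit('/', 1)[-1]}-{tag}.tar.gz"


print(archive_name("unsloth-blackwell:test"))
```

With both scripts calling the same function (or the same shell equivalent), the short form of the pull command targets the file that the push actually uploaded.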

…face-cli`

In huggingface_hub >= 0.27 the `huggingface-cli` binary is deprecated and prints a "Use hf instead" notice then exits without doing the operation. The previous wrappers ran `huggingface-cli upload/download` silently, treated the deprecation exit as success, and uploaded nothing.

Detect the new `hf` binary first and use that. If only the legacy `huggingface-cli` is on PATH (older installs), fall back with a WARN so users know the failure mode if anything goes sideways.

Also: hf_pull.sh now asserts the downloaded file is non-empty (`test -s`) so we catch silent download failures before the `docker load` step.
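
The detection order described here can be sketched in Python with `shutil.which`. The function name `pick_hf_cli` and the warning text are hypothetical; the actual wrappers are shell scripts.

```python
import shutil
import sys
from typing import Callable, Optional


def pick_hf_cli(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Prefer the new `hf` binary, fall back to `huggingface-cli` with a warning."""
    if which("hf"):
        return "hf"
    if which("huggingface-cli"):
        print("WARN: only legacy huggingface-cli found; it may exit without "
              "uploading or downloading anything", file=sys.stderr)
        return "huggingface-cli"
    raise FileNotFoundError("neither `hf` nor `huggingface-cli` is on PATH")
```

Passing `which` as a parameter makes the lookup easy to exercise without touching the real PATH; calling `pick_hf_cli()` with no arguments consults the host environment.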

…t call

`jupyter nbconvert --to script nb.ipynb --output nb 2>/dev/null` was silently exiting 0 without producing the output file in some environments (likely because jupyter/jupyter_core wasn't on PATH or nbconvert's --output handling differed across versions). The 2>/dev/null hid the underlying error, and `set -e` did not catch the missing-output case because nbconvert itself returned 0.

Switch to a direct nbformat-based conversion:

    pip install -q nbformat
    python -c "import nbformat; nb=nbformat.read('nb.ipynb', as_version=4);
    code='\n\n'.join(c.source for c in nb.cells if c.cell_type == 'code')
    open('nb.py','w').write(code + '\n')"

Smaller dep set, no shell-out to a jupyter wrapper script, and an explicit `test -s nb.py` afterwards catches any silent failure before downstream steps try to read the file.

Reproduces the failure on RTX PRO 6000 Blackwell (sm_120, docker 29.2.1, ubuntu 24.04) where nbconvert's CLI silently no-op'd.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8344fa0a56

chatgpt-codex-connector · 2026-05-24T10:05:19Z

+    TRITON_CACHE_DIR=/workspace/.cache/triton \
+    # Keep the arch list visible at runtime in case the user source-builds anything
+    # extra inside the container (e.g. a custom CUDA op).
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;10.0;12.0+PTX"


Include sm_75 in runtime arch list

The runtime image advertises support from Turing onward (see entrypoint.sh supported list), but TORCH_CUDA_ARCH_LIST here starts at 8.0. Any CUDA extension compiled inside the running container (the exact use case this env var comment describes) will be built without sm_75, so on T4/RTX20 hosts those kernels can fail at runtime with “no kernel image is available” despite the container claiming that architecture is supported.
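
To see why a list that starts at 8.0 leaves Turing uncovered, the following sketch checks a device capability against a `TORCH_CUDA_ARCH_LIST` string. It assumes the usual CUDA compatibility rules (SASS runs on the same major version at an equal or higher minor; PTX can be JIT-compiled for any equal or newer capability); `parse_arch_list` and `is_covered` are hypothetical helpers, not part of PyTorch.

```python
def parse_arch_list(value: str) -> list[tuple[int, int, bool]]:
    """Parse an arch list string into (major, minor, has_ptx) entries."""
    entries = []
    for item in value.replace(" ", ";").split(";"):
        if not item:
            continue
        has_ptx = item.endswith("+PTX")
        major, minor = item.removesuffix("+PTX").split(".")
        entries.append((int(major), int(minor), has_ptx))
    return entries


def is_covered(cap: tuple[int, int], arch_list: str) -> bool:
    """True if a device with capability `cap` can run kernels built for `arch_list`."""
    for major, minor, has_ptx in parse_arch_list(arch_list):
        # SASS: same major version, equal or higher minor.
        if cap[0] == major and cap[1] >= minor:
            return True
        # PTX: forward-compatible with any equal or newer capability.
        if has_ptx and cap >= (major, minor):
            return True
    return False


RUNTIME_LIST = "8.0;8.6;8.9;9.0;10.0;12.0+PTX"
print("sm_75 covered:", is_covered((7, 5), RUNTIME_LIST))
print("sm_89 covered:", is_covered((8, 9), RUNTIME_LIST))
```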

The previous nbformat-based conversion dumped raw cell.source for every code cell. The gpt-oss-20B notebook's first cell uses Jupyter !shell magic to install dependencies:

    !pip install --upgrade -qqq uv
    !uv pip install -qqq ... \
        git+https://github.com/triton-lang/triton.git@0add68... ...

Dumped verbatim, the `@0add68...` token tripped the Python parser with "SyntaxError: invalid decimal literal" before training could even start.

The container already has unsloth, triton, transformers, etc. baked in, so we don't need the notebook's install cell. Skip any cell whose source contains pip/install markers, and comment out stray !cmd / %magic lines in any other cells. Then assert nb.py parses with ast.parse() before trying to run it -- catches conversion failures up front instead of at training time.

Reproduces on RTX PRO 6000 Blackwell (sm_120, fresh Docker 29.2.1 host) where the previous conversion produced an invalid nb.py.
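
A minimal sketch of the conversion described above follows. It is not the script from this PR: it reads the notebook with the standard-library `json` module instead of `nbformat` to stay self-contained, and `INSTALL_MARKERS` and `notebook_to_script` are hypothetical names with an assumed marker set.

```python
import ast
import json

# Assumed markers that identify a dependency-install cell.
INSTALL_MARKERS = ("pip install", "uv pip")


def notebook_to_script(nb_json: str) -> str:
    """Convert notebook JSON to a Python script, dropping install cells."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # Notebook JSON stores source as a string or a list of lines.
        source = "".join(src) if isinstance(src, list) else src
        if any(marker in source for marker in INSTALL_MARKERS):
            continue  # dependencies are already baked into the image
        lines = [
            "# " + line if line.lstrip().startswith(("!", "%")) else line
            for line in source.splitlines()
        ]
        chunks.append("\n".join(lines))
    script = "\n\n".join(chunks) + "\n"
    ast.parse(script)  # raises SyntaxError if the result is not valid Python
    return script


demo = {"cells": [
    {"cell_type": "code", "source": ["!pip install -qqq uv\n"]},
    {"cell_type": "code", "source": "%matplotlib inline\nx = 1\nprint(x)"},
]}
print(notebook_to_script(json.dumps(demo)))
```

Commenting out a magic line inside an indented block can leave that block empty; the `ast.parse` call surfaces such cases at conversion time rather than when the script runs.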

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 391532c031

chatgpt-codex-connector · 2026-05-24T10:16:35Z

+            git clone --depth 1 -b docker-blackwell-build \
+                https://github.com/unslothai/unsloth.git /tmp/unsloth-pr 2>&1 | tail -3


Clone stable ref in fallback build-context path

When this script is run outside the repo tree, the fallback path clones a hardcoded docker-blackwell-build branch. That branch name is PR-specific and can disappear after merge, so the fallback clone will fail and Block 2 cannot build at all. This breaks the script’s advertised “clone if needed” flow for users validating from a clean host; use a stable default ref (or a configurable ref input) instead of a transient PR branch.

danielhanchen · 2026-05-24T10:19:20Z

Cross-host validation #2: sm_120 (RTX PRO 6000 Blackwell, 96GB) — full gpt-oss-20B fine-tuning

End-to-end validated on a fresh GCP RTX PRO 6000 Blackwell Server Edition host. Image pulled from HF (huggingface.co/danielhanchen/unsloth-blackwell-docker), loaded into Docker, and exercised with both the smoke test AND the full gpt-oss-20B fine-tuning notebook.

Setup

Host: GCP, NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120), 97887 MiB, driver 580.126.09
Docker: 29.2.1, buildx 0.30.1
Image: identical `unsloth-blackwell:test` built on the AWS B200, distributed via HF Hub

Smoke test (5-step LoRA on Llama-3.2-1B-bnb-4bit)

```
Unsloth container: 1 GPU(s). Primary: NVIDIA RTX PRO 6000 Blackwell Server Edition sm_120 bf16=True
arches: ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
step 0 loss=4.0946
step 1 loss=4.0556
step 2 loss=3.9781
step 3 loss=3.8706
step 4 loss=3.7453
=== all checks passed ===
```

Loss is ~0.4% off the B200 (sm_100) run (4.1108 → 3.7511); expected because sm_100 vs sm_120 Triton kernels produce slightly different bf16 rounding paths, deterministically.

Full gpt-oss-20B fine-tuning (10 LoRA steps, MXFP4 + MoE)

10 SFT steps on `HuggingFaceH4/Multilingual-Thinking` with the gpt-oss-20B MXFP4 model. Real workload, real MoE expert LoRA, real Harmony format inference at multiple reasoning_effort levels:

```
==((====))== Unsloth 2026.5.6: Fast Gpt_Oss patching. Transformers: 5.5.0.
\\ /| NVIDIA RTX PRO 6000 Blackwell Server Edition. Num GPUs = 1.
Max memory: 94.971 GB. Platform: Linux.
O^O/ \/ \ Torch: 2.10.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
"-___-"

Unsloth: Detected MoE model with num_experts = 32 and target_modules = [...].
Enabling LoRA on MoE parameters: ['mlp.experts.gate_up_proj', 'mlp.experts.down_proj']

Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)
Num examples = 933 | Num Epochs = 1 | Total steps = 10
Batch size per device = 1 | Gradient accumulation steps = 4

step 1: loss=1.071 grad_norm=2.805
step 2: loss=1.633 grad_norm=3.155
step 3: loss=1.053 grad_norm=2.815
step 4: loss=0.835 grad_norm=2.14
step 5: loss=1.363 grad_norm=2.239
step 6: loss=0.936 grad_norm=1.645
step 7: loss=0.958 grad_norm=1.641
step 8: loss=1.203 grad_norm=1.843
step 9: loss=1.296 grad_norm=2.013
step 10: loss=0.952 grad_norm=1.679

train_runtime = 141.7s (2.36 minutes)
peak reserved memory = 12.625 GB / 94.971 GB (13.3%)
```

Post-train inference at `reasoning_effort="medium"` and `"high"` produced coherent French reasoning output via the Harmony format -- confirming MXFP4 weights are loading correctly, MoE expert routing works, and the Triton kernels JIT-compile for sm_120 at first use.

What this validates

cu128 wheel's sm_120 SASS works on real sm_120 hardware
xformers + bnb + triton fat binaries run cross-arch (built for the same image that worked on B200 sm_100)
MXFP4 quantization (`unsloth/gpt-oss-20b`, no bnb-4bit variant) works inside the container
MoE expert LoRA targeting (`mlp.experts.gate_up_proj`, `mlp.experts.down_proj`) works
The whole `docker save → HF Hub → docker pull → docker load → docker run` flow works for a 13 GB image
Triton 3.6.0 + sm_120 first-run JIT cost is acceptable (the 1st training step is slow at 105s as Triton compiles; subsequent steps drop to ~3-7s/step)

Additional bugs caught and fixed during this validation

`huggingface-cli` is deprecated in huggingface_hub >= 0.27, silently exits without doing the upload/download. Switched `hf_{push,pull}.sh` to the new `hf` CLI (7354642).
`jupyter nbconvert` was silently failing in the container. Replaced with direct `nbformat`-based conversion (8344fa0).
`nbformat` conversion dumped raw `!pip install` shell magic from notebook install cells, breaking Python parse. Added install-cell skip + `ast.parse()` assertion (391532c).

Make the docker image multi-arch so DGX Spark (GB10, sm_121, aarch64) and the Grace-Hopper / Grace-Blackwell SoCs (GH200 arm64, GB200 arm64) pull a natively-built arm64 child from the same manifest. Runtime emulation is NOT involved -- QEMU is used only for the cross-compile step on x86_64 CI runners; consumers on aarch64 hosts get a normal arm64 image and CUDA works as on any other host.

Dockerfile:
* ARG TARGETARCH; switch unsloth extras between cu128-ampere-torch2100 (amd64, with xformers) and huggingface (arm64, no xformers -- there is no cu128 aarch64 xformers wheel as of 0.0.34, so we fall back to Unsloth's native SDPA path; ~5-10% slowdown but functionally complete).
* Build-time torch._C._cuda_getArchFlags() assertion: amd64 still requires sm_120, arm64 accepts sm_120 or sm_121.
* Same TORCH_CUDA_ARCH_LIST on both arches; nvcc emits whatever's listed.

docker/setup_qemu.sh (new):
One-time host setup -- registers binfmt_misc handlers via tonistiigi/binfmt and creates a 'unsloth-multiarch' docker-container buildx builder. Required only on x86_64 build hosts targeting arm64.

docker/test_locally.sh:
--platform amd64|arm64 flag. Cross-builds verify QEMU is registered, then build through the in-image arch-flags assertion. Smoke + notebook blocks auto-skip when image arch != host arch (CUDA cannot run under user-space QEMU + nvidia-container-toolkit cannot bridge a QEMU guest to a real GPU).

.github/workflows/docker-publish.yml:
platforms: linux/amd64,linux/arm64 (single manifest, two children). Timeout bumped 60 -> 150 min for the slower arm64-under-QEMU leg. docker/setup-qemu-action@v3 with platforms: arm64 (was implicit before).

danielhanchen · 2026-05-24T10:36:16Z

DGX Spark / linux/arm64 support added via QEMU at build time (e7cfcea).

The image is now multi-arch: one Docker manifest with linux/amd64 + linux/arm64 children. docker pull unsloth/unsloth:latest on x86_64 hosts gets the amd64 layer; on DGX Spark / Grace / Grace-Hopper it gets the arm64 layer natively, with normal CUDA access. QEMU is only used at build time on the x86_64 CI runner -- never at runtime, where it would break CUDA.

What changed:

docker/Dockerfile: ARG TARGETARCH switches the unsloth extras between cu128-ampere-torch2100 (amd64, with xformers) and huggingface (arm64, falls back to Unsloth's native SDPA path -- xformers does not yet publish a cu128 aarch64 wheel as of 0.0.34). Build-time torch._C._cuda_getArchFlags() assertion now accepts sm_120 or sm_121 on arm64.
docker/setup_qemu.sh: one-time host setup that registers binfmt_misc handlers via tonistiigi/binfmt and creates a multi-arch buildx builder. Only needed when building arm64 on an x86_64 host.
docker/test_locally.sh: --platform amd64|arm64 flag. When the target arch differs from the host arch, the smoke test and notebook blocks auto-skip with a clear warning (CUDA does not work under QEMU runtime emulation and nvidia-container-toolkit cannot bridge a QEMU guest to a real GPU). The build itself still runs through the arch-flags assertion, so PRs get build-time validation of the arm64 leg even on amd64 runners.
.github/workflows/docker-publish.yml: platforms: linux/amd64,linux/arm64 (single manifest). Timeout bumped 60 -> 150 min for the slower arm64-under-QEMU leg.

Arm64 GPU SoCs supported by the new variant:

Compute Cap	SoC	Examples
sm_90	Grace-Hopper	GH200
sm_100	Grace-Blackwell	GB200
sm_121	Blackwell + Grace	GB10 (DGX Spark)

End-to-end arm64 validation on actual hardware is pending -- the build-time assertion exercises the wheel resolution and the cu128 aarch64 fat binary, but the final proof is docker run --gpus all on a real DGX Spark / GB200. That can run once we have access; the published manifest will already include the arm64 child for early users to test.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7cfceadab

chatgpt-codex-connector · 2026-05-24T10:39:05Z

+    import unsloth_zoo
+
+    print(f"unsloth_zoo {unsloth_zoo.__version__}")
+    import xformers


Gate xformers import by platform in smoke test

The arm64 image path intentionally omits xformers (UNSLOTH_EXTRA="huggingface" in docker/Dockerfile) because no cu128 aarch64 wheel is expected, but check_imports() unconditionally does import xformers here. On arm64 runs this makes /workspace/smoke_test.py fail before training, so the published arm64 variant cannot pass the repository’s own runtime validation despite being a supported target.
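
One way to gate the import is to build the module list from the machine type. The sketch below is an illustration under assumptions: `required_modules` is a hypothetical helper, and the machine strings are the values `platform.machine()` commonly reports for x86_64 hosts.

```python
import importlib
import platform
from typing import Optional


def required_modules(machine: str) -> list[str]:
    """Modules the smoke test should import on a given machine type."""
    modules = ["unsloth", "unsloth_zoo"]  # unsloth first, then unsloth_zoo
    if machine.lower() in ("x86_64", "amd64"):
        modules.append("xformers")  # the arm64 image ships without xformers
    return modules


def check_imports(machine: Optional[str] = None) -> None:
    """Import each required module and print its version."""
    for name in required_modules(machine or platform.machine()):
        module = importlib.import_module(name)
        print(f"{name} {getattr(module, '__version__', 'unknown')}")


print(required_modules("aarch64"))
```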

chatgpt-codex-connector · 2026-05-24T10:39:06Z

+    print(f"cuda build  {torch.version.cuda}")
+    print(f"arches      {arches}")
+    assert "sm_100" in arches, f"sm_100 missing: {arches}"
+    assert "sm_120" in arches, f"sm_120 missing: {arches}"


Accept sm_121 in smoke-test arch validation

This assertion hard-requires sm_120, but the same commit’s Dockerfile build validation explicitly treats arm64 Blackwell as valid when either sm_120 or sm_121 is present. As written, a valid arm64 build that reports only sm_121 will fail the smoke test with a false negative, even though the image is intended to support GB10/DGX Spark.
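
A platform-aware version of the check might look like the sketch below. `arch_problems` is a hypothetical helper that mirrors the Dockerfile rule quoted in this thread (amd64 requires sm_120, arm64 accepts sm_120 or sm_121) and keeps the existing sm_100 requirement on both; in the smoke test the list would come from `torch.cuda.get_arch_list()`.

```python
def arch_problems(arches: list[str], machine: str) -> list[str]:
    """Return a list of human-readable problems; empty means the check passes."""
    problems = []
    if "sm_100" not in arches:
        problems.append(f"sm_100 missing: {arches}")
    if machine in ("aarch64", "arm64"):
        # arm64 Blackwell may report either sm_120 or sm_121.
        if not {"sm_120", "sm_121"} & set(arches):
            problems.append(f"sm_120/sm_121 missing: {arches}")
    elif "sm_120" not in arches:
        problems.append(f"sm_120 missing: {arches}")
    return problems


print(arch_problems(["sm_90", "sm_100", "sm_121"], "aarch64"))
```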

…st tag, arm64 decord

- pip shim: count editable/local/url/vcs targets (-e ., ., git+https, wheel URLs) as install targets, not just canonical package names, so they are no longer silently skipped inside notebooks
- notebook sync: never overwrite a pre-existing user notebook on first boot (match the refresh path's ownership rule); skip .unsloth_sync_state.tmp when recording state so it is not tracked as a managed file
- docker-publish: set flavor latest=false on the base image metadata so a v* tag push cannot publish :latest from the base image (the Studio image owns it)
- notebook deps: pin to tested versions and install decord on its own, hard on amd64 and fail-soft on arm64 (no aarch64 wheel) so the arm64 base build works

The base build-args never passed LLAMA_PREBUILT_TAG, so the Dockerfile fell back to latest and each matrix leg resolved whatever unslothai/llama.cpp release was current at its own build time. If latest moved between the amd64 and arm64 legs, one published manifest could carry different GGUF binaries per arch. Resolve the release once in a new prepare job (explicit llama_prebuilt_tag dispatch input for a frozen build, else follow the /releases/latest redirect to a concrete tag, mirroring docker/build.sh) and pass that single tag to both legs.

danielhanchen · 2026-06-26T08:27:27Z

Thanks, all four addressed.

Gate decord on amd64 builds (8f693c6): decord now installs on its own line, hard on amd64 and fail-soft on arm64 (no aarch64 wheel, no sdist), so the arm64 base build no longer dies in that layer.

Invoke pip for requirement-file installs (8f693c6): the final guard now counts any non-flag token as an install target (has_install_target), so -e ., ., git+https://..., a wheel URL, and -r reqs.txt all run pip instead of being silently skipped.

Disable auto latest for tag refs (8f693c6): flavor: latest=false is set on the base image metadata (merge job and the smoke-test recompute), so a v* tag push can't publish :latest from the lean base image. Left latest=auto on the Studio image deliberately: the Studio image owns :latest, and on a tagged release we do want :latest to follow the newest semver tag. The base image's base- prefix keeps the two namespaces separate.

Pin llama.cpp prebuilt in release builds (8402dce): the base build-args never passed LLAMA_PREBUILT_TAG, so each matrix leg resolved latest independently and a release moving between the amd64 and arm64 legs could put different GGUF binaries under one manifest. A new prepare job resolves the release once (explicit llama_prebuilt_tag dispatch input for a frozen build, else the /releases/latest redirect to a concrete tag, mirroring docker/build.sh) and passes that single tag to both legs.

danielhanchen · 2026-06-26T08:32:36Z

@codex review

danielhanchen · 2026-06-26T08:35:55Z

Went through the latest review pass (the 5-reviewer run plus the Codex review) and fixed the genuinely-open items on this branch:

pip shim (docker/unsloth_pip_shim.py): it counted only canonical package names as install targets, so pip install -e ., pip install ., git+https://... and direct wheel URLs were silently dropped inside notebooks. It now treats any non-flag token as a target, so editable / local / VCS / URL installs run while a flags-only line (e.g. pip install --upgrade) still no-ops.
notebook sync (docker/unsloth_sync_notebooks.sh): the first-boot populate cp -a'd the baked template over any pre-existing notebook, which could clobber a bind-mounted or hand-created file on the very first start. It now applies the same ownership rule the refresh path already had (keep a pre-existing file unless it is byte-identical to the template). Also excluded .unsloth_sync_state.tmp when recording state so the temp file is not tracked as a managed notebook.
docker-publish.yml: set flavor: latest=false on the base image metadata steps. metadata-action defaults to latest=auto, which would have tagged :latest on a v* (semver) tag push from the base image and collided with the Studio image that legitimately owns :latest.
notebook deps / decord: pinned the notebook dependency set to the tested versions and split decord into its own step, hard on amd64 (a missing/incompatible wheel is a real regression there) and fail-soft on arm64/other (decord ships no aarch64 wheel), so the native arm64 base build no longer breaks on it.

Also re-merged current main so the branch carries the latest offline checkpoint load/export and Colab tunnel work and stays conflict-free.

Most of the older comments in this PR were already resolved by the branch's history (build-args prose moved out of the build-args: blocks, least-privilege permissions: default, run.sh secret handling / TTY-gated -it / --gpus translation, torchaudio==2.10.0, preserved numpy/_core/tests, pinned UNSLOTH_REF / UNSLOTH_ZOO_REF).

One follow-up that is intentionally out of scope here: the reviewers flagged asymmetric CPU-mode CUDA-probe guards in unsloth_zoo (compiler.py, loss_utils.py). Those live in the unslothai/unsloth_zoo repo and need a separate PR there; the Docker CPU-mode confirmation scripts stay torch-only until that lands so they do not fail on the unguarded probe.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8402dcebdd

chatgpt-codex-connector · 2026-06-26T08:54:30Z

+        fi
+        cp -a "$TEMPLATE/$rel" "$DEST/$rel" 2>/dev/null || true
+    done
+    record_state


Preserve pre-existing notebooks outside the sync state

When /workspace/unsloth-notebooks already contains a user-edited file on first boot, the branch above skips copying the template, but this record_state call records every file under DEST as if the sync wrote it. On the subsequent GitHub refresh, that file's current hash matches the recorded hash, so it is treated as pristine and can be overwritten by the upstream notebook despite being pre-existing user data.
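
The ownership rule at stake can be written as a small decision function. This is a sketch under assumptions: `should_overwrite` is hypothetical, and it models the sync state as a recorded SHA-256 per file, which may differ from how the shell script stores it.

```python
import hashlib
from typing import Optional


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def should_overwrite(current: Optional[bytes], recorded_hash: Optional[str]) -> bool:
    """Decide whether a refresh may replace a destination file with upstream."""
    if current is None:
        return True   # file does not exist yet: safe to write
    if recorded_hash is None:
        return False  # present but never recorded by the sync: user-owned
    # Recorded by an earlier sync: overwrite only if it is still unmodified.
    return sha256(current) == recorded_hash


notebook = b"print('hello')\n"
print(should_overwrite(notebook, None))
```

Under this rule, recording a kept pre-existing file in the state is exactly what turns a user-owned file into an overwritable one, which is the failure this comment describes.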

chatgpt-codex-connector · 2026-06-26T08:54:30Z

+    # not just tokens with a canonical pkg name: editable / local / url / vcs
+    # installs (`-e .`, `.`, `git+https://...`, a wheel URL) carry no canonical
+    # name but must still run, and a `-r`/`-c` file pulls in real requirements.
+    has_install_target = any(not t.startswith("-") for t in keep_args)


Ignore option values when deciding whether pip has targets

For notebook install cells that only request baked packages plus an index/link flag, e.g. pip install --extra-index-url https://download.pytorch.org/whl/cu128 torch transformers==5.3.0, the shim drops the package specs but leaves the flag and its URL in keep_args. This test then treats the URL as an install target and execs pip install --extra-index-url <url> with no requirement, causing the cell to fail instead of no-oping after preserving the baked stack.
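
A target test that skips option values could look like the following sketch. It is not the shim's code: the flag sets are an assumed subset of pip's value-taking options, and the function body is illustrative.

```python
# Flags whose next token is a value, not an install target (assumed subset).
_VALUE_FLAGS = {"-i", "--index-url", "--extra-index-url",
                "-f", "--find-links", "-c", "--constraint"}
# Flags whose next token is a requirements file, which does count as a target.
_REQ_FILE_FLAGS = {"-r", "--requirement"}


def has_install_target(keep_args: list[str]) -> bool:
    prev_flag = None
    for token in keep_args:
        flag, prev_flag = prev_flag, None
        if flag in _REQ_FILE_FLAGS:
            return True          # `-r reqs.txt` pulls in real requirements
        if flag in _VALUE_FLAGS:
            continue             # value of an index/link/constraint flag
        if token.startswith("-"):
            name, sep, _ = token.partition("=")
            if sep and name in _REQ_FILE_FLAGS:
                return True      # `--requirement=reqs.txt`
            if not sep:
                prev_flag = token  # the next token may be this flag's value
            continue
        return True              # positional: `.`, a URL, a VCS spec, a package
    return False


print(has_install_target(["--extra-index-url", "https://download.pytorch.org/whl/cu128"]))
print(has_install_target(["-e", "."]))
```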

chatgpt-codex-connector · 2026-06-26T08:54:30Z

+            # The full Studio image owns the unprefixed namespace, headed by
+            # :latest. Same :latest gating rationale as the base job.
+            type=raw,value=latest,enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) && github.event.inputs.unsloth_ref == '' }}
+            type=ref,event=tag


Disable implicit latest tags for Studio tag builds

This Studio metadata block omits flavor: latest=false, so type=ref,event=tag can still emit an implicit :latest tag under metadata-action's default latest=auto behavior (docs), bypassing the explicit branch-only gate above on v* tag pushes. The mirrored smoke-test metadata block has the same config, so it can also pull :latest instead of the tag that was just published.

chatgpt-codex-connector · 2026-06-26T08:54:30Z

+# (or any branch/tag/sha); otherwise take the latest PyPI release.
+if [ -n "$REF" ]; then
+    SPECS="git+https://github.com/unslothai/unsloth.git@${REF}#egg=unsloth"
+    SPECS="$SPECS git+https://github.com/unslothai/unsloth-zoo.git@${REF}#egg=unsloth_zoo"


Resolve unsloth-zoo separately for ref updates

When --ref is an Unsloth release tag or commit SHA, this installs unsloth-zoo from the same ref even though that repo is not guaranteed to have matching tags or SHAs; the publish workflow already has separate zoo-ref resolution for this reason. In those cases the advertised unsloth-studio-update --ref <tag|sha> path fails before updating Studio, so the script should resolve the zoo ref independently or leave it on a known default.
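
Independent resolution can be sketched as a three-step fallback. `resolve_zoo_ref` and its parameters are hypothetical; `zoo_has_ref` stands in for whatever remote lookup the script performs (for example a `git ls-remote` query), injected here so the logic runs without network access.

```python
from typing import Callable, Optional


def resolve_zoo_ref(unsloth_ref: str,
                    zoo_ref: Optional[str],
                    zoo_has_ref: Callable[[str], bool],
                    default: str = "main") -> str:
    """Pick the unsloth-zoo ref without assuming it mirrors the unsloth ref."""
    if zoo_ref:
        return zoo_ref                      # explicit override wins
    if unsloth_ref and zoo_has_ref(unsloth_ref):
        return unsloth_ref                  # reuse only if the zoo repo has it
    return default                          # otherwise fall back to a known ref


known_zoo_refs = {"main", "shared-branch"}
print(resolve_zoo_ref("v-only-in-unsloth", None, known_zoo_refs.__contains__))
```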

chatgpt-codex-connector · 2026-06-26T08:54:30Z

+#   omegaconf  TTS families + both NeMo-Gym RL notebooks' config objects
+#   einx  TTS codec tensor-rearrange (Llasa / Oute / Spark TTS)
+#   librosa  Whisper audio feature extraction (pairs with soundfile + torchcodec)
+#   ftfy  Oute TTS text normalisation


Remove unsupported sm_103 from the CUDA 12.8 arch list

This runtime arch list is consumed by PyTorch/CUDA extension builds inside the container, but the image only installs CUDA 12.8 nvcc on amd64 while NVIDIA documents compiler target support for sm_103 as added in CUDA 12.9 (CUDA features archive). Any pip install/JIT path that honors TORCH_CUDA_ARCH_LIST will pass an unsupported compute_103 target to nvcc 12.8 and fail, even on non-B300 hosts; drop 10.3 here or ship a 12.9+/13 compiler wherever it is advertised.
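
The constraint can be expressed as a filter over the arch list keyed by toolkit version. In this sketch `filter_arch_list` is hypothetical, and the only minimum-version entry is the one stated in the comment above (10.3 needs CUDA 12.9 or newer).

```python
# Minimum CUDA toolkit per arch entry; only 10.3 is taken from the review text.
MIN_CUDA = {"10.3": (12, 9)}


def filter_arch_list(arch_list: str, cuda_version: tuple[int, int]) -> str:
    """Drop entries the given nvcc version cannot compile."""
    kept = [
        entry for entry in arch_list.split(";")
        if entry and cuda_version >= MIN_CUDA.get(entry.removesuffix("+PTX"), (0, 0))
    ]
    return ";".join(kept)


print(filter_arch_list("9.0;10.0;10.3;12.0+PTX", (12, 8)))
```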

…gs, zoo ref, arch list)

- pip shim: do not treat the value of an index-url / find-links / constraint flag as an install target. A cell like 'pip install --extra-index-url <url> torch' now no-ops after keeping the baked stack instead of exec'ing a bare 'pip install --extra-index-url <url>' that fails. Positional . / url / vcs and -r/--requirement files still count as targets.
- notebook sync: on first boot, record only files we actually wrote (or that are byte-identical to the template), never a kept pre-existing user file; and on the GitHub refresh, treat a file present in DEST but absent from the sync state as user-owned and keep it. Previously a bind-mounted notebook was recorded as managed and then overwritten by upstream.
- docker-publish: add flavor latest=false to the Studio metadata steps too, so a v* tag push cannot emit an implicit :latest via metadata-action's latest=auto; :latest stays default-branch-only, and the smoke test pulls the published tag.
- unsloth-studio-update: resolve the unsloth-zoo ref independently of --ref (new --zoo-ref, else use the ref only when the zoo repo has it, else fall back to main) so 'update --ref <unsloth-tag/sha>' does not fail on a missing zoo ref.
- Dockerfile: drop 10.3 (compute_103) from TORCH_CUDA_ARCH_LIST in both the builder and runtime stages. B300 runs sm_100 SASS, and the bundled CUDA 12.8 nvcc cannot compile compute_103 (added in 12.9), which broke arch-list-honoring source / JIT builds.

danielhanchen · 2026-06-26T09:19:19Z

All five are valid; fixed:

docker/unsloth_sync_notebooks.sh: two-part fix. On first boot, a kept pre-existing user file is no longer recorded in the sync state (only files we actually wrote, or that are byte-identical to the template, are recorded), and the GitHub refresh now treats a file that exists in DEST but is absent from the state as user-owned and keeps it. Previously a bind-mounted notebook was recorded as managed, then overwritten by upstream on the next refresh.
docker/unsloth_pip_shim.py: the install-target test no longer counts the value of an index-url / find-links / constraint flag as a target. pip install --extra-index-url <url> torch transformers==5.3.0 now no-ops after keeping the baked stack instead of exec'ing a bare pip install --extra-index-url <url> that fails. Positional . / url / vcs / editable targets and -r/--requirement files still count.
.github/workflows/docker-publish.yml: added flavor: latest=false to the Studio metadata steps as well (the merge step and the smoke-test step). They omitted it, so type=ref,event=tag under the default latest=auto could still emit an implicit :latest on a v* tag and bypass the default-branch-only gate, and the smoke test could pull :latest instead of the just-published tag. :latest (and the :studio alias) now publish only via the explicit default-branch rules.
docker/unsloth_studio_update.sh: --ref no longer forces the same ref onto unsloth-zoo. There is a new --zoo-ref, and otherwise the unsloth ref is used for the zoo only when the zoo repo actually has it (checked via git ls-remote), falling back to main so update --ref <unsloth-tag/sha> does not fail on a tag/SHA the zoo never cut. This mirrors the publish workflow's separate zoo-ref resolution.
docker/Dockerfile: dropped 10.3 (compute_103) from TORCH_CUDA_ARCH_LIST in both the builder and runtime stages. B300 runs sm_100 SASS (already noted in the wheel-coverage comment), and the bundled CUDA 12.8 nvcc cannot compile compute_103 (added in CUDA 12.9), so listing it broke any source/JIT build inside the container that honors the arch list.

Validation: bash -n on the shell scripts, py_compile on the Python helpers, and the workflow parses as valid YAML.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ebbdbb9cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T09:38:59Z

+            if prev_flag in _REQ_FILE_FLAGS:
+                has_target = True


Filter requirement files before invoking pip

When a notebook install cell uses pip install -r requirements.txt, this branch marks the requirements file as a real target and later execs pip with the file unchanged. Any protected packages inside that file, such as torch, transformers, vLLM, or nvidia wheels, bypass _KEEP and the Transformers sidecar marker logic, so a requirements file with common training deps can overwrite the baked cu128 stack or install Transformers into the base venv. Parse/filter requirement files or reject files that mention protected packages before passing them through.

chatgpt-codex-connector · 2026-06-26T09:38:59Z

+          # out of build-args -- forwarded lines must be KEY=VALUE only.)
+          build-args: |
+            BASE_IMAGE=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.merge.outputs.digest }}
+            UNSLOTH_STUDIO_REF=${{ github.event.inputs.unsloth_ref || (startsWith(github.ref, 'refs/tags/') && github.ref_name) || github.sha || 'main' }}


Forward the resolved zoo ref into Studio builds

For workflow dispatches that set unsloth_zoo_ref (or future tag pushes where the zoo resolver returns something other than main), the base image bakes steps.zoo_ref.outputs.ref but the Studio job only forwards the Unsloth ref. Dockerfile.studio runs install.sh --local, and that local-install path overlays unsloth-zoo from git main, so the published full image can run a Studio backend with a different zoo than the base image and the operator-requested ref; pass the resolved zoo ref through this build and install that ref in the Studio venv.

chatgpt-codex-connector · 2026-06-26T09:38:59Z

+# see unsloth_sync_notebooks.sh + unsloth_nb_content_sig.py. Inherited as-is by
+# the studio image (FROM base).
+RUN set -eux \
+ && git clone --depth 1 https://github.com/unslothai/notebooks /opt/unsloth-notebooks \


Pin baked notebooks to one resolved commit

Each architecture leg runs this clone independently against unslothai/notebooks HEAD, so if that repo advances between the amd64 and arm64 builds (or between release reruns), the same Docker tag seeds different baked notebook templates and .unsloth_template_commit state depending on the pulled platform. Resolve the notebooks commit once in the workflow/build script and clone that ref, like the llama.cpp prebuilt tag, so the multi-arch image contents stay consistent.

…inned notebooks commit)

- unsloth_pip_shim.py: filter protected packages out of a notebook `pip install -r requirements.txt`. The -r value was passed to the real pip unchanged, so torch / transformers / vLLM / nvidia pins inside the file could overwrite the baked cu128 stack or push transformers into the base venv. _filter_requirements_file() applies the same _KEEP / transformers-sidecar rules per line, writes the survivors to a temp file, keeps comments, option lines, nested includes and urls verbatim, and records a pinned transformers version for the sidecar.
- install.sh + Dockerfile.studio + docker-publish.yml: forward the resolved unsloth-zoo ref into the Studio build. install.sh --local overlaid unsloth-zoo from git main regardless of the operator-requested or base-image ref, so the full image could run a different zoo than the base. install.sh now honors UNSLOTH_ZOO_REF across all four --local overlays, Dockerfile.studio passes UNSLOTH_STUDIO_ZOO_REF through to it, and the workflow resolves one zoo ref in the prepare job and shares it with both the base and Studio builds.
- Dockerfile + docker-publish.yml: pin unslothai/notebooks to one resolved commit. Each arch leg cloned HEAD independently, so the same tag could seed different baked templates and .unsloth_template_commit depending on the pulled platform. The prepare job freezes notebooks to one sha (like the llama.cpp prebuilt tag) and the Dockerfile fetches that single ref at depth 1.

danielhanchen · 2026-06-26T10:53:50Z

All three were valid. Fixed in d476c7764:

docker/unsloth_pip_shim.py (filter -r files): a notebook pip install -r requirements.txt previously passed the file to the real pip unchanged, so torch / transformers / vLLM / nvidia pins inside the file bypassed _KEEP and the transformers sidecar and could overwrite the baked cu128 stack or push transformers into the base venv. _filter_requirements_file() now applies the same _KEEP / transformers rules per requirement line, writes the survivors to a temp file passed to pip, and records a pinned transformers version for the sidecar. Comments, option lines, nested -r/-c includes and url/vcs/path lines are kept verbatim; when nothing is protected or the file cannot be read, the original path is passed through unchanged.
install.sh + docker/Dockerfile.studio + .github/workflows/docker-publish.yml (forward the zoo ref into Studio): install.sh --local overlaid unsloth-zoo from git main regardless of the resolved/operator-requested ref, so the full image could run a different zoo than the base. install.sh now honors UNSLOTH_ZOO_REF across all four --local overlays, Dockerfile.studio forwards UNSLOTH_STUDIO_ZOO_REF into it, and the workflow resolves one zoo ref in the prepare job and shares it with both the base build and the Studio build (so they always agree).
docker/Dockerfile + .github/workflows/docker-publish.yml (pin the notebooks commit): each arch leg cloned unslothai/notebooks HEAD independently, so the same tag could seed different baked templates and .unsloth_template_commit across platforms or release reruns. The prepare job now freezes notebooks to one resolved sha (the same pattern as the llama.cpp prebuilt tag) and the Dockerfile fetches that single ref at depth 1.

Validation: bash -n on the shell scripts, py_compile plus a functional check of _filter_requirements_file (torch/transformers/nvidia/vLLM dropped, transformers version recorded, numpy/omegaconf/nested-r/git-url kept), and the workflow parses as valid YAML. Propagated into the stacked Studio PR.
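
For readers who want a feel for the per-line filtering described above, here is a minimal sketch. It is not the shim's _filter_requirements_file(): the `PROTECTED` set, the `nvidia-` prefix rule, and the function names are assumptions for illustration, the input is a string rather than a file, and the transformers sidecar pin is not recorded.

```python
import re

# Assumed stand-ins for the shim's protected set; the real _KEEP is not shown in this thread.
PROTECTED = {"torch", "transformers", "vllm"}
PROTECTED_PREFIXES = ("nvidia-",)

_NAME_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def canon(name: str) -> str:
    """Normalize a distribution name (lowercase, runs of -_. become one dash)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def filter_requirements(text: str):
    """Return (kept_text, dropped_names) for the contents of a requirements file."""
    kept, dropped = [], []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines, comments and option lines (-r, -c, --index-url ...) pass through.
        if not stripped or stripped.startswith(("#", "-")):
            kept.append(line)
            continue
        match = _NAME_RE.match(stripped)
        name = canon(match.group(1)) if match else None
        if name and (name in PROTECTED or name.startswith(PROTECTED_PREFIXES)):
            dropped.append(name)  # protected: keep the baked version instead
            continue
        kept.append(line)
    return "\n".join(kept), dropped


sample = "torch==2.10.0\nnumpy>=1.26\n# note\n-r extras.txt\nnvidia-cublas-cu12\ntransformers==5.3.0\n"
kept, dropped = filter_requirements(sample)
print(kept)
print(dropped)
```

Note that this sketch, like the one-level filter described here, passes nested `-r` lines through untouched.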

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d476c7764b

chatgpt-codex-connector · 2026-06-26T11:10:11Z

+    if re.match(r"^[a-z]+\+", token) or "://" in token or token.startswith((".", "/")):
+        return None  # vcs / url / local path -> let it pass through


Keep protected direct-reference installs out

When a notebook uses a quoted PEP 508 direct reference for a protected package, e.g. pip install "torch @ https://.../torch.whl" or "unsloth @ git+https://...", this early URL check returns None before extracting the distribution name. The token is then kept and treated as a real target, so the shim can reinstall torch/Unsloth into the base venv even though _KEEP is supposed to preserve the baked CUDA stack.

chatgpt-codex-connector · 2026-06-26T11:10:11Z

+        if tok in _VALUE_FLAGS:
+            keep_args.append(tok)
+            skip_next = True
+            prev_flag = tok
+            continue


Handle equals-form requirement files

For a valid pip invocation like pip install --requirement=requirements.txt (the same --requirement <file> option also accepts the standard --option=value form), this exact-token check does not recognize the requirements file. The argument starts with -, so it is kept as an option but has_target remains false; cells whose only install target is that file return as a no-op and skip all of its dependencies.
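
A rough sketch of a scan that recognizes both the space form and the equals form is shown below. The flag sets and `scan_args` are illustrative stand-ins, not the shim's _VALUE_FLAGS / _REQ_FILE_FLAGS or its real scan loop, and attached short-option values are deliberately left out.

```python
# Illustrative flag sets (assumptions, not the shim's actual tables).
REQ_FILE_FLAGS = {"-r", "--requirement"}
VALUE_FLAGS = REQ_FILE_FLAGS | {
    "-c", "--constraint", "-i", "--index-url",
    "--extra-index-url", "-f", "--find-links",
}


def scan_args(args):
    """Return (requirement_files, has_target) for a pip-install argument list."""
    req_files, has_target = [], False
    it = iter(args)
    for tok in it:
        flag, sep, value = tok.partition("=")
        if tok.startswith("--") and sep and flag in VALUE_FLAGS:
            # --flag=value form: the value is inline in the same token.
            if flag in REQ_FILE_FLAGS:
                req_files.append(value)
                has_target = True
            continue
        if tok in VALUE_FLAGS:
            # --flag value form: the value is the next token.
            value = next(it, None)
            if tok in REQ_FILE_FLAGS and value is not None:
                req_files.append(value)
                has_target = True
            continue
        if not tok.startswith("-"):
            has_target = True  # a positional requirement such as "numpy"
    return req_files, has_target


equals_form = scan_args(["--requirement=requirements.txt"])
space_form = scan_args(["-r", "reqs.txt"])
index_only = scan_args(["--index-url=https://example.invalid/simple"])
index_and_pkg = scan_args(["--extra-index-url", "https://example.invalid/simple", "numpy"])
print(equals_form, space_form, index_only, index_and_pkg)
```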

…value req files

Two more notebook-shim gaps from review:

- A quoted PEP 508 direct reference for a protected package, e.g. `pip install "torch @ https://.../torch.whl"` or `"unsloth @ git+https://..."`, bypassed _KEEP: _canon hit the url guard and returned None before pulling the distribution name, so the token was treated as a real target and reinstalled into the base venv. _canon now extracts the name from the `name [extras] @ url` form first, so a protected package pinned through a URL/VCS is still dropped; a non-protected direct reference returns its name and is kept exactly as before.
- The `--requirement=reqs.txt` equals-form (pip accepts `--option=value` for any value-taking flag) was not recognized: the token starts with `-`, so it was kept as an opaque option, the file was never filtered, and has_target stayed false -- a cell whose only target was that file silently no-op'd. The scan now splits `--flag=value`, filters the requirements file for `-r`/`--requirement`, and counts it as a target; other inline-value options stay options.

danielhanchen · 2026-06-26T11:40:59Z

Both valid. Fixed in 7083a2d9f:

Direct-reference protected installs: a quoted PEP 508 direct reference for a protected package, e.g. pip install "torch @ https://.../torch.whl" or "unsloth @ git+https://...", hit the url guard in _canon and returned None before the distribution name was extracted, so the token was kept as a real target and reinstalled into the base venv. _canon now extracts the name from the name [extras] @ <url> form first, so a protected package pinned through a URL/VCS is dropped like any other _KEEP entry. A non-protected direct reference still returns its name and is kept exactly as before, so genuine extra installs are unaffected.
--requirement= equals-form: pip accepts --option=value for any value-taking flag, but the scan only matched the exact --requirement / -r tokens, so --requirement=requirements.txt started with -, was kept as an opaque option, and has_target stayed false. A cell whose only target was that file silently no-op'd and installed nothing. The scan now splits --flag=value, filters the requirements file for -r/--requirement, and counts it as a target; other inline-value options (--index-url=...) stay options.

Validation: py_compile plus a functional matrix: protected direct refs (torch @ url, unsloth @ git+..., unsloth-zoo@git+..., vllm[extra] @ url) drop and no-op; non-protected (mypkg @ url) and plain url/vcs/path pass through; --requirement=FILE and the space forms all filter the file and record the transformers pin; --index-url=URL pkg keeps the option and installs only the package. Propagated into the stacked Studio PR.
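
The ordering fix in _canon (try the `name [extras] @ url` form before the url guard) can be illustrated with a small sketch. `canon_name`, the regular expressions, and the `PROTECTED` subset below are assumptions for illustration; the url guard itself is copied from the reviewed diff. The URLs are placeholders.

```python
import re

PROTECTED = {"torch", "unsloth", "unsloth-zoo", "vllm"}  # assumed subset of _KEEP

_NAME = r"[A-Za-z0-9][A-Za-z0-9._-]*"
# PEP 508 direct reference: name, optional [extras], "@", then a URL.
_DIRECT_REF = re.compile(rf"^\s*({_NAME})\s*(\[[^\]]*\])?\s*@\s*\S+")


def _normalize(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def canon_name(token: str):
    """Return the normalized distribution name, or None for bare url/vcs/path tokens."""
    direct = _DIRECT_REF.match(token)
    if direct:  # checked first, so "torch @ https://..." yields "torch"
        return _normalize(direct.group(1))
    if re.match(r"^[a-z]+\+", token) or "://" in token or token.startswith((".", "/")):
        return None  # vcs / url / local path -> let it pass through
    plain = re.match(rf"^\s*({_NAME})", token)
    return _normalize(plain.group(1)) if plain else None


tokens = [
    "torch @ https://example.invalid/torch.whl",
    "unsloth-zoo@git+https://example.invalid/zoo.git",
    "vllm[extra] @ https://example.invalid/vllm.whl",
    "mypkg @ https://example.invalid/mypkg.whl",
    "git+https://example.invalid/repo.git",
    "numpy>=1.26",
]
names = [canon_name(t) for t in tokens]
blocked = [n in PROTECTED for n in names]
print(names)
print(blocked)
```

A bare VCS URL still returns None here, which is the passthrough behavior the validation matrix above describes.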

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ee7f4b644

chatgpt-codex-connector · 2026-06-26T11:58:50Z

+ && git checkout -q FETCH_HEAD \
+ && UNSLOTH_STUDIO_HOME="${UNSLOTH_STUDIO_HOME}" \
+    UNSLOTH_TORCH_INDEX_FAMILY="${TORCH_FAMILY}" \
+    UNSLOTH_ZOO_REF="${UNSLOTH_STUDIO_ZOO_REF}" \


Use the resolved zoo ref during Studio setup

When a Studio build passes UNSLOTH_STUDIO_ZOO_REF other than main, this only pins the first install.sh overlay; the same --local flow then runs studio/setup.sh, whose install_python_stack.py local branch force-reinstalls unsloth-zoo @ git+https://github.com/unslothai/unsloth-zoo with no @ref, so the Studio venv ends up on zoo main while the base venv used the resolved ref. This still affects workflow_dispatch/tag publishes that intentionally resolve a non-main zoo ref; thread this env through setup/install_python_stack before the final overlay.

chatgpt-codex-connector · 2026-06-26T11:58:50Z

+        if not stripped or stripped.startswith(("#", "-")):
+            out.append(line)  # comment / blank / option / nested include -> keep


Filter nested requirement includes too

For a notebook cell like pip install -r reqs.txt where reqs.txt contains -r torch-reqs.txt, this branch copies the nested include verbatim; pip recursively processes those requirement includes, so protected specs in the nested file never pass through _KEEP or the Transformers marker logic and can still reinstall torch/transformers into the baked venv. The fresh issue in the current helper is the one-level filter: nested -r/--requirement lines are explicitly preserved instead of being filtered or rejected.

chatgpt-codex-connector · 2026-06-26T11:58:51Z

+        fi
+    fi
+    mkdir -p "$(dirname "$dst")" 2>/dev/null || true
+    if cp -a "$f" "$dst" 2>/dev/null; then


Honor deleted notebooks during refresh

When UNSLOTH_KEEP_DELETED_NOTEBOOKS=1 is set, the offline restore block is skipped, but the GitHub refresh still falls through here for any tracked notebook the user deleted and copies it back from the fresh clone whenever upstream advances. That makes the documented deletion opt-out only work until the next remote update; skip paths that are recorded in the state but absent from DEST when this option is enabled.
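
The per-file decision being asked for can be written down as a small truth table. The sketch below is a Python illustration of that logic only (the sync script is shell); `refresh_action` and its return labels are hypothetical, and the content-signature check that decides whether a managed file may be overwritten is omitted.

```python
def refresh_action(in_state: bool, in_dest: bool, keep_deleted: bool) -> str:
    """Decide what a refresh does with one upstream notebook (hypothetical helper)."""
    if in_dest and not in_state:
        # Present but never recorded: treated as user-owned and kept.
        return "keep-user-file"
    if in_state and not in_dest:
        # Tracked notebook the user deleted: restore only without the opt-out.
        return "skip-deleted" if keep_deleted else "restore"
    if not in_state and not in_dest:
        return "copy-new"
    # Tracked and present: eligible for update (modification checks not shown).
    return "update-managed"


deleted_with_opt_out = refresh_action(in_state=True, in_dest=False, keep_deleted=True)
deleted_default = refresh_action(in_state=True, in_dest=False, keep_deleted=False)
user_file = refresh_action(in_state=False, in_dest=True, keep_deleted=False)
print(deleted_with_opt_out, deleted_default, user_file)
```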

LeoBorcherding · 2026-06-26T17:17:08Z

reviewed on windows 11 / WSL2 (ubuntu 24.04, docker 29.6.1 + nvidia-container-toolkit). built the :core image with no gpu used during the build, then ran it on my RTX 5060 Ti (sm_120 blackwell).

image came out with torch 2.10.0+cu128 and arch flags sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120, so the sm_100/sm_120 kernels are baked in even though nothing built against a gpu.

ran it with --gpus all and /workspace/smoke_test.py passed:

NVIDIA GeForce RTX 5060 Ti. Torch: 2.10.0+cu128. CUDA Toolkit: 12.8
loaded unsloth/Llama-3.2-1B-Instruct-bnb-4bit in 4-bit
step 0 loss=4.1073 ... step 4 loss=3.7499
OK: 5 LoRA steps completed

danielhanchen · 2026-06-26T22:20:10Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ee7f4b644

chatgpt-codex-connector · 2026-06-26T22:33:03Z

+          ok_studio=0; ok_jupyter=0
+          for i in $(seq 1 60); do
+              if curl -fsS http://localhost:18000/api/health >/dev/null 2>&1; then ok_studio=1;  fi
+              if curl -fsS http://localhost:18888/api        >/dev/null 2>&1; then ok_jupyter=1; fi


Probe a public Jupyter endpoint

In the full-image smoke test this request hits Jupyter's /api without any token or login cookie. The launcher always configures Jupyter with a password hash when booting the image, so unauthenticated API calls return 403; because this uses curl -f, ok_jupyter never flips and any HAS_GPU_RUNNER publish run reports the full image unhealthy even when JupyterLab is up. The same /api probe is duplicated in docker/docker_confirm.sh, so use a public endpoint such as /login or authenticate the request.
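
To make the failure mode concrete, here is a Python sketch of the same polling pattern as the shell loop (the workflow itself uses curl). `http_ok` and `wait_until_healthy` are hypothetical names; `http_ok` mirrors `curl -f` in treating any 4xx/5xx response as a failure, which is why a 403 from an authenticated /api endpoint never counts as healthy.

```python
import time
import urllib.request


def http_ok(url: str, timeout: float = 4.0) -> bool:
    """True when the URL answers with a non-error status (like `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # HTTPError (403 etc.), URLError and timeouts all land here
        return False


def wait_until_healthy(probes, attempts=60, delay=2.0, sleep=time.sleep):
    """Poll each named probe until all pass or attempts run out."""
    ok = {name: False for name in probes}
    for _ in range(attempts):
        for name, probe in probes.items():
            if not ok[name]:
                ok[name] = bool(probe())
        if all(ok.values()):
            break
        sleep(delay)
    return ok


# Simulated probes: Studio comes up on the third poll, while an
# unauthenticated /api probe keeps failing as it would with a 403.
calls = {"studio": 0}


def studio_probe():
    calls["studio"] += 1
    return calls["studio"] >= 3


result = wait_until_healthy(
    {"studio": studio_probe, "jupyter_api": lambda: False},
    attempts=5,
    sleep=lambda seconds: None,  # no real waiting in the demo
)
print(result)
```

Against a running container the probes would be something like `lambda: http_ok("http://localhost:18888/login")`, using the ports from the workflow snippet.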

chatgpt-codex-connector · 2026-06-26T22:33:03Z

+# setup is not silently ignored.
+if [[ "${UNSLOTH_ALLOW_CPU:-0}" == "1" ]]; then
+    if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi -L 2>/dev/null | grep -q '^GPU'; then
+        warn "UNSLOTH_ALLOW_CPU=1 and no GPU visible -- continuing on CPU."


Gate CPU mode before Studio model loads

When the image is started on a CPU-only host with UNSLOTH_ALLOW_CPU=1, this branch lets the full Studio image continue and the comments/docs advertise Studio chat as usable. However UNSLOTH_ALLOW_CPU makes Unsloth report DEVICE_TYPE == "cuda", and Studio inference then calls FastLanguageModel.from_pretrained(...), whose CUDA path still unconditionally executes torch.cuda.get_device_properties(0) (see unsloth/models/llama.py:2310 and unsloth/models/vision.py:747). On Docker Desktop/macOS or Windows+AMD, loading a chat model therefore raises instead of falling back to CPU; either keep CPU mode to tooling/Jupyter or guard those CUDA probes.

chatgpt-codex-connector · 2026-06-26T22:33:03Z

+absolute path so there is no recursion. `python -m pip` / `%pip` bypass PATH and
+are not intercepted -- the driven `unsloth-run` handles those by parsing the


Intercept %pip before it mutates the baked stack

For notebooks that use %pip or python -m pip, this explicitly bypasses the PATH shim, but neither the IPython startup hook nor unsloth-run rewrites those cells or installs a pip module wrapper. In that scenario a cell like %pip install transformers==... or %pip install torch... runs the real pip inside /opt/unsloth-venv and can overwrite the cu128 torch/transformers stack that the shim is meant to protect, so the safe notebook execution path is only safe for !pip/!uv shell commands.

…XDEV, %pip shim)

- docker-publish smoke + docker_confirm.sh probe Jupyter /login, not /api: the launcher always configures a password hash so /api returns 403 and curl -f would never flip the health flag (false build failure).
- entrypoint.sh CPU messaging: CPU mode covers Jupyter, GGUF tooling and llama.cpp (GGUF) Studio chat; training AND loading an Unsloth model (FastLanguageModel) still need a GPU, since from_pretrained runs CUDA probes.
- install_llama_prebuilt.py: rollback/activation moves used bare os.replace, which fails with EXDEV across overlayfs in a Docker build and fell back to a broken source build (no nvcc). Add is_cross_device_error + move_install_dir_aside (os.replace fast path, copy+remove on EXDEV; busy errors still re-raise).
- notebooks: %pip / %uv line magics and the `!python -m pip` form bypassed the PATH pip/uv shim and could overwrite the baked cu128 torch/vLLM stack. Add unsloth_nb_pip_magic.py to re-point them at the shim, wired via the IPython startup hook and installed into the venv site-packages.

chatgpt-codex-connector

💡 Codex Review

https://github.com/unslothai/unsloth/blob/f1525695e55fe5c85d3f33efb585d4bff3dcadb9/docker/unsloth_nb_content_sig.py#L256-L257
Do not drop captured body cells from signatures

For any real tutorial cell that starts with %%capture or %%bash to hide noisy output or run shell preprocessing, this helper excludes the entire cell from the middle digest even if it does not install packages. If upstream later changes that cell, middle_unchanged reports SAME, the refresh treats the notebook as only header/footer churn, and users keep stale executable content; only classify these magics as boilerplate when the cell is actually an install/setup cell.

chatgpt-codex-connector · 2026-06-27T09:07:19Z

+  $okStudio = $false; $okJupyter = $false
+  foreach ($i in 1..60) {
+    if (-not $okStudio)  { try { Invoke-WebRequest -UseBasicParsing -Uri "http://localhost:$PORT_STUDIO/api/health" -TimeoutSec 4 | Out-Null; $okStudio = $true } catch {} }
+    if (-not $okJupyter) { try { Invoke-WebRequest -UseBasicParsing -Uri "http://localhost:$PORT_JUPYTER/api"       -TimeoutSec 4 | Out-Null; $okJupyter = $true } catch {} }


Probe Jupyter login in the Windows confirmer

When the Windows confirmation script reaches this full-image check, studio_launch.sh has already configured Jupyter with a hashed password, so an unauthenticated request to /api returns 403 even when JupyterLab is healthy. The Linux confirmer and workflow use /login for this reason; leaving the PowerShell path on /api makes Windows users see a false Jupyter failure unless they authenticate the request or probe /login.

chatgpt-codex-connector · 2026-06-27T09:07:19Z

+    if re.match(r"^[a-z]+\+", token) or "://" in token or token.startswith((".", "/")):
+        return None  # vcs / url / local path -> let it pass through


Block VCS egg installs for protected packages

Fresh evidence beyond the earlier fixed PEP 508 case: legacy VCS requirements such as pip install git+https://github.com/unslothai/unsloth.git#egg=unsloth still take this URL/VCS passthrough branch, so _canon() returns None and the shim later executes the token as a real install target. In notebooks using that valid pip form, protected packages can still be reinstalled into the baked venv and bypass _KEEP; parse #egg=/editable values before treating VCS URLs as passthrough.

chatgpt-codex-connector · 2026-06-27T09:07:19Z

+# Of those value-flags, the ones whose VALUE is itself an install target: a
+# requirements file pulls real requirements. An index-url / find-links /
+# constraint / target value is an option, not something to install.
+_REQ_FILE_FLAGS = {"-r", "--requirement"}


Filter constraint files before pip sees them

When a notebook runs something like pip install -c constraints.txt peft and that constraints file pins transformers or torch, this branch keeps -c verbatim because only requirement files are inspected. Pip applies constraints to dependency resolution, so installing a kept package can still downgrade/reinstall protected packages from the constraint file without _KEEP or the sidecar marker ever seeing those specs; filter or reject protected entries in constraint files too.

chatgpt-codex-connector · 2026-06-27T09:07:19Z

+# Persistence: the update is written to the container's writable layer, so it
+# survives `docker restart`. To keep it across a full `docker rm` + `docker run`
+# (and to keep your chats/users/models), run Studio with its home on a named
+# volume: -v unsloth_studio_home:/opt/unsloth-studio


Do not mount over the Studio install home

If users follow this persistence instruction with a fresh named volume, Docker masks the baked /opt/unsloth-studio tree, including bin/unsloth and the Studio venv that supervisord.conf starts. The full image then boots with an empty home and the Studio service cannot exec; persist only a data subdirectory or seed the volume before recommending this mount.

danielhanchen · 2026-06-27T09:13:20Z

Two parts here: the WSL2 report and the Codex review on the same commit.

@LeoBorcherding thanks for the thorough Windows 11 / WSL2 pass.

The llama.cpp EXDEV failure is fixed in 2c31686. install_llama_prebuilt.py no longer uses a bare os.replace for the rollback and activation moves: is_cross_device_error plus move_install_dir_aside fall back to copy + remove on EXDEV, while busy/in-use errors still re-raise so a live install is never half-copied. The updater now survives the cross-overlayfs move inside a Docker build instead of giving up and source-building (which is broken here with no nvcc). Good catch on why it did not show in our own testing: the base already had the latest llama tag, so the updater never ran.
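
For reference, the rename-with-fallback pattern looks roughly like this. It is a minimal sketch, not the code in install_llama_prebuilt.py: `is_cross_device` and `move_dir` are hypothetical names, and the `replace` parameter exists only so the demo can simulate EXDEV without two real filesystems.

```python
import errno
import os
import shutil
import tempfile


def is_cross_device(exc: OSError) -> bool:
    """True when a rename failed because source and target are on different devices."""
    return exc.errno == errno.EXDEV


def move_dir(src: str, dst: str, replace=os.replace) -> None:
    """Move src to dst; fall back to copy + remove when the rename crosses devices."""
    try:
        replace(src, dst)  # fast path: rename on the same filesystem
    except OSError as exc:
        if not is_cross_device(exc):
            raise  # busy / permission errors are re-raised, never masked
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)


def fake_exdev_replace(src, dst):
    # Simulates what os.replace does across overlayfs layers.
    raise OSError(errno.EXDEV, "Invalid cross-device link")


root = tempfile.mkdtemp()
src = os.path.join(root, "llama-new")
dst = os.path.join(root, "llama-active")
os.mkdir(src)
with open(os.path.join(src, "llama-server"), "w") as handle:
    handle.write("binary placeholder")

move_dir(src, dst, replace=fake_exdev_replace)
moved_ok = os.path.isfile(os.path.join(dst, "llama-server"))
src_gone = not os.path.exists(src)
print(moved_ok, src_gone)
shutil.rmtree(root)
```

The copy path is not atomic, which is presumably why busy errors are re-raised rather than retried through a copy.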

On the cu128 dependency: correct, the studio build needs install.sh's UNSLOTH_TORCH_INDEX_FAMILY=cu128 support, which lives in #6692. These three PRs are stacked: #6692 lands first (or point UNSLOTH_STUDIO_REF at the branch when building before merge), #5748 is the base image, and #6681 sits on top.

Codex review 4583085472 (same commit) is also addressed in 2c31686:

Jupyter smoke probe (docker-publish workflow + docker_confirm.sh) now hits /login instead of /api. The launcher always configures a password hash, so /api returns 403 and curl -f would never flip the health flag.
entrypoint.sh CPU messaging clarified: CPU mode covers Jupyter, the GGUF tooling and llama.cpp (GGUF) Studio chat; training and loading an Unsloth model (FastLanguageModel) still need a GPU, since from_pretrained runs CUDA probes.
%pip / %uv / python -m pip: the PATH shim only intercepted !pip / !uv. Added unsloth_nb_pip_magic.py so the %pip / %uv line magics and the !python -m pip form also route through the shim and cannot overwrite the baked cu128 torch/vLLM stack.

danielhanchen · 2026-06-27T14:32:24Z

End-to-end validation of the published Docker image

I ran the published Blackwell Docker image end to end on the free GitHub-hosted Linux x64 runners. The image is danielhanchen/unsloth-blackwell-docker (unsloth-blackwell-studio.tar.gz), pulled with docker/hf_pull.sh, loaded with docker load, then booted in CPU mode (UNSLOTH_ALLOW_CPU=1) so it exercises the GGUF / llama.cpp path the image ships. Matrix: ubuntu-latest, ubuntu-24.04, ubuntu-22.04, all green.

Feature probe (identical PASS on all three runners):

| Feature | Result |
| --- | --- |
| Studio auth (login + forced password rotation) | PASS |
| Load `unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL` | PASS (is_gguf, ctx=2048) |
| Chat completion | PASS (412 char reply) |
| RAG (create KB, ingest a doc, search) | PASS (1 hit) |
| Web search tool | PASS (tool call observed) |
| llama.cpp prebuilt | PASS (tag `b9813-mix-1f1aaa4`, repo `unslothai/llama.cpp`, llama-server v9813) |
| `unsloth-llama-update --check` | PASS (up to date) |
| Branding integrity (`python -m unsloth_branding --verify`) | PASS |
| Studio versions | unsloth 2026.6.9, unsloth_zoo 2026.6.7 |

gemma-4-E4B-it-GGUF at UD-Q4_K_XL loaded and answering in the Studio chat, with the web-search tool firing live and citing sources (CPU generation ~5.7 tok/s).

(Screenshots: Studio chat with web search; live session.)

Boot path: docker load of the published tarball, then /api/health (Studio) and /login (JupyterLab) both healthy in CPU mode. The update button is covered in a separate check (staged unsloth 2026.5.5, ran unsloth-studio-update, version moved forward and the service restarted clean).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fc483ec62

chatgpt-codex-connector · 2026-07-01T10:56:48Z

            echo = False,
            name = "vLLM STDERR",
-            ready_regex = None,
+            # vLLM >= 0.19 emits "Starting vLLM API server ... on ..." (and


Preserve explicit infinite vLLM startup waits

When callers pass timeout=None, the previous Event.wait(timeout=None) waited indefinitely, which is a useful escape hatch for large models or slow first-time downloads. This expression now converts None (and 0) back to 1200 seconds, so those runs are killed after 20 minutes even though the caller explicitly disabled the timeout; handle None as an unbounded deadline instead of falling back to the default.
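
One common way to keep `None` meaning "wait forever" while still having a default is a sentinel. The sketch below assumes the regression comes from an expression of the shape `timeout or 1200` (the actual expression is not shown in this thread); `_UNSET`, `effective_timeout`, and `wait_ready` are hypothetical names.

```python
import threading

DEFAULT_TIMEOUT = 1200.0  # seconds; the 20-minute default mentioned above

_UNSET = object()  # distinguishes "not passed" from an explicit None


def effective_timeout(timeout=_UNSET):
    """Apply the default only when the caller did not pass a timeout at all."""
    return DEFAULT_TIMEOUT if timeout is _UNSET else timeout


def wait_ready(event: threading.Event, timeout=_UNSET) -> bool:
    # threading.Event.wait(None) blocks until the event is set.
    return event.wait(effective_timeout(timeout))


# The assumed buggy shape: None (and 0) silently become the default.
buggy_none = None or DEFAULT_TIMEOUT

default_case = effective_timeout()
explicit_none = effective_timeout(None)
explicit_value = effective_timeout(30)

ready = threading.Event()
ready.set()
already_ready = wait_ready(ready, timeout=None)  # returns at once: event is set
print(buggy_none, default_case, explicit_none, explicit_value, already_ready)
```

Whether an explicit `0` should mean "do not wait" or "use the default" is left to the caller here; the sentinel simply passes it through.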

chatgpt-codex-connector · 2026-07-01T10:56:48Z

+    v = requested_version()
+    if v and "transformers" not in sys.modules:
+        activate(v)


Scope transformer pins to the current kernel

This hook reads one shared /tmp/unsloth_nb/requested_transformers marker for every Jupyter kernel in the container. If two notebooks run concurrently with different install-cell pins, whichever cell writes the marker last controls the other kernel's next pre-run hook, so that notebook can activate the wrong transformers sidecar before its model cell; make the marker per-kernel/notebook rather than global.
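
A per-kernel marker could look something like the sketch below. Everything here is an assumption for illustration: the function names, the file naming scheme, and in particular the `kernel_key`, which would need to be a value that both the kernel's hook and the pip shim it spawns can see (for example an environment variable exported by the startup hook).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def marker_path(base: Path, kernel_key: str) -> Path:
    """One marker file per kernel instead of a single shared file."""
    return base / f"requested_transformers.{kernel_key}"


def write_request(base: Path, kernel_key: str, version: str) -> None:
    base.mkdir(parents=True, exist_ok=True)
    marker_path(base, kernel_key).write_text(version)


def requested_version(base: Path, kernel_key: str) -> Optional[str]:
    path = marker_path(base, kernel_key)
    if not path.exists():
        return None
    return path.read_text().strip() or None


base = Path(tempfile.mkdtemp()) / "unsloth_nb"
write_request(base, "kernel-a", "4.57.1")  # version strings are placeholders
write_request(base, "kernel-b", "5.3.0")

seen_by_a = requested_version(base, "kernel-a")
seen_by_b = requested_version(base, "kernel-b")
seen_by_c = requested_version(base, "kernel-c")
print(seen_by_a, seen_by_b, seen_by_c)
shutil.rmtree(base.parent)
```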

chatgpt-codex-connector · 2026-07-01T10:56:48Z

+    if not changed:
+        return path, None, []
+    try:
+        fd, tmp = tempfile.mkstemp(prefix = "unsloth-nb-req-", suffix = ".txt")


Keep filtered requirement files beside the original

When a requirements file is changed because a protected package was dropped, the filtered copy is written under the default temp directory. For a valid file that also contains a relative nested include such as -r extras.txt, pip resolves that include relative to the requirements file it is currently reading, so after this rewrite it looks in /tmp instead of the notebook/project directory and the install fails; create the temporary file next to path or rewrite relative include paths.
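
The first option (create the temporary file next to the original) only needs the `dir` argument of `tempfile.mkstemp`. A minimal sketch, with `write_filtered_copy` as a hypothetical name and the prefix/suffix taken from the diff above:

```python
import os
import shutil
import tempfile


def write_filtered_copy(path: str, filtered_text: str) -> str:
    """Write the filtered requirements beside `path` so relative includes still resolve."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(prefix="unsloth-nb-req-", suffix=".txt", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(filtered_text)
    return tmp


project = tempfile.mkdtemp()
original = os.path.join(project, "requirements.txt")
with open(original, "w", encoding="utf-8") as handle:
    handle.write("torch==2.10.0\n-r extras.txt\n")
with open(os.path.join(project, "extras.txt"), "w", encoding="utf-8") as handle:
    handle.write("numpy\n")

filtered = write_filtered_copy(original, "-r extras.txt\n")
same_dir = os.path.dirname(filtered) == project
# The nested include is found relative to the filtered copy's own directory.
include_resolves = os.path.exists(os.path.join(os.path.dirname(filtered), "extras.txt"))
print(same_dir, include_resolves)
shutil.rmtree(project)
```

A real implementation would still need a fallback for read-only project directories, which this sketch does not handle.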

chatgpt-codex-connector · 2026-07-01T10:56:48Z

+ && mkdir -p /root/.ipython/profile_default/startup \
+ && cp /opt/unsloth-nb/unsloth_ipython_startup.py /root/.ipython/profile_default/startup/00-unsloth-nb.py \


Install the notebook startup hook outside root home

When users start the base image with --user (common with mounted workspaces to avoid root-owned files), IPython uses that user's home rather than /root, so this startup file is never loaded. In that context UNSLOTH_NB_SHIM is not set and the PATH shim deliberately execs the real pip, letting notebook !pip/%pip cells mutate the baked torch/transformers stack; install the hook in a system-wide IPython/Jupyter startup location or otherwise enable it per kernel.

chatgpt-codex-connector · 2026-07-01T10:56:48Z

+    first = t.lstrip().split("\n", 1)[0].strip().lower()
+    return first.startswith("%%capture") or first.startswith("%%bash")


Hash substantive captured or bash notebook cells

This treats every %%capture or %%bash cell as boilerplate, even when the cell is real tutorial logic such as data prep, launches, or captured training code. For an untouched notebook where upstream changes one of those cells, both content signatures drop the changed cell and middle_unchanged can return SAME, so the boot refresh skips a substantive upstream fix; only exclude these magics after confirming they are the generated install/setup cell.
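
A narrower classifier might require an install command in the cell body before treating a `%%capture` / `%%bash` cell as boilerplate. The sketch below is one possible heuristic, not the helper in unsloth_nb_content_sig.py; `is_install_boilerplate` and the regular expression are assumptions, and a cell that both installs and does real work would still be excluded by it.

```python
import re

# Matches lines such as "!pip install x", "%pip install x", "!uv pip install x",
# "pip install x" (inside %%bash) and "python -m pip install x".
_INSTALL_RE = re.compile(
    r"^\s*[!%]?\s*(?:python\s+-m\s+)?(?:uv\s+)?pip\s+install\b",
    re.MULTILINE,
)


def is_install_boilerplate(source: str) -> bool:
    """True only for %%capture / %%bash cells whose body runs a pip install."""
    parts = source.lstrip().split("\n", 1)
    first = parts[0].strip().lower()
    if not (first.startswith("%%capture") or first.startswith("%%bash")):
        return False
    body = parts[1] if len(parts) > 1 else ""
    return bool(_INSTALL_RE.search(body))


install_cell = "%%capture\nimport os\n!pip install unsloth\n"
training_cell = "%%capture\ntrainer_stats = trainer.train()\n"
bash_prep_cell = "%%bash\npython prepare_data.py\n"
plain_cell = "!pip install unsloth\n"

flags = [is_install_boilerplate(c) for c in (install_cell, training_cell, bash_prep_cell, plain_cell)]
print(flags)
```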

chatgpt-codex-connector · 2026-07-01T10:56:48Z

+#
+# Persistence: the swap lands in the container's writable layer (survives
+# docker restart). To keep it across a full recreate, mount the prebuilt dir on
+# a named volume: -v unsloth_llama:/opt/unsloth/llama.cpp


Do not mount over the baked llama.cpp bundle

If users follow this persistence example with a fresh named volume, Docker masks /opt/unsloth/llama.cpp, including the baked binaries and converter that GGUF export and the Studio symlink rely on. The image then boots with an empty llama.cpp install and GGUF tooling fails until the update command seeds it; recommend seeding the volume first or mounting a parent/data path instead.


danielhanchen and others added 2 commits May 24, 2026 06:52

[pre-commit.ci] auto fixes from pre-commit.com hooks

a75aef0

for more information, see https://pre-commit.ci

github-advanced-security AI found potential problems May 24, 2026


Comment thread .github/workflows/docker-publish.yml Fixed

chatgpt-codex-connector Bot reviewed May 24, 2026


gemini-code-assist Bot reviewed May 24, 2026


chatgpt-codex-connector Bot reviewed May 24, 2026


danielhanchen added 3 commits May 24, 2026 07:21

chatgpt-codex-connector Bot reviewed May 24, 2026


danielhanchen added 3 commits May 24, 2026 08:01

chatgpt-codex-connector Bot reviewed May 24, 2026


danielhanchen added 2 commits May 24, 2026 09:40

chatgpt-codex-connector Bot reviewed May 24, 2026


danielhanchen added 3 commits June 26, 2026 08:20

Merge remote-tracking branch 'origin/main' into dbb-merge-main

f45c455

chatgpt-codex-connector Bot reviewed Jun 26, 2026


[pre-commit.ci] auto fixes from pre-commit.com hooks

2ee7f4b

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Jun 26, 2026


danielhanchen assigned LeoBorcherding Jun 26, 2026

chatgpt-codex-connector Bot reviewed Jun 26, 2026


danielhanchen and others added 2 commits June 27, 2026 08:46

[pre-commit.ci] auto fixes from pre-commit.com hooks

f152569

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Jun 27, 2026


danielhanchen mentioned this pull request Jun 27, 2026

docker: Colab-grade JupyterLab and Studio UX for the Unsloth image #6681

Open

danielhanchen added 2 commits June 29, 2026 05:16

docker_confirm.ps1: probe JupyterLab /login, not /api

68f5394

A Jupyter password hash is always configured, so /api returns 403; the Windows confirmation reported a healthy full image as a hard failure. Matches the fix already in docker_confirm.sh and docker-publish.yml.
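The probe described in this commit message can be sketched as follows, assuming the check only needs the HTTP status of the login page. The function names `http_status` and `jupyter_is_up` are hypothetical; the actual scripts (`docker_confirm.ps1`, `docker_confirm.sh`) are not reproduced here and may differ.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx still prove the server is running; report the code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def jupyter_is_up(base_url, timeout=5.0):
    """Treat JupyterLab as healthy when /login answers 200.

    /api is avoided because, with a password hash configured, it returns 403
    to unauthenticated clients even though the service is fine.
    """
    return http_status(base_url.rstrip("/") + "/login", timeout) == 200
```

Keeping the status lookup separate from the health decision makes it easy to log the raw code when a probe fails.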

Merge remote-tracking branch 'origin/main' into docker-blackwell-build

8fc483e

chatgpt-codex-connector Bot reviewed Jul 1, 2026




danielhanchen commented May 24, 2026

Smoke-test validation on a fresh deploy host (AWS B200, not the build host)

What was validated

Loss progression

Bugs caught and fixed during validation


danielhanchen commented May 24, 2026 (edited)