Skip to content

Auto Xet to HTTP download fallback in from_pretrained; share Studio's fallback via unsloth_zoo#6638

Open
danielhanchen wants to merge 96 commits into
mainfrom
hf-xet-fallback-wire
Open

Auto Xet to HTTP download fallback in from_pretrained; share Studio's fallback via unsloth_zoo#6638
danielhanchen wants to merge 96 commits into
mainfrom
hf-xet-fallback-wire

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Hugging Face Xet (hf_xet) is the fast default download transport, but a Xet transfer can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Unsloth Studio already recovers from this; the shared logic now lives in unsloth_zoo.hf_xet_fallback (see unslothai/unsloth-zoo#829). This PR puts it to use in two places.

Depends on unslothai/unsloth-zoo#829 (the unsloth_zoo.hf_xet_fallback / unsloth_zoo.hf_cache_state helper). Merge that first; until it lands, the unsloth-main change is a no-op (the helper import fails and the load proceeds normally).

unsloth main

FastModel / FastVisionModel / FastLanguageModel.from_pretrained now warm the repo in a killable subprocess before the in-process weight load. A new maybe_prefetch_hf_snapshot (in models/_utils.py) calls unsloth_zoo.hf_xet_fallback.snapshot_download_with_xet_fallback, which keeps Xet primary and, on a no-progress stall, kills the child, makes the partial safe for HTTP, and respawns once with HF_HUB_DISABLE_XET=1. Because the snapshot is then fully cached, the existing from_pretrained is a cache hit and cannot hang on Xet.

It is best-effort and guarded:

  • local paths, local_files_only, and offline (HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE) are skipped;
  • the vLLM fast_inference path is skipped (it has its own downloader);
  • an older unsloth_zoo without the helper is a no-op (the load proceeds unchanged);
  • a fully cached repo resolves in-process via local_files_only with no subprocess;
  • only a both-transports-stalled DownloadStallError is raised (a clear network error instead of a silent hang); any other hiccup falls through to the normal load.

The call sits inside the existing HF_HUB_ENABLE_HF_TRANSFER window in vision.py and llama.py.

Studio

studio/backend/utils/hf_xet_fallback.py is now a thin shim over unsloth_zoo.hf_xet_fallback. It re-exports the shared API (start_watchdog, get_hf_download_state, child_should_disable_xet, hf_hub_download_with_xet_fallback, snapshot_download_with_xet_fallback, DownloadStallError) and injects Studio's marker-aware prepare_cache_for_transport on the HTTP retry, so the download manager keeps its .transport marker semantics. The local copy of the watchdog + spawn-child download + Xet to HTTP retry is removed (the module drops ~450 lines). Call sites (core/inference/llama_cpp.py, core/training/worker.py) and the orchestrator's DownloadStallError import are unchanged.

The hub's hf_cache_state.py keeps its stdlib-only primitives (the hub stays import-light: its unit tests stub pydantic/fastapi and must not pull torch/transformers). Only the substantial fallback orchestration is shared.

Tests

  • studio/backend/tests/test_hf_xet_fallback.py is slimmed to the shim-specific behavior (re-export surface, and that the shim injects Studio's prepare_cache_for_transport on the HTTP retry); the full watchdog/transport matrix is tested in unsloth_zoo.
  • The GGUF and training integration tests (test_gguf_xet_fallback_integration.py, test_training_xet_fallback.py) are unchanged and pass via the shim.
  • The hub suite stays green and ML-free (hub/tests: 109 passed, ~0.5s).

End-to-end (against unsloth_zoo#829): forcing the Xet attempt to stall on a tiny real repo fires the watchdog after the timeout, kills the child, and completes the download over HTTP, returning a valid snapshot.

… via unsloth_zoo

Hugging Face Xet downloads can hang on a blob with no progress and no exception,
and a blocked native Xet thread cannot be killed in-process. Studio already
recovers from this; the shared logic now lives in unsloth_zoo.hf_xet_fallback.

unsloth main: FastModel/FastVisionModel/FastLanguageModel.from_pretrained now warm
the repo in a killable subprocess (maybe_prefetch_hf_snapshot ->
snapshot_download_with_xet_fallback) before the in-process load, so a stalled Xet
transfer auto-recovers over HTTP and the subsequent load is a cache hit that cannot
hang. Best-effort and guarded: local paths, offline/local_files_only, and the vLLM
fast_inference path are skipped; an older unsloth_zoo without the helper is a no-op.

Studio: utils/hf_xet_fallback.py is now a thin shim over unsloth_zoo.hf_xet_fallback
that injects Studio's marker-aware prepare_cache_for_transport, replacing its local
copy of the watchdog + spawn-child download + Xet->HTTP retry. Call sites and the
orchestrator's DownloadStallError import are unchanged. The hf_xet_fallback unit
test is slimmed to the shim-specific behavior (the full matrix is tested in
unsloth_zoo); the GGUF and training integration tests are unchanged.
@danielhanchen danielhanchen requested a review from Datta0 as a code owner June 24, 2026 11:39
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Xet-to-HTTP fallback mechanism by delegating the core watchdog and download logic to a shared implementation in unsloth_zoo, while maintaining a Studio-specific shim that injects marker-aware cache preparation. Additionally, it introduces a pre-download step (maybe_prefetch_hf_snapshot) in a killable subprocess for Llama and Vision models to prevent in-process hangs on stalled Xet transfers. Feedback focuses on improving robustness: first, by wrapping the cache preparation call in a try-except block to prevent unexpected failures from crashing the fallback process; and second, by properly expanding user paths and checking for local path indicators to avoid false-positive download warnings for local model directories.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +47 to 53
def _studio_prepare_for_http(repo_type: str, repo_id: str) -> None:
"""Make the partial safe for an HTTP resume using Studio's marker-aware purge,
so the download manager's ``.transport`` marker accounting stays consistent
(vs the generic delete-incompletes default in unsloth_zoo)."""
from hub.utils.download_registry import prepare_cache_for_transport
prepare_cache_for_transport(repo_type, repo_id, "http")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If prepare_cache_for_transport raises an exception (for example, due to permission issues or missing directories), it will propagate and crash the entire fallback download process, preventing the HTTP retry from starting.

To ensure robustness, wrap this call in a try...except block and log any failure as a debug message. Avoid using broad, silent exception handlers like except Exception: pass. If importing the logger fails, catch the specific ModuleNotFoundError and verify the module name to avoid suppressing unrelated import errors.

def _studio_prepare_for_http(repo_type: str, repo_id: str) -> None:
    """Make the partial safe for an HTTP resume using Studio's marker-aware purge,
    so the download manager's ``.transport`` marker accounting stays consistent
    (vs the generic delete-incompletes default in unsloth_zoo)."""
    try:
        from hub.utils.download_registry import prepare_cache_for_transport
        prepare_cache_for_transport(repo_type, repo_id, "http")
    except Exception as e:
        try:
            from loggers import get_logger
            get_logger(__name__).debug("prepare_cache_for_transport failed for %s: %s", repo_id, e)
        except ModuleNotFoundError as err:
            if err.name != "loggers":
                raise
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
  2. When catching an ImportError for an optional dependency, prefer catching the more specific ModuleNotFoundError and check the module name to avoid suppressing unrelated import errors.

Comment thread unsloth/models/_utils.py Outdated
Comment on lines +858 to +862
if not isinstance(model_name, str) or not model_name:
return
# A local directory / file path has nothing to download.
if os.path.isdir(model_name) or os.path.exists(model_name):
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Checking os.path.exists(model_name) directly will fail to detect local paths that start with ~ (since it is not expanded) or non-existent local paths (e.g., ./my_model or absolute paths). This will cause the prefetcher to attempt a download from the Hugging Face Hub, resulting in confusing false-positive warnings.

We should expand user paths using os.path.expanduser and check if the path starts with typical local path indicators (like ., /, ~, \\, or is an absolute path) to return early.

Suggested change
if not isinstance(model_name, str) or not model_name:
return
# A local directory / file path has nothing to download.
if os.path.isdir(model_name) or os.path.exists(model_name):
return
if not isinstance(model_name, str) or not model_name:
return
# A local directory / file path has nothing to download.
expanded_model_name = os.path.expanduser(model_name)
if os.path.isdir(expanded_model_name) or os.path.exists(expanded_model_name):
return
if expanded_model_name.startswith((".", "/", "~")) or "\\\\" in expanded_model_name or os.path.isabs(expanded_model_name):
return

danielhanchen and others added 8 commits June 24, 2026 11:59
…une prefetch

- Studio shim: guard the unsloth_zoo.hf_xet_fallback import and raise an actionable
  RuntimeError if the installed unsloth_zoo predates the shared helper, instead of a
  bare ModuleNotFoundError at Studio startup.
- Studio shim: wrap snapshot_download_with_xet_fallback too (not just the single-file
  path) so its HTTP retry uses Studio's marker-aware prepare_cache_for_transport.
- unsloth main: maybe_prefetch_hf_snapshot now passes a conservative ignore_patterns
  so prewarming a mixed-format repo does not pull ONNX/TF/Flax/CoreML/GGUF/training
  state the Transformers load never reads (ignore list, not an allowlist, so no needed
  file is dropped).
- Add a Studio shim test for the snapshot prep injection.
Warm the adapter repo with the Xet->HTTP stall fallback before
PeftModel.from_pretrained at both adapter-load sites in loader.py, so a
stalled Xet transfer of a saved LoRA cannot hang the in-process load the
same way it can for a base model.

Thread subfolder through maybe_prefetch_hf_snapshot and drop the
checkpoint-*/* ignore when the caller loads from a checkpoint-* subfolder,
so prewarming never skips the exact files the load needs.
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec9c6e9e5b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/utils/hf_xet_fallback.py Outdated
),
daemon = True,
try:
from unsloth_zoo.hf_xet_fallback import (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bump the zoo dependency before importing the new helper

In environments that upgrade this package while keeping an already-installed unsloth_zoo that still satisfies the current metadata floor (unsloth_zoo>=2026.6.7), importing this Studio shim now raises before any worker can start because unsloth_zoo.hf_xet_fallback is required at module import. The main Unsloth path treats the missing helper as a no-op, but Studio paths that import the watchdog/GGUF fallback fail immediately; either raise the dependency floor to the first zoo release containing this module or keep a compatible local fallback.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py
Comment on lines +911 to +915
model_name,
token = token,
revision = revision,
cache_dir = cache_dir,
ignore_patterns = ignore_patterns,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit prefetches to the requested subfolder

When callers load from a specific subfolder, this still snapshots the whole repository; the subfolder argument only affects whether checkpoint-*/* is ignored. For repos that publish multiple checkpoints or variants in sibling subfolders, the normal Transformers load would fetch only the selected subfolder, but the new warmup can force unrelated multi-GB files to download first and exhaust time or disk before the actual load starts. Restrict the snapshot with subfolder-aware patterns, or skip the prefetch for subfolder loads.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/loader.py Outdated
revision = revision,
cache_dir = kwargs.get("cache_dir"),
local_files_only = local_files_only,
fast_inference = fast_inference,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm PEFT adapters when fast inference is enabled

When a LoRA/PEFT repo is loaded with fast_inference=True, passing that flag here makes maybe_prefetch_hf_snapshot return early because it assumes vLLM will do the download. In this branch the adapter is still fetched by the following PeftModel.from_pretrained(...) call in the current process, so a stalled Xet transfer for the adapter can still hang instead of using the new HTTP fallback. The base-model vLLM path should stay skipped, but the adapter warmup should not.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/loader.py
old_model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_dir"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reuse the warmed cache for adapter loads

When a caller supplies a custom cache_dir for a LoRA/PEFT load, this prefetch warms the adapter into that non-default cache, but the following PeftModel.from_pretrained(...) call does not receive cache_dir, so it looks in the default HF cache and can start a second in-process Xet download. In that configuration the new fallback does not protect the actual adapter load and also doubles the download/storage; pass the same cache directory to PEFT or prefetch where PEFT will read.

Useful? React with 👍 / 👎.

…eview)

- _studio_prepare_for_http: wrap the marker-aware purge so a purge failure
  (locked file, missing dir) is logged at debug rather than aborting the
  HTTP retry; the shared orchestrator already guards it, this is belt-and-braces.
- maybe_prefetch_hf_snapshot: expand ~ before the on-disk check and treat an
  absolute / ~ / explicitly-relative / Windows path that is not on disk yet as
  local, so it is not misread as a Hub repo id and does not emit a confusing
  pre-download warning.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f6e137c6f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import huggingface_hub
from huggingface_hub import constants as hf_constants

import unsloth_zoo.hf_xet_fallback as shared

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make shim tests independent of missing unsloth_zoo

In the backend CI job I checked, the dependency step installs studio/backend/requirements/studio.txt plus explicit extras/torch/transformers (.github/workflows/studio-backend-ci.yml lines 64-80), but it does not install requirements/base.txt where unsloth-zoo is listed. This module-level import therefore fails test collection with ModuleNotFoundError: No module named 'unsloth_zoo' in that workflow before any tests can run; either add the dependency to that job or stub/import-skip the shared module in this test.

Useful? React with 👍 / 👎.

Comment thread studio/backend/utils/hf_xet_fallback.py Outdated
Comment on lines +21 to +28
from unsloth_zoo.hf_xet_fallback import (
DEFAULT_GRACE_PERIOD,
DEFAULT_HEARTBEAT_INTERVAL,
DEFAULT_STALL_TIMEOUT,
DownloadStallError,
child_should_disable_xet,
get_hf_download_state,
start_watchdog,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Defer zoo import until after training patches

The training worker imports child_should_disable_xet at the very start of run_training_process (core/training/worker.py line 2063), but this top-level shim import now imports unsloth_zoo immediately. That happens before the Windows ROCm torchao/runtime patches whose ordering comment explicitly says they must run before importing transformers/unsloth_zoo (core/training/worker.py lines 2326-2331), so Windows ROCm training workers can hit the same import-time failures those patches are meant to prevent. Keep the early child_should_disable_xet path stdlib-only or move the zoo import until after the worker setup.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py
Comment on lines +834 to +835
"*.h5",
"*.msgpack",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not skip requested TF or Flax weights

When callers pass from_tf=True or from_flax=True, those kwargs are forwarded to the later Transformers from_pretrained call, which then needs tf_model.h5 or flax_model.msgpack from the repo. The new warmup ignores those files unconditionally, so it can report success while leaving the actual requested weights uncached; the following in-process load then starts the Xet download that this fallback was supposed to make killable. Make these ignores conditional on the load format, or pass enough context into the prefetch helper to avoid dropping requested weights.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/llama.py Outdated
Comment on lines +2379 to +2383
maybe_prefetch_hf_snapshot(
model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_dir"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move direct Llama warmup after config validation

For direct FastLlamaModel.from_pretrained(...) calls, this now snapshots the whole remote repo before the AutoConfig/architecture checks below run. If the repo is unsupported by the installed Transformers version or is not a Llama-compatible model, the old path failed after fetching only config metadata, but this path can download multi-GB weights first and then raise the same validation error; move the warmup until after the config/model-class validation succeeds.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f6e137c6f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/utils/hf_xet_fallback.py Outdated
),
daemon = True,
try:
from unsloth_zoo.hf_xet_fallback import (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bump unsloth_zoo before importing shared fallback

The package metadata still allows unsloth_zoo>=2026.6.7 (unchanged in pyproject.toml), but this new top-level import now requires the newer unsloth_zoo.hf_xet_fallback module. In an environment that satisfies the declared dependency with the older zoo package, importing Studio's utils.hf_xet_fallback raises during startup (for example through the orchestrator/llama_cpp imports), whereas the previous local implementation worked; please bump/pin the required zoo version or keep a compatibility fallback.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/loader.py Outdated
revision = revision,
cache_dir = kwargs.get("cache_dir"),
local_files_only = local_files_only,
fast_inference = fast_inference,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm adapters despite fast_inference

When loading a PEFT adapter with fast_inference=True, this passes the flag into maybe_prefetch_hf_snapshot, whose fast-inference guard returns immediately because vLLM handles base-model downloads. The next call still invokes PeftModel.from_pretrained for the adapter in-process, so vLLM/LoRA loads get no killable predownload and can still hang on a stalled Xet adapter transfer; pass False for adapter prewarming or make the skip apply only to base-model loads.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/loader.py
old_model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_dir"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm adapters in the cache PEFT will read

When callers provide a custom cache_dir, this predownloads the adapter into that cache, but the PeftModel.from_pretrained call below is made without the same cache_dir (and without forwarding **kwargs). PEFT therefore looks in the default Hub cache and can still perform the in-process Xet download, while the new prewarm only adds an extra download; pass the same cache directory through to PEFT or avoid warming a different cache.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f6e137c6f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/utils/hf_xet_fallback.py Outdated
),
daemon = True,
try:
from unsloth_zoo.hf_xet_fallback import (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bump unsloth_zoo minimum for the Studio shim

In environments that upgrade unsloth without also upgrading unsloth_zoo, this import now runs at Studio startup and raises if the installed unsloth_zoo still satisfies the current package floor but predates the new unsloth_zoo.hf_xet_fallback module. The main prefetch path gracefully no-ops when the helper is missing, but Studio becomes unstartable under an allowed dependency set; please bump the unsloth_zoo requirement to the first version containing this helper or keep a local fallback.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/loader.py
old_model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_dir"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm the same PEFT cache that the adapter load reads

When a PEFT load is called with cache_dir=..., this prefetch writes the adapter snapshot to that custom cache, but the following PeftModel.from_pretrained(...) call does not pass cache_dir (same pattern in FastModel below), so PEFT still resolves the adapter from the default HF cache and may perform the Xet download in-process. In that context the fallback is ineffective and the adapter can be downloaded twice; pass the same cache settings into PEFT or prefetch the cache it actually uses.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py
)
]
try:
snapshot_download_with_xet_fallback(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict snapshot prefetch to the files Transformers will load

For repos that publish duplicate weight formats/variants (for example both model.safetensors and pytorch_model.bin shards), this whole-snapshot call downloads every non-ignored file before Transformers has selected the actual checkpoint. A normal from_pretrained would only fetch the chosen weights, so these repos can now spend tens of GB of extra network/disk and fail with disk-full before the real load; derive allow patterns from the selected config/index or otherwise skip unused weight sets.

Useful? React with 👍 / 👎.

danielhanchen and others added 4 commits June 25, 2026 08:18
force_download=True previously bypassed the killable prefetch: the warm-cache
short-circuit no-oped, then Transformers re-fetched in-process over the Xet path
this avoids. Thread force_download through maybe_prefetch_hf_snapshot and every
call site so a forced reload also runs in the killable child.
…w 3)

Studio's shim hard-imported unsloth_zoo.hf_xet_fallback at startup, so a Studio
install whose unsloth_zoo predates the shared helper crashed at boot. Instead of
raising, fall back to thin stubs: plain HF snapshot/file downloads with the
no-progress watchdog disabled (the same best-effort posture core Unsloth uses in
from_pretrained), a DownloadStallError mirror, and no-op watchdog/state helpers.
Studio boots and downloads work; automatic Xet -> HTTP recovery returns as soon
as unsloth_zoo is upgraded. These are thin stubs, not a second copy of the
orchestration. Verified against the installed older unsloth_zoo and added a
degraded-path test (5 passed).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed16544cf8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread unsloth/models/_utils.py
revision = revision,
cache_dir = cache_dir,
ignore_patterns = ignore_patterns,
force_download = force_download,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip prefetches when force_download is requested

When callers pass force_download=True, this warmup also forces a download, but the original force_download kwarg is still forwarded to the later Transformers/PEFT from_pretrained call, so that load bypasses the warmed cache and starts a second in-process Hub transfer. In that scenario the new fallback does not protect the actual load from a stalled Xet download and also doubles network/disk work; skip the prefetch for forced downloads or consume/clear the flag before the final load.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py
)
]
try:
snapshot_download_with_xet_fallback(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward proxy settings into the prefetch

In environments that require callers to pass proxies through from_pretrained(..., proxies=...), the later Transformers load would use that proxy, but this prefetch path calls the Hub snapshot helper without any of the caller's proxy settings. If the direct connection stalls rather than failing immediately, DownloadStallError is raised here before the proxied normal load can run, so proxy-only users can fail on the new warmup despite having supplied valid download settings; forward the relevant Hub kwargs or skip the prefetch when they are present.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/llama.py Outdated
# to HTTP on a no-progress stall, so the in-process load below is a cache
# hit and cannot hang on a stalled Xet transfer.
maybe_prefetch_hf_snapshot(
model_name,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm separate tokenizer repositories

When tokenizer_name is set to a different remote repo, this warmup only snapshots model_name, but the tokenizer is loaded later via load_correct_tokenizer(tokenizer_name=...) in the same process. In that configuration the model weights are protected by the Xet fallback, while the tokenizer repo can still start an in-process Hub/Xet download and hang; prefetch the tokenizer repo as well when it is remote and distinct from the model repo.

Useful? React with 👍 / 👎.

…raded shim (review 4)

- maybe_prefetch_hf_snapshot now returns whether it warmed the cache, and every
  call site clears force_download after a successful warm. force_download=True
  previously warmed the cache in the killable child and then re-forced the same
  download in-process over Xet, defeating the protection.
- Add the prefetch guard to the diffusion FastModel path (diffusion.py), which
  loaded weights via model_cls.from_pretrained without the Xet->HTTP recovery the
  vision/language paths get.
- Degraded Studio shim (older unsloth_zoo) now honors cancel_event: it raises
  before starting and after finishing the plain HF download instead of dropping
  cancellation via **kwargs.
- Prewarm skips redundant *.bin when the repo also ships safetensors (best-effort
  model_info), so a dual-format repo is not double-downloaded before Transformers
  picks safetensors. Threaded use_safetensors through the call sites.
- Studio shim test imports the shared helper optionally so the degraded-path test
  runs on an unsloth_zoo without it; added a degraded cancellation test.
…nsloth into hf-xet-fallback-wire

# Conflicts:
#	studio/backend/utils/hf_xet_fallback.py
…alidation

Two paths let a real in-process weight download run without the Xet stall
fallback, or ran the prefetch before it could fail locally:

- A num_labels classification load takes the AutoModelForSequenceClassification
  branch (an in-process download) even under fast_inference=True, but the llama
  prefetch skipped the warm whenever fast_inference was set, so those weights
  fetched over un-killable Xet. Gate the skip on num_labels is None, so only a
  real vLLM-owned load bypasses the warm.

- The FastBaseModel (vision / FastModel) prefetch ran before the mutually-
  exclusive load_in_4bit/8bit/16bit validation, so an invalid combination
  downloaded a multi-GB snapshot before failing. SentenceTransformer already
  moved that check ahead of its prefetch; the vision loader cannot, because
  check_and_disable_bitsandbytes_loading only resolves the effective flags after
  the config fetch (moving the check earlier would false-reject a valid load of
  an already-quantized model). Move the base + tokenizer prefetch to run after
  the validation instead, so an invalid call fails without downloading. The moved
  block is unchanged and still precedes the in-process load and the vLLM
  fast_inference_setup remap.

Adds static AST regression tests asserting the vision prefetch follows the
load-mode validation and the llama prefetch's fast_inference skip is gated on
num_labels.
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a89ec0da0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth/models/sentence_transformer.py Outdated
model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_folder") or os.environ.get("SENTENCE_TRANSFORMERS_HOME"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the FastModel cache for ST fallback prefetch

When FastSentenceTransformer falls through to the FastModel.from_pretrained path below, that weight load does not consume SENTENCE_TRANSFORMERS_HOME or cache_folder as a Hugging Face cache_dir (it is invoked later without a cache_dir argument). With cache_folder/SENTENCE_TRANSFORMERS_HOME set, this prefetch warms the SentenceTransformers cache, but the actual FastModel weight load uses the default HF cache, so it misses the warm snapshot and can still perform the unprotected in-process Xet download the change is meant to avoid.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py
Comment on lines +1216 to +1217
except Exception:
return False # older unsloth_zoo without the helper: load normally

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require a zoo build that contains the fallback helper

In environments that satisfy the current declared dependency (unsloth_zoo>=2026.6.7) but predate the new unsloth_zoo.hf_xet_fallback module, this broad import fallback silently returns False, so every new maybe_prefetch_hf_snapshot call becomes a no-op and the protected Xet→HTTP pre-download never runs. Unless the package minimum is bumped to the first zoo release containing this helper, normal installs can keep hitting the same unkillable in-process Xet downloads this change is meant to avoid.

Useful? React with 👍 / 👎.

The custom SentenceTransformer fallback path prefetches into the cache the ST
load uses (cache_folder, else SENTENCE_TRANSFORMERS_HOME), and the module-file
loads were already resolved to that same cache. But the model WEIGHTS load runs
through FastModel.from_pretrained, which resolves its cache from the HF cache_dir
and never reads cache_folder / SENTENCE_TRANSFORMERS_HOME. With either set, the
weights loaded from the default HF cache and missed the warm, so they still ran
the unprotected in-process Xet download the prefetch is meant to avoid.

Resolve the same cache into kwargs["cache_dir"] before the FastModel load (when a
custom ST cache is in play and the caller passed no explicit HF cache_dir, which
wins), so the weights hit the warm. A None resolution leaves the default HF
cache, matching the prefetch.

Adds a static AST regression test asserting from_pretrained resolves the ST cache
into kwargs["cache_dir"] before the FastModel.from_pretrained call.
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f5f8da7b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth/models/sentence_transformer.py Outdated
model_name,
token = token,
revision = revision,
cache_dir = kwargs.get("cache_folder") or os.environ.get("SENTENCE_TRANSFORMERS_HOME"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor explicit HF cache_dir for ST prefetches

When FastSentenceTransformer.from_pretrained(..., cache_dir=...) falls through to the FastModel path, this prefetch warms cache_folder/SENTENCE_TRANSFORMERS_HOME (or the default cache) and ignores the explicit HF cache_dir. Later the fallback weight load preserves and forwards kwargs["cache_dir"], so the warmed snapshot is missed and the weights can still be downloaded in-process over Xet from the custom cache. Include kwargs.get("cache_dir") in this cache resolution so the warm and load target the same cache.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/_utils.py Outdated
Comment on lines +1169 to +1171
_filename_has_variant(sibling.rfilename, variant)
if variant
else _is_canonical_model_weight_safetensors(sibling.rfilename)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep variant .bin unless the safetensors name is loadable

For variant loads, this treats any in-scope safetensors whose basename contains .<variant>. or .<variant>- as proving the .bin weights are redundant. Transformers variant loading looks for the canonical model.<variant>.safetensors/index/shard names, so a repo with an unrelated sidecar such as consolidated.fp16.safetensors plus the actual pytorch_model.fp16.bin will add *.bin to ignore_patterns; the prefetch then skips the only loadable weights and from_pretrained falls back to an unprotected in-process Xet download.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/vision.py
Comment on lines +947 to +951
_warm_tokenizer_repo = (
isinstance(_tokenizer_repo, str)
and bool(_tokenizer_repo)
and _tokenizer_repo != model_name
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm the vLLM path tokenizer after remapping

When FastVisionModel runs with fast_inference=True and no separate tokenizer_name, _vllm_owns_weights skips the base snapshot warm, but this condition also skips the tokenizer warm because _tokenizer_repo == model_name. Later fast_inference_setup may remap model_name, then the code sets tokenizer_name = model_name and calls AutoProcessor/AutoTokenizer in-process, so processor/tokenizer files can still fetch over unprotected Xet. Warm the final tokenizer repo after remap (or include fast_inference in this condition) so the cache hit is guaranteed.

Useful? React with 👍 / 👎.

use_safetensors = kwargs.get("use_safetensors"),
# Diffusion variants (variant="fp16") are common: forward it so the warm never drops a
# variant .bin for a non-variant safetensors.
variant = kwargs.get("variant"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward the diffusion variant to the real load

When FastDiffusionModel.from_pretrained(..., variant="fp16") is used, the prefetch warms the variant-specific files, but load_kwargs never includes variant, so model_cls.from_pretrained still asks Transformers for the default weight variant. That misses the warmed cache (and can load the wrong precision or fetch the default weights in-process over Xet), so the same variant value needs to be forwarded to the actual load.

Useful? React with 👍 / 👎.

…ess Xet fetch

Four cases where the prewarm and the real load targeted different files or
different caches, so the load still ran an unprotected in-process Xet download:

- ST prefetch ignored an explicit HF cache_dir. FastSentenceTransformer resolves
  the prefetch (and fallback module-load) cache from cache_folder /
  SENTENCE_TRANSFORMERS_HOME, but the FastModel fallback weight load forwards
  kwargs["cache_dir"]. A caller passing cache_dir warmed one cache and read
  another, missing the warm. The three ST cache resolutions now honor
  kwargs.get("cache_dir") first.

- Variant .bin dropped for a non-canonical safetensors sidecar. For a variant
  load the prewarm dropped *.bin as soon as ANY variant-infixed safetensors was
  shipped, so a repo with consolidated.fp16.safetensors (a sidecar the load does
  not read) plus the real pytorch_model.fp16.bin had its only loadable weights
  dropped from the warm. The check now requires a CANONICAL model variant name
  (model.<variant>.safetensors, either shard infix, or the variant index) via
  _is_canonical_variant_model_weight_safetensors, matching the strictness already
  used for the no-variant path.

- vLLM path never warmed the tokenizer. With fast_inference and no separate
  tokenizer_name, the base warm is skipped (vLLM owns the weights) and the
  tokenizer warm was deferred because fast_inference_setup may remap model_name.
  The final tokenizer repo was then loaded in-process (AutoProcessor /
  AutoTokenizer) with no warm. Warm the final tokenizer repo (tokenizer-only)
  after the remap on the vLLM path.

- Diffusion variant not forwarded to the load. FastDiffusionModel forwarded
  variant to the prefetch but not to model_cls.from_pretrained, so the pipeline
  asked for the default weight variant -- missing the warmed variant weights
  (wrong precision, or a default weight a variant-only repo may not ship, fetched
  in-process). load_kwargs now forwards variant.

Adds regression tests for each (a non-canonical sidecar keeps its .bin, the ST
cache resolutions honor cache_dir, the vLLM tokenizer warm runs after the remap,
and the diffusion load forwards variant).
@danielhanchen

Copy link
Copy Markdown
Member Author

All four are real and fixed in e883f71:

  • Honor explicit HF cache_dir for ST prefetches (P2): the three ST cache resolutions (the prefetch and the fallback module loads) now resolve kwargs.get("cache_dir") before cache_folder / SENTENCE_TRANSFORMERS_HOME, so a caller passing cache_dir warms and reads the same cache. The FastModel fallback load already forwards kwargs["cache_dir"].
  • Keep variant .bin unless the safetensors name is loadable (P2): the variant safetensors detector now requires a CANONICAL model variant name (model.<variant>.safetensors, either shard infix, or model.safetensors.index.<variant>.json) via _is_canonical_variant_model_weight_safetensors, so a non-canonical sidecar like consolidated.fp16.safetensors no longer causes the real pytorch_model.fp16.bin to be dropped from the warm. This matches the strictness already used on the no-variant path.
  • Warm the vLLM path tokenizer after remapping (P2): on the fast_inference path the base warm is skipped and the tokenizer warm was deferred (the comment notes fast_inference_setup may remap model_name). The final tokenizer repo is now warmed (tokenizer-only) after the remap, so the in-process AutoProcessor / AutoTokenizer load is a cache hit.
  • Forward the diffusion variant to the real load (P2): FastDiffusionModel now forwards variant to model_cls.from_pretrained (not only to the prefetch), so it reads the warmed variant weights instead of asking for the default variant.

Regression tests added for each (35 -> 42 in the prefetch-scope suite).

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

2 similar comments
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08ba451756

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1457 to +1459
cache_dir = kwargs.get("cache_dir")
or kwargs.get("cache_folder")
or os.environ.get("SENTENCE_TRANSFORMERS_HOME"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Align ST prefetch cache with native loads

When the caller passes cache_dir but not cache_folder, this prefetch warms that explicit HF cache, but the native SentenceTransformer(...) paths do not read it: the inference branch only copies cache_folder into st_kwargs, and the fast-encoder branch passes cache_folder = kwargs.get("cache_folder"). In those contexts the actual load falls back to the default/ST cache, missing the warmed snapshot, so large repos can be downloaded twice and the real weight fetch still happens in-process without the Xet fallback this change is meant to provide. Either don't prefer cache_dir for native ST loads or translate it into the cache_folder passed to SentenceTransformer.

Useful? React with 👍 / 👎.

Comment thread unsloth/models/sentence_transformer.py Outdated
# Resolve the cache the load uses: an explicit HF cache_dir wins (the FastModel fallback load
# forwards it), else cache_folder, else SENTENCE_TRANSFORMERS_HOME (which ST honors when
# cache_folder is unset), else the default HF cache -- a wrong-cache warm would be missed.
maybe_prefetch_hf_snapshot(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid whole-snapshot warm for root-only ST fallback

When FastSentenceTransformer is used on a plain encoder repo with no modules.json, the code later falls back to FastModel.from_pretrained, which reads only root weights. This unconditional prefetch leaves weights_at_root at its default False, so _prefetch_ignore_patterns treats the repo as multi-component: mixed safetensors/bin root repos keep both formats and arbitrary subdir weights are not excluded. In that fallback path this can pull multi-GB weights the actual load never reads before the real load starts; determine modules.json first or pass weights_at_root=True for the no-modules case.

Useful? React with 👍 / 👎.

# Pre-download the confirmed diffusion repo (Xet -> HTTP on a stall) so the weight load is a cache
# hit. subfolder is NOT forwarded: the pipeline loads the whole repo root (every component
# subfolder), so narrowing to one would leave unet/, vae/, text_encoder/ to in-process Xet.
maybe_prefetch_hf_snapshot(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark diffusion prefetch as root-weight load

FastDiffusionModel resolves a Transformers model class and later calls model_cls.from_pretrained(...) on the repo root, not a diffusers-style multi-component pipeline. Leaving this warm at the default weights_at_root=False sends _prefetch_ignore_patterns down the whole-snapshot path, so even use_safetensors=True or safetensors-preferred mixed repos keep redundant .bin weights and nested weight dirs eligible for download. For mixed-format diffusion checkpoints this can pull a second multi-GB weight set that the real load never reads; pass weights_at_root=True for this prefetch.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08ba451756

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1457 to +1459
cache_dir = kwargs.get("cache_dir")
or kwargs.get("cache_folder")
or os.environ.get("SENTENCE_TRANSFORMERS_HOME"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Warm the cache SentenceTransformer will read

When a caller passes cache_dir to FastSentenceTransformer.from_pretrained and the model takes the for_inference or fast-encoder SentenceTransformer(...) branch, this prefetch writes the snapshot into that HF cache, but the later native SentenceTransformer loads only forward cache_folder (or nothing), not cache_dir. In that scenario the real load misses the warmed snapshot and can still perform an in-process Hub/Xet download, defeating the stall fallback and duplicating the transfer; either map cache_dir to cache_folder for those branches or avoid preferring it for this prefetch.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits July 2, 2026 05:22
…ansformer loads

The ST prefetch warms the cache resolved as cache_dir -> cache_folder ->
SENTENCE_TRANSFORMERS_HOME (an explicit HF cache_dir wins). But the for_inference and
fast-encoder branches then construct a native SentenceTransformer, which takes
cache_folder, not cache_dir: for_inference copied only a caller-supplied cache_folder
into st_kwargs, and the fast-encoder passed cache_folder = kwargs.get("cache_folder").
So a caller passing cache_dir warmed one cache and the native load read another
(cache_folder / SENTENCE_TRANSFORMERS_HOME / the default HF cache), missing the warm and
starting an unprotected in-process Hub/Xet download that defeats the stall fallback and
duplicates the transfer.

Both native loads now set cache_folder = kwargs.get("cache_dir") or
kwargs.get("cache_folder"): the explicit HF cache_dir wins, else the caller's
cache_folder, else None so ST honors SENTENCE_TRANSFORMERS_HOME -- matching the
prefetch's resolution so the native load hits the warmed cache. The FastModel fallback
branch already resolves cache_dir (unchanged). Adds a static regression guard.
@danielhanchen

Copy link
Copy Markdown
Member Author

Real and fixed in 6d4ce26.

  • Warm the cache SentenceTransformer will read (P2): the ST prefetch resolves its cache as cache_dir -> cache_folder -> SENTENCE_TRANSFORMERS_HOME (an explicit HF cache_dir wins), but the for_inference and fast-encoder branches then construct a native SentenceTransformer, which takes cache_folder, not cache_dir: for_inference copied only a caller-supplied cache_folder into its kwargs, and the fast-encoder passed cache_folder = kwargs.get("cache_folder"). So a caller passing cache_dir warmed one cache and the native load read another, missing the warm and starting an unprotected in-process Hub/Xet download. Both native loads now set cache_folder = kwargs.get("cache_dir") or kwargs.get("cache_folder") (cache_dir wins, else cache_folder, else None so ST honors SENTENCE_TRANSFORMERS_HOME), matching the prefetch's resolution. The FastModel fallback branch already resolves cache_dir (unchanged). Static regression guard added.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

Reviewed commit: 7a6fc8d89c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Condense PR-added comments/docstrings across the loader prefetch wiring (_utils.py, vision/llama/loader/diffusion/sentence_transformer.py, tokenizer_utils.py), the prefetch-scope tests, and the Studio Xet->HTTP shim (backend utils + tests, model-selector tsx). Comments/docstrings/JSDoc only, no code changed (AST + TS-compiler verified). Preserves the prefetch-placement and Studio marker-injection rationale tersely. Prefetch-scope tests 43 passed; Studio backend tests 18 passed.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

From a consolidated review pass (12x reviewer.py personas + 3 forks). The
SentenceTransformer prefetch now forwards force_download and clears it after a
successful warm, matching the llama / vision / diffusion sites. Before, a
force_download=True ST load warmed without forcing, so a stale cache could be
reused or the refresh could fall to an unguarded in-process Hub/Xet download.

Generalises the ST prefetch-wiring AST test to accept the return-captured call.
Prefetch suite 43 passed.

Not taken: the llama / vision base-load prefetch omitting revision is correct --
revision is a named from_pretrained parameter that never reaches the in-process
AutoModel load (which resolves model_name on its default branch), so the warm
matches the load; forwarding it would warm the wrong branch. The degraded Studio
shim dropping local_files_only is not production-reachable (its GGUF callers never
pass it, and the public snapshot wrapper is test-only).
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant