Auto Xet to HTTP download fallback in from_pretrained; share Studio's fallback via unsloth_zoo#6638
Auto Xet to HTTP download fallback in from_pretrained; share Studio's fallback via unsloth_zoo#6638danielhanchen wants to merge 96 commits into
Conversation
… via unsloth_zoo Hugging Face Xet downloads can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Studio already recovers from this; the shared logic now lives in unsloth_zoo.hf_xet_fallback. unsloth main: FastModel/FastVisionModel/FastLanguageModel.from_pretrained now warm the repo in a killable subprocess (maybe_prefetch_hf_snapshot -> snapshot_download_with_xet_fallback) before the in-process load, so a stalled Xet transfer auto-recovers over HTTP and the subsequent load is a cache hit that cannot hang. Best-effort and guarded: local paths, offline/local_files_only, and the vLLM fast_inference path are skipped; an older unsloth_zoo without the helper is a no-op. Studio: utils/hf_xet_fallback.py is now a thin shim over unsloth_zoo.hf_xet_fallback that injects Studio's marker-aware prepare_cache_for_transport, replacing its local copy of the watchdog + spawn-child download + Xet->HTTP retry. Call sites and the orchestrator's DownloadStallError import are unchanged. The hf_xet_fallback unit test is slimmed to the shim-specific behavior (the full matrix is tested in unsloth_zoo); the GGUF and training integration tests are unchanged.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request refactors the Xet-to-HTTP fallback mechanism by delegating the core watchdog and download logic to a shared implementation in unsloth_zoo, while maintaining a Studio-specific shim that injects marker-aware cache preparation. Additionally, it introduces a pre-download step (maybe_prefetch_hf_snapshot) in a killable subprocess for Llama and Vision models to prevent in-process hangs on stalled Xet transfers. Feedback focuses on improving robustness: first, by wrapping the cache preparation call in a try-except block to prevent unexpected failures from crashing the fallback process; and second, by properly expanding user paths and checking for local path indicators to avoid false-positive download warnings for local model directories.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| def _studio_prepare_for_http(repo_type: str, repo_id: str) -> None: | ||
| """Make the partial safe for an HTTP resume using Studio's marker-aware purge, | ||
| so the download manager's ``.transport`` marker accounting stays consistent | ||
| (vs the generic delete-incompletes default in unsloth_zoo).""" | ||
| from hub.utils.download_registry import prepare_cache_for_transport | ||
| prepare_cache_for_transport(repo_type, repo_id, "http") | ||
|
|
There was a problem hiding this comment.
If prepare_cache_for_transport raises an exception (for example, due to permission issues or missing directories), it will propagate and crash the entire fallback download process, preventing the HTTP retry from starting.
To ensure robustness, wrap this call in a try...except block and log any failure as a debug message. Avoid using broad, silent exception handlers like except Exception: pass. If importing the logger fails, catch the specific ModuleNotFoundError and verify the module name to avoid suppressing unrelated import errors.
def _studio_prepare_for_http(repo_type: str, repo_id: str) -> None:
"""Make the partial safe for an HTTP resume using Studio's marker-aware purge,
so the download manager's ``.transport`` marker accounting stays consistent
(vs the generic delete-incompletes default in unsloth_zoo)."""
try:
from hub.utils.download_registry import prepare_cache_for_transport
prepare_cache_for_transport(repo_type, repo_id, "http")
except Exception as e:
try:
from loggers import get_logger
get_logger(__name__).debug("prepare_cache_for_transport failed for %s: %s", repo_id, e)
except ModuleNotFoundError as err:
if err.name != "loggers":
raiseReferences
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
- When catching an ImportError for an optional dependency, prefer catching the more specific ModuleNotFoundError and check the module name to avoid suppressing unrelated import errors.
| if not isinstance(model_name, str) or not model_name: | ||
| return | ||
| # A local directory / file path has nothing to download. | ||
| if os.path.isdir(model_name) or os.path.exists(model_name): | ||
| return |
There was a problem hiding this comment.
Checking os.path.exists(model_name) directly will fail to detect local paths that start with ~ (since it is not expanded) or non-existent local paths (e.g., ./my_model or absolute paths). This will cause the prefetcher to attempt a download from the Hugging Face Hub, resulting in confusing false-positive warnings.
We should expand user paths using os.path.expanduser and check if the path starts with typical local path indicators (like ., /, ~, \\, or is an absolute path) to return early.
| if not isinstance(model_name, str) or not model_name: | |
| return | |
| # A local directory / file path has nothing to download. | |
| if os.path.isdir(model_name) or os.path.exists(model_name): | |
| return | |
| if not isinstance(model_name, str) or not model_name: | |
| return | |
| # A local directory / file path has nothing to download. | |
| expanded_model_name = os.path.expanduser(model_name) | |
| if os.path.isdir(expanded_model_name) or os.path.exists(expanded_model_name): | |
| return | |
| if expanded_model_name.startswith((".", "/", "~")) or "\\\\" in expanded_model_name or os.path.isabs(expanded_model_name): | |
| return |
…une prefetch - Studio shim: guard the unsloth_zoo.hf_xet_fallback import and raise an actionable RuntimeError if the installed unsloth_zoo predates the shared helper, instead of a bare ModuleNotFoundError at Studio startup. - Studio shim: wrap snapshot_download_with_xet_fallback too (not just the single-file path) so its HTTP retry uses Studio's marker-aware prepare_cache_for_transport. - unsloth main: maybe_prefetch_hf_snapshot now passes a conservative ignore_patterns so prewarming a mixed-format repo does not pull ONNX/TF/Flax/CoreML/GGUF/training state the Transformers load never reads (ignore list, not an allowlist, so no needed file is dropped). - Add a Studio shim test for the snapshot prep injection.
…et-fallback-wire
for more information, see https://pre-commit.ci
Warm the adapter repo with the Xet->HTTP stall fallback before PeftModel.from_pretrained at both adapter-load sites in loader.py, so a stalled Xet transfer of a saved LoRA cannot hang the in-process load the same way it can for a base model. Thread subfolder through maybe_prefetch_hf_snapshot and drop the checkpoint-*/* ignore when the caller loads from a checkpoint-* subfolder, so prewarming never skips the exact files the load needs.
…nsloth into hf-xet-fallback-wire
for more information, see https://pre-commit.ci
|
@codex review |
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec9c6e9e5b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ), | ||
| daemon = True, | ||
| try: | ||
| from unsloth_zoo.hf_xet_fallback import ( |
There was a problem hiding this comment.
Bump the zoo dependency before importing the new helper
In environments that upgrade this package while keeping an already-installed unsloth_zoo that still satisfies the current metadata floor (unsloth_zoo>=2026.6.7), importing this Studio shim now raises before any worker can start because unsloth_zoo.hf_xet_fallback is required at module import. The main Unsloth path treats the missing helper as a no-op, but Studio paths that import the watchdog/GGUF fallback fail immediately; either raise the dependency floor to the first zoo release containing this module or keep a compatible local fallback.
Useful? React with 👍 / 👎.
| model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = cache_dir, | ||
| ignore_patterns = ignore_patterns, |
There was a problem hiding this comment.
Limit prefetches to the requested subfolder
When callers load from a specific subfolder, this still snapshots the whole repository; the subfolder argument only affects whether checkpoint-*/* is ignored. For repos that publish multiple checkpoints or variants in sibling subfolders, the normal Transformers load would fetch only the selected subfolder, but the new warmup can force unrelated multi-GB files to download first and exhaust time or disk before the actual load starts. Restrict the snapshot with subfolder-aware patterns, or skip the prefetch for subfolder loads.
Useful? React with 👍 / 👎.
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), | ||
| local_files_only = local_files_only, | ||
| fast_inference = fast_inference, |
There was a problem hiding this comment.
Warm PEFT adapters when fast inference is enabled
When a LoRA/PEFT repo is loaded with fast_inference=True, passing that flag here makes maybe_prefetch_hf_snapshot return early because it assumes vLLM will do the download. In this branch the adapter is still fetched by the following PeftModel.from_pretrained(...) call in the current process, so a stalled Xet transfer for the adapter can still hang instead of using the new HTTP fallback. The base-model vLLM path should stay skipped, but the adapter warmup should not.
Useful? React with 👍 / 👎.
| old_model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), |
There was a problem hiding this comment.
Reuse the warmed cache for adapter loads
When a caller supplies a custom cache_dir for a LoRA/PEFT load, this prefetch warms the adapter into that non-default cache, but the following PeftModel.from_pretrained(...) call does not receive cache_dir, so it looks in the default HF cache and can start a second in-process Xet download. In that configuration the new fallback does not protect the actual adapter load and also doubles the download/storage; pass the same cache directory to PEFT or prefetch where PEFT will read.
Useful? React with 👍 / 👎.
…eview) - _studio_prepare_for_http: wrap the marker-aware purge so a purge failure (locked file, missing dir) is logged at debug rather than aborting the HTTP retry; the shared orchestrator already guards it, this is belt-and-braces. - maybe_prefetch_hf_snapshot: expand ~ before the on-disk check and treat an absolute / ~ / explicitly-relative / Windows path that is not on disk yet as local, so it is not misread as a Hub repo id and does not emit a confusing pre-download warning.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f6e137c6f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| import huggingface_hub | ||
| from huggingface_hub import constants as hf_constants | ||
|
|
||
| import unsloth_zoo.hf_xet_fallback as shared |
There was a problem hiding this comment.
Make shim tests independent of missing unsloth_zoo
In the backend CI job I checked, the dependency step installs studio/backend/requirements/studio.txt plus explicit extras/torch/transformers (.github/workflows/studio-backend-ci.yml lines 64-80), but it does not install requirements/base.txt where unsloth-zoo is listed. This module-level import therefore fails test collection with ModuleNotFoundError: No module named 'unsloth_zoo' in that workflow before any tests can run; either add the dependency to that job or stub/import-skip the shared module in this test.
Useful? React with 👍 / 👎.
| from unsloth_zoo.hf_xet_fallback import ( | ||
| DEFAULT_GRACE_PERIOD, | ||
| DEFAULT_HEARTBEAT_INTERVAL, | ||
| DEFAULT_STALL_TIMEOUT, | ||
| DownloadStallError, | ||
| child_should_disable_xet, | ||
| get_hf_download_state, | ||
| start_watchdog, |
There was a problem hiding this comment.
Defer zoo import until after training patches
The training worker imports child_should_disable_xet at the very start of run_training_process (core/training/worker.py line 2063), but this top-level shim import now imports unsloth_zoo immediately. That happens before the Windows ROCm torchao/runtime patches whose ordering comment explicitly says they must run before importing transformers/unsloth_zoo (core/training/worker.py lines 2326-2331), so Windows ROCm training workers can hit the same import-time failures those patches are meant to prevent. Keep the early child_should_disable_xet path stdlib-only or move the zoo import until after the worker setup.
Useful? React with 👍 / 👎.
| "*.h5", | ||
| "*.msgpack", |
There was a problem hiding this comment.
Do not skip requested TF or Flax weights
When callers pass from_tf=True or from_flax=True, those kwargs are forwarded to the later Transformers from_pretrained call, which then needs tf_model.h5 or flax_model.msgpack from the repo. The new warmup ignores those files unconditionally, so it can report success while leaving the actual requested weights uncached; the following in-process load then starts the Xet download that this fallback was supposed to make killable. Make these ignores conditional on the load format, or pass enough context into the prefetch helper to avoid dropping requested weights.
Useful? React with 👍 / 👎.
| maybe_prefetch_hf_snapshot( | ||
| model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), |
There was a problem hiding this comment.
Move direct Llama warmup after config validation
For direct FastLlamaModel.from_pretrained(...) calls, this now snapshots the whole remote repo before the AutoConfig/architecture checks below run. If the repo is unsupported by the installed Transformers version or is not a Llama-compatible model, the old path failed after fetching only config metadata, but this path can download multi-GB weights first and then raise the same validation error; move the warmup until after the config/model-class validation succeeds.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f6e137c6f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ), | ||
| daemon = True, | ||
| try: | ||
| from unsloth_zoo.hf_xet_fallback import ( |
There was a problem hiding this comment.
Bump unsloth_zoo before importing shared fallback
The package metadata still allows unsloth_zoo>=2026.6.7 (unchanged in pyproject.toml), but this new top-level import now requires the newer unsloth_zoo.hf_xet_fallback module. In an environment that satisfies the declared dependency with the older zoo package, importing Studio's utils.hf_xet_fallback raises during startup (for example through the orchestrator/llama_cpp imports), whereas the previous local implementation worked; please bump/pin the required zoo version or keep a compatibility fallback.
Useful? React with 👍 / 👎.
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), | ||
| local_files_only = local_files_only, | ||
| fast_inference = fast_inference, |
There was a problem hiding this comment.
Warm adapters despite fast_inference
When loading a PEFT adapter with fast_inference=True, this passes the flag into maybe_prefetch_hf_snapshot, whose fast-inference guard returns immediately because vLLM handles base-model downloads. The next call still invokes PeftModel.from_pretrained for the adapter in-process, so vLLM/LoRA loads get no killable predownload and can still hang on a stalled Xet adapter transfer; pass False for adapter prewarming or make the skip apply only to base-model loads.
Useful? React with 👍 / 👎.
| old_model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), |
There was a problem hiding this comment.
Warm adapters in the cache PEFT will read
When callers provide a custom cache_dir, this predownloads the adapter into that cache, but the PeftModel.from_pretrained call below is made without the same cache_dir (and without forwarding **kwargs). PEFT therefore looks in the default Hub cache and can still perform the in-process Xet download, while the new prewarm only adds an extra download; pass the same cache directory through to PEFT or avoid warming a different cache.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f6e137c6f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ), | ||
| daemon = True, | ||
| try: | ||
| from unsloth_zoo.hf_xet_fallback import ( |
There was a problem hiding this comment.
Bump unsloth_zoo minimum for the Studio shim
In environments that upgrade unsloth without also upgrading unsloth_zoo, this import now runs at Studio startup and raises if the installed unsloth_zoo still satisfies the current package floor but predates the new unsloth_zoo.hf_xet_fallback module. The main prefetch path gracefully no-ops when the helper is missing, but Studio becomes unstartable under an allowed dependency set; please bump the unsloth_zoo requirement to the first version containing this helper or keep a local fallback.
Useful? React with 👍 / 👎.
| old_model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_dir"), |
There was a problem hiding this comment.
Warm the same PEFT cache that the adapter load reads
When a PEFT load is called with cache_dir=..., this prefetch writes the adapter snapshot to that custom cache, but the following PeftModel.from_pretrained(...) call does not pass cache_dir (same pattern in FastModel below), so PEFT still resolves the adapter from the default HF cache and may perform the Xet download in-process. In that context the fallback is ineffective and the adapter can be downloaded twice; pass the same cache settings into PEFT or prefetch the cache it actually uses.
Useful? React with 👍 / 👎.
| ) | ||
| ] | ||
| try: | ||
| snapshot_download_with_xet_fallback( |
There was a problem hiding this comment.
Restrict snapshot prefetch to the files Transformers will load
For repos that publish duplicate weight formats/variants (for example both model.safetensors and pytorch_model.bin shards), this whole-snapshot call downloads every non-ignored file before Transformers has selected the actual checkpoint. A normal from_pretrained would only fetch the chosen weights, so these repos can now spend tens of GB of extra network/disk and fail with disk-full before the real load; derive allow patterns from the selected config/index or otherwise skip unused weight sets.
Useful? React with 👍 / 👎.
force_download=True previously bypassed the killable prefetch: the warm-cache short-circuit no-oped, then Transformers re-fetched in-process over the Xet path this avoids. Thread force_download through maybe_prefetch_hf_snapshot and every call site so a forced reload also runs in the killable child.
…w 3) Studio's shim hard-imported unsloth_zoo.hf_xet_fallback at startup, so a Studio install whose unsloth_zoo predates the shared helper crashed at boot. Instead of raising, fall back to thin stubs: plain HF snapshot/file downloads with the no-progress watchdog disabled (the same best-effort posture core Unsloth uses in from_pretrained), a DownloadStallError mirror, and no-op watchdog/state helpers. Studio boots and downloads work; automatic Xet -> HTTP recovery returns as soon as unsloth_zoo is upgraded. These are thin stubs, not a second copy of the orchestration. Verified against the installed older unsloth_zoo and added a degraded-path test (5 passed).
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed16544cf8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| revision = revision, | ||
| cache_dir = cache_dir, | ||
| ignore_patterns = ignore_patterns, | ||
| force_download = force_download, |
There was a problem hiding this comment.
Skip prefetches when force_download is requested
When callers pass force_download=True, this warmup also forces a download, but the original force_download kwarg is still forwarded to the later Transformers/PEFT from_pretrained call, so that load bypasses the warmed cache and starts a second in-process Hub transfer. In that scenario the new fallback does not protect the actual load from a stalled Xet download and also doubles network/disk work; skip the prefetch for forced downloads or consume/clear the flag before the final load.
Useful? React with 👍 / 👎.
| ) | ||
| ] | ||
| try: | ||
| snapshot_download_with_xet_fallback( |
There was a problem hiding this comment.
Forward proxy settings into the prefetch
In environments that require callers to pass proxies through from_pretrained(..., proxies=...), the later Transformers load would use that proxy, but this prefetch path calls the Hub snapshot helper without any of the caller's proxy settings. If the direct connection stalls rather than failing immediately, DownloadStallError is raised here before the proxied normal load can run, so proxy-only users can fail on the new warmup despite having supplied valid download settings; forward the relevant Hub kwargs or skip the prefetch when they are present.
Useful? React with 👍 / 👎.
| # to HTTP on a no-progress stall, so the in-process load below is a cache | ||
| # hit and cannot hang on a stalled Xet transfer. | ||
| maybe_prefetch_hf_snapshot( | ||
| model_name, |
There was a problem hiding this comment.
Warm separate tokenizer repositories
When tokenizer_name is set to a different remote repo, this warmup only snapshots model_name, but the tokenizer is loaded later via load_correct_tokenizer(tokenizer_name=...) in the same process. In that configuration the model weights are protected by the Xet fallback, while the tokenizer repo can still start an in-process Hub/Xet download and hang; prefetch the tokenizer repo as well when it is remote and distinct from the model repo.
Useful? React with 👍 / 👎.
…raded shim (review 4) - maybe_prefetch_hf_snapshot now returns whether it warmed the cache, and every call site clears force_download after a successful warm. force_download=True previously warmed the cache in the killable child and then re-forced the same download in-process over Xet, defeating the protection. - Add the prefetch guard to the diffusion FastModel path (diffusion.py), which loaded weights via model_cls.from_pretrained without the Xet->HTTP recovery the vision/language paths get. - Degraded Studio shim (older unsloth_zoo) now honors cancel_event: it raises before starting and after finishing the plain HF download instead of dropping cancellation via **kwargs. - Prewarm skips redundant *.bin when the repo also ships safetensors (best-effort model_info), so a dual-format repo is not double-downloaded before Transformers picks safetensors. Threaded use_safetensors through the call sites. - Studio shim test imports the shared helper optionally so the degraded-path test runs on an unsloth_zoo without it; added a degraded cancellation test.
…nsloth into hf-xet-fallback-wire # Conflicts: # studio/backend/utils/hf_xet_fallback.py
…alidation Two paths let a real in-process weight download run without the Xet stall fallback, or ran the prefetch before it could fail locally: - A num_labels classification load takes the AutoModelForSequenceClassification branch (an in-process download) even under fast_inference=True, but the llama prefetch skipped the warm whenever fast_inference was set, so those weights fetched over un-killable Xet. Gate the skip on num_labels is None, so only a real vLLM-owned load bypasses the warm. - The FastBaseModel (vision / FastModel) prefetch ran before the mutually- exclusive load_in_4bit/8bit/16bit validation, so an invalid combination downloaded a multi-GB snapshot before failing. SentenceTransformer already moved that check ahead of its prefetch; the vision loader cannot, because check_and_disable_bitsandbytes_loading only resolves the effective flags after the config fetch (moving the check earlier would false-reject a valid load of an already-quantized model). Move the base + tokenizer prefetch to run after the validation instead, so an invalid call fails without downloading. The moved block is unchanged and still precedes the in-process load and the vLLM fast_inference_setup remap. Adds static AST regression tests asserting the vision prefetch follows the load-mode validation and the llama prefetch's fast_inference skip is gated on num_labels.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a89ec0da0d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_folder") or os.environ.get("SENTENCE_TRANSFORMERS_HOME"), |
There was a problem hiding this comment.
Use the FastModel cache for ST fallback prefetch
When FastSentenceTransformer falls through to the FastModel.from_pretrained path below, that weight load does not consume SENTENCE_TRANSFORMERS_HOME or cache_folder as a Hugging Face cache_dir (it is invoked later without a cache_dir argument). With cache_folder/SENTENCE_TRANSFORMERS_HOME set, this prefetch warms the SentenceTransformers cache, but the actual FastModel weight load uses the default HF cache, so it misses the warm snapshot and can still perform the unprotected in-process Xet download the change is meant to avoid.
Useful? React with 👍 / 👎.
| except Exception: | ||
| return False # older unsloth_zoo without the helper: load normally |
There was a problem hiding this comment.
Require a zoo build that contains the fallback helper
In environments that satisfy the current declared dependency (unsloth_zoo>=2026.6.7) but predate the new unsloth_zoo.hf_xet_fallback module, this broad import fallback silently returns False, so every new maybe_prefetch_hf_snapshot call becomes a no-op and the protected Xet→HTTP pre-download never runs. Unless the package minimum is bumped to the first zoo release containing this helper, normal installs can keep hitting the same unkillable in-process Xet downloads this change is meant to avoid.
Useful? React with 👍 / 👎.
The custom SentenceTransformer fallback path prefetches into the cache the ST load uses (cache_folder, else SENTENCE_TRANSFORMERS_HOME), and the module-file loads were already resolved to that same cache. But the model WEIGHTS load runs through FastModel.from_pretrained, which resolves its cache from the HF cache_dir and never reads cache_folder / SENTENCE_TRANSFORMERS_HOME. With either set, the weights loaded from the default HF cache and missed the warm, so they still ran the unprotected in-process Xet download the prefetch is meant to avoid. Resolve the same cache into kwargs["cache_dir"] before the FastModel load (when a custom ST cache is in play and the caller passed no explicit HF cache_dir, which wins), so the weights hit the warm. A None resolution leaves the default HF cache, matching the prefetch. Adds a static AST regression test asserting from_pretrained resolves the ST cache into kwargs["cache_dir"] before the FastModel.from_pretrained call.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f5f8da7b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| model_name, | ||
| token = token, | ||
| revision = revision, | ||
| cache_dir = kwargs.get("cache_folder") or os.environ.get("SENTENCE_TRANSFORMERS_HOME"), |
There was a problem hiding this comment.
Honor explicit HF cache_dir for ST prefetches
When FastSentenceTransformer.from_pretrained(..., cache_dir=...) falls through to the FastModel path, this prefetch warms cache_folder/SENTENCE_TRANSFORMERS_HOME (or the default cache) and ignores the explicit HF cache_dir. Later the fallback weight load preserves and forwards kwargs["cache_dir"], so the warmed snapshot is missed and the weights can still be downloaded in-process over Xet from the custom cache. Include kwargs.get("cache_dir") in this cache resolution so the warm and load target the same cache.
Useful? React with 👍 / 👎.
| _filename_has_variant(sibling.rfilename, variant) | ||
| if variant | ||
| else _is_canonical_model_weight_safetensors(sibling.rfilename) |
There was a problem hiding this comment.
Keep variant .bin unless the safetensors name is loadable
For variant loads, this treats any in-scope safetensors whose basename contains .<variant>. or .<variant>- as proving the .bin weights are redundant. Transformers variant loading looks for the canonical model.<variant>.safetensors/index/shard names, so a repo with an unrelated sidecar such as consolidated.fp16.safetensors plus the actual pytorch_model.fp16.bin will add *.bin to ignore_patterns; the prefetch then skips the only loadable weights and from_pretrained falls back to an unprotected in-process Xet download.
Useful? React with 👍 / 👎.
| _warm_tokenizer_repo = ( | ||
| isinstance(_tokenizer_repo, str) | ||
| and bool(_tokenizer_repo) | ||
| and _tokenizer_repo != model_name | ||
| ) |
There was a problem hiding this comment.
Warm the vLLM path tokenizer after remapping
When FastVisionModel runs with fast_inference=True and no separate tokenizer_name, _vllm_owns_weights skips the base snapshot warm, but this condition also skips the tokenizer warm because _tokenizer_repo == model_name. Later fast_inference_setup may remap model_name, then the code sets tokenizer_name = model_name and calls AutoProcessor/AutoTokenizer in-process, so processor/tokenizer files can still fetch over unprotected Xet. Warm the final tokenizer repo after remap (or include fast_inference in this condition) so the cache hit is guaranteed.
Useful? React with 👍 / 👎.
| use_safetensors = kwargs.get("use_safetensors"), | ||
| # Diffusion variants (variant="fp16") are common: forward it so the warm never drops a | ||
| # variant .bin for a non-variant safetensors. | ||
| variant = kwargs.get("variant"), |
There was a problem hiding this comment.
Forward the diffusion variant to the real load
When FastDiffusionModel.from_pretrained(..., variant="fp16") is used, the prefetch warms the variant-specific files, but load_kwargs never includes variant, so model_cls.from_pretrained still asks Transformers for the default weight variant. That misses the warmed cache (and can load the wrong precision or fetch the default weights in-process over Xet), so the same variant value needs to be forwarded to the actual load.
Useful? React with 👍 / 👎.
…ess Xet fetch
Four cases where the prewarm and the real load targeted different files or
different caches, so the load still ran an unprotected in-process Xet download:
- ST prefetch ignored an explicit HF cache_dir. FastSentenceTransformer resolves
the prefetch (and fallback module-load) cache from cache_folder /
SENTENCE_TRANSFORMERS_HOME, but the FastModel fallback weight load forwards
kwargs["cache_dir"]. A caller passing cache_dir warmed one cache and read
another, missing the warm. The three ST cache resolutions now honor
kwargs.get("cache_dir") first.
- Variant .bin dropped for a non-canonical safetensors sidecar. For a variant
load the prewarm dropped *.bin as soon as ANY variant-infixed safetensors was
shipped, so a repo with consolidated.fp16.safetensors (a sidecar the load does
not read) plus the real pytorch_model.fp16.bin had its only loadable weights
dropped from the warm. The check now requires a CANONICAL model variant name
(model.<variant>.safetensors, either shard infix, or the variant index) via
_is_canonical_variant_model_weight_safetensors, matching the strictness already
used for the no-variant path.
- vLLM path never warmed the tokenizer. With fast_inference and no separate
tokenizer_name, the base warm is skipped (vLLM owns the weights) and the
tokenizer warm was deferred because fast_inference_setup may remap model_name.
The final tokenizer repo was then loaded in-process (AutoProcessor /
AutoTokenizer) with no warm. Warm the final tokenizer repo (tokenizer-only)
after the remap on the vLLM path.
- Diffusion variant not forwarded to the load. FastDiffusionModel forwarded
variant to the prefetch but not to model_cls.from_pretrained, so the pipeline
asked for the default weight variant -- missing the warmed variant weights
(wrong precision, or a default weight a variant-only repo may not ship, fetched
in-process). load_kwargs now forwards variant.
Adds regression tests for each (a non-canonical sidecar keeps its .bin, the ST
cache resolutions honor cache_dir, the vLLM tokenizer warm runs after the remap,
and the diffusion load forwards variant).
|
All four are real and fixed in e883f71:
Regression tests added for each (35 -> 42 in the prefetch-scope suite). |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
2 similar comments
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08ba451756
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cache_dir = kwargs.get("cache_dir") | ||
| or kwargs.get("cache_folder") | ||
| or os.environ.get("SENTENCE_TRANSFORMERS_HOME"), |
There was a problem hiding this comment.
Align ST prefetch cache with native loads
When the caller passes cache_dir but not cache_folder, this prefetch warms that explicit HF cache, but the native SentenceTransformer(...) paths do not read it: the inference branch only copies cache_folder into st_kwargs, and the fast-encoder branch passes cache_folder = kwargs.get("cache_folder"). In those contexts the actual load falls back to the default/ST cache, missing the warmed snapshot, so large repos can be downloaded twice and the real weight fetch still happens in-process without the Xet fallback this change is meant to provide. Either don't prefer cache_dir for native ST loads or translate it into the cache_folder passed to SentenceTransformer.
Useful? React with 👍 / 👎.
| # Resolve the cache the load uses: an explicit HF cache_dir wins (the FastModel fallback load | ||
| # forwards it), else cache_folder, else SENTENCE_TRANSFORMERS_HOME (which ST honors when | ||
| # cache_folder is unset), else the default HF cache -- a wrong-cache warm would be missed. | ||
| maybe_prefetch_hf_snapshot( |
There was a problem hiding this comment.
Avoid whole-snapshot warm for root-only ST fallback
When FastSentenceTransformer is used on a plain encoder repo with no modules.json, the code later falls back to FastModel.from_pretrained, which reads only root weights. This unconditional prefetch leaves weights_at_root at its default False, so _prefetch_ignore_patterns treats the repo as multi-component: mixed safetensors/bin root repos keep both formats and arbitrary subdir weights are not excluded. In that fallback path this can pull multi-GB weights the actual load never reads before the real load starts; determine modules.json first or pass weights_at_root=True for the no-modules case.
Useful? React with 👍 / 👎.
| # Pre-download the confirmed diffusion repo (Xet -> HTTP on a stall) so the weight load is a cache | ||
| # hit. subfolder is NOT forwarded: the pipeline loads the whole repo root (every component | ||
| # subfolder), so narrowing to one would leave unet/, vae/, text_encoder/ to in-process Xet. | ||
| maybe_prefetch_hf_snapshot( |
There was a problem hiding this comment.
Mark diffusion prefetch as root-weight load
FastDiffusionModel resolves a Transformers model class and later calls model_cls.from_pretrained(...) on the repo root, not a diffusers-style multi-component pipeline. Leaving this warm at the default weights_at_root=False sends _prefetch_ignore_patterns down the whole-snapshot path, so even use_safetensors=True or safetensors-preferred mixed repos keep redundant .bin weights and nested weight dirs eligible for download. For mixed-format diffusion checkpoints this can pull a second multi-GB weight set that the real load never reads; pass weights_at_root=True for this prefetch.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08ba451756
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cache_dir = kwargs.get("cache_dir") | ||
| or kwargs.get("cache_folder") | ||
| or os.environ.get("SENTENCE_TRANSFORMERS_HOME"), |
There was a problem hiding this comment.
Warm the cache SentenceTransformer will read
When a caller passes cache_dir to FastSentenceTransformer.from_pretrained and the model takes the for_inference or fast-encoder SentenceTransformer(...) branch, this prefetch writes the snapshot into that HF cache, but the later native SentenceTransformer loads only forward cache_folder (or nothing), not cache_dir. In that scenario the real load misses the warmed snapshot and can still perform an in-process Hub/Xet download, defeating the stall fallback and duplicating the transfer; either map cache_dir to cache_folder for those branches or avoid preferring it for this prefetch.
Useful? React with 👍 / 👎.
…ansformer loads
The ST prefetch warms the cache resolved as cache_dir -> cache_folder ->
SENTENCE_TRANSFORMERS_HOME (an explicit HF cache_dir wins). But the for_inference and
fast-encoder branches then construct a native SentenceTransformer, which takes
cache_folder, not cache_dir: for_inference copied only a caller-supplied cache_folder
into st_kwargs, and the fast-encoder passed cache_folder = kwargs.get("cache_folder").
So a caller passing cache_dir warmed one cache and the native load read another
(cache_folder / SENTENCE_TRANSFORMERS_HOME / the default HF cache), missing the warm and
starting an unprotected in-process Hub/Xet download that defeats the stall fallback and
duplicates the transfer.
Both native loads now set cache_folder = kwargs.get("cache_dir") or
kwargs.get("cache_folder"): the explicit HF cache_dir wins, else the caller's
cache_folder, else None so ST honors SENTENCE_TRANSFORMERS_HOME -- matching the
prefetch's resolution so the native load hits the warmed cache. The FastModel fallback
branch already resolves cache_dir (unchanged). Adds a static regression guard.
for more information, see https://pre-commit.ci
|
Real and fixed in 6d4ce26.
|
|
@codex review |
|
/gemini review |
|
Codex Review: Didn't find any major issues. Keep them coming! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
|
/gemini review |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
Condense PR-added comments/docstrings across the loader prefetch wiring (_utils.py, vision/llama/loader/diffusion/sentence_transformer.py, tokenizer_utils.py), the prefetch-scope tests, and the Studio Xet->HTTP shim (backend utils + tests, model-selector tsx). Comments/docstrings/JSDoc only, no code changed (AST + TS-compiler verified). Preserves the prefetch-placement and Studio marker-injection rationale tersely. Prefetch-scope tests 43 passed; Studio backend tests 18 passed.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
From a consolidated review pass (12x reviewer.py personas + 3 forks). The SentenceTransformer prefetch now forwards force_download and clears it after a successful warm, matching the llama / vision / diffusion sites. Before, a force_download=True ST load warmed without forcing, so a stale cache could be reused or the refresh could fall to an unguarded in-process Hub/Xet download. Generalises the ST prefetch-wiring AST test to accept the return-captured call. Prefetch suite 43 passed. Not taken: the llama / vision base-load prefetch omitting revision is correct -- revision is a named from_pretrained parameter that never reaches the in-process AutoModel load (which resolves model_name on its default branch), so the warm matches the load; forwarding it would warm the wrong branch. The degraded Studio shim dropping local_files_only is not production-reachable (its GGUF callers never pass it, and the public snapshot wrapper is test-only).
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Summary
Hugging Face Xet (
hf_xet) is the fast default download transport, but a Xet transfer can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Unsloth Studio already recovers from this; the shared logic now lives inunsloth_zoo.hf_xet_fallback(see unslothai/unsloth-zoo#829). This PR puts it to use in two places.Depends on unslothai/unsloth-zoo#829 (the
unsloth_zoo.hf_xet_fallback/unsloth_zoo.hf_cache_statehelper). Merge that first; until it lands, the unsloth-main change is a no-op (the helper import fails and the load proceeds normally).unsloth main
FastModel/FastVisionModel/FastLanguageModel.from_pretrainednow warm the repo in a killable subprocess before the in-process weight load. A newmaybe_prefetch_hf_snapshot(inmodels/_utils.py) callsunsloth_zoo.hf_xet_fallback.snapshot_download_with_xet_fallback, which keeps Xet primary and, on a no-progress stall, kills the child, makes the partial safe for HTTP, and respawns once withHF_HUB_DISABLE_XET=1. Because the snapshot is then fully cached, the existingfrom_pretrainedis a cache hit and cannot hang on Xet.It is best-effort and guarded:
local_files_only, and offline (HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE) are skipped;fast_inferencepath is skipped (it has its own downloader);unsloth_zoowithout the helper is a no-op (the load proceeds unchanged);local_files_onlywith no subprocess;DownloadStallErroris raised (a clear network error instead of a silent hang); any other hiccup falls through to the normal load.The call sits inside the existing
HF_HUB_ENABLE_HF_TRANSFERwindow invision.pyandllama.py.Studio
studio/backend/utils/hf_xet_fallback.pyis now a thin shim overunsloth_zoo.hf_xet_fallback. It re-exports the shared API (start_watchdog,get_hf_download_state,child_should_disable_xet,hf_hub_download_with_xet_fallback,snapshot_download_with_xet_fallback,DownloadStallError) and injects Studio's marker-awareprepare_cache_for_transporton the HTTP retry, so the download manager keeps its.transportmarker semantics. The local copy of the watchdog + spawn-child download + Xet to HTTP retry is removed (the module drops ~450 lines). Call sites (core/inference/llama_cpp.py,core/training/worker.py) and the orchestrator'sDownloadStallErrorimport are unchanged.The hub's
hf_cache_state.pykeeps its stdlib-only primitives (the hub stays import-light: its unit tests stub pydantic/fastapi and must not pull torch/transformers). Only the substantial fallback orchestration is shared.Tests
studio/backend/tests/test_hf_xet_fallback.pyis slimmed to the shim-specific behavior (re-export surface, and that the shim injects Studio'sprepare_cache_for_transporton the HTTP retry); the full watchdog/transport matrix is tested inunsloth_zoo.test_gguf_xet_fallback_integration.py,test_training_xet_fallback.py) are unchanged and pass via the shim.hub/tests: 109 passed, ~0.5s).End-to-end (against unsloth_zoo#829): forcing the Xet attempt to stall on a tiny real repo fires the watchdog after the timeout, kills the child, and completes the download over HTTP, returning a valid snapshot.