Skip to content

Add shared Xet to HTTP stall fallback (hf_xet_fallback, hf_cache_state)#829

Open
danielhanchen wants to merge 82 commits into
unslothai:mainfrom
danielhanchen:hf-xet-fallback-shared
Open

Add shared Xet to HTTP stall fallback (hf_xet_fallback, hf_cache_state)#829
danielhanchen wants to merge 82 commits into
unslothai:mainfrom
danielhanchen:hf-xet-fallback-shared

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Hugging Face Xet (hf_xet) is the fast default transport, but a Xet transfer can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Unsloth Studio already recovers from this (a no-progress watchdog respawns the download over plain HTTP with HF_HUB_DISABLE_XET=1). This PR lands the same logic in unsloth_zoo so it has a single home that both Unsloth main and Studio can share, rather than living only inside Studio.

This is the foundation PR. A follow-up in the unsloth repo wires FastModel.from_pretrained to it and re-points Studio's copy to import from here.

What this adds

  • unsloth_zoo/hf_cache_state.py: the sparse-aware HF cache primitives. On-disk byte accounting prefers st_blocks so a partially written Xet / hf_transfer .incomplete is not mistaken for full-size progress, plus active .incomplete detection over the current HF_HUB_CACHE root.
  • unsloth_zoo/hf_xet_fallback.py:
    • start_watchdog: a 180s no-progress watchdog that fires only while a .incomplete is present and the on-disk size is unchanged, and resets when none is present (so model init is never misread as a stall).
    • hf_hub_download_with_xet_fallback: a single-file download. A cached file short-circuits with no child; otherwise the download runs in a spawn child (its own process group) supervised by the watchdog. On a stall it kills the child group, makes the partial safe for HTTP, and respawns once with HF_HUB_DISABLE_XET=1. Cancellation and deterministic errors (auth / missing / disk full) propagate without a fallback.
    • snapshot_download_with_xet_fallback: the whole-repo variant. It warms a repo in a killable child (the entrypoint Unsloth from_pretrained will use before its in-process load), with an in-process local_files_only fast path so a warm cache spawns nothing.
    • An UNSLOTH_DISABLE_XET knob (also honors UNSLOTH_STABLE_DOWNLOADS and HF_HUB_DISABLE_XET) to force HTTP from the first attempt, plus injectable prepare_for_http_fn / scrub hooks so Studio can pass its marker-aware cache management.

Both new files are AGPL-3.0 (per-file SPDX), matching the Studio source they are lifted from; the rest of unsloth_zoo stays LGPL-3.0.

Tests

tests/test_hf_xet_fallback.py: CPU-only, no network, no real subprocess (the per-attempt download seam is monkeypatched). Covers the watchdog state machine (constant vs growing vs absent .incomplete, sparse-aware bytes, fires at most once), the transport decision (cached short-circuit, cancel and deterministic errors take no fallback, exactly one Xet to HTTP retry, per-file independence), the injected prepare seam, the UNSLOTH_DISABLE_XET knob, the snapshot fast path and stall fallback, and the HF_HUB_DISABLE_XET precondition in a fresh interpreter.

Result: 22 passed, 1 skipped (the sparse-blocks assertion skips on filesystems that report st_blocks == 0).

Hugging Face Xet downloads can hang on a blob with no progress and no
exception, and a blocked native Xet thread cannot be killed in-process.
Unsloth Studio already recovers from this; this lands the same logic in
unsloth_zoo so both Unsloth main and Studio can share one implementation.

- hf_cache_state.py: sparse-aware HF cache primitives (st_blocks based byte
  accounting, active .incomplete detection) over the active HF_HUB_CACHE.
- hf_xet_fallback.py: a 180s no-progress watchdog plus a spawn-child download
  that keeps Xet primary and falls back to plain HTTP exactly once on a stall.
  hf_hub_download_with_xet_fallback handles a single file; the new
  snapshot_download_with_xet_fallback warms a whole repo in a killable child
  (the entrypoint Unsloth from_pretrained uses before its in-process load),
  with an in-process local_files_only fast path for a warm cache. Honors an
  UNSLOTH_DISABLE_XET knob and injectable prepare-for-http / scrub hooks so
  Studio can pass its marker-aware cache management.

Both files are AGPL-3.0 (per-file), matching the Studio source.

Tests: CPU-only, no network, no real subprocess (the download seam is faked).
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces sparse-aware Hugging Face cache primitives and a robust fallback mechanism from Xet to HTTP downloads on no-progress stalls, accompanied by a comprehensive test suite. The feedback focuses on enhancing the reliability and robustness of the implementation. Key recommendations include removing a redundant import check, catching broader OSError exceptions during file cleanup to handle permission issues, improving Windows path detection, resetting the watchdog timer during unmeasurable states to prevent false stalls, and adding a small timeout when draining the process queue to avoid race conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread unsloth_zoo/hf_cache_state.py Outdated
Comment on lines +63 to +66
try:
from huggingface_hub import constants as hf_constants
except ImportError:
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The try...except ImportError block for huggingface_hub is redundant because huggingface_hub is a required dependency of the project. Importing it directly is cleaner and adheres to the general rules.

    from huggingface_hub import constants as hf_constants
References
  1. Avoid using try...except ImportError for libraries that are required dependencies of the project, as the check is redundant.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +124 to +127
try:
blob.unlink()
except FileNotFoundError:
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If blob.unlink() raises a PermissionError (which is common on Windows or when files are locked), the outer except Exception block will catch it and abort the entire cleanup loop. This leaves other incomplete files untouched. Catching OSError (which covers both FileNotFoundError and PermissionError) inside the loop ensures maximum cleanup effort.

Suggested change
try:
blob.unlink()
except FileNotFoundError:
continue
try:
blob.unlink()
except OSError:
continue

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +149 to +150
if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id:
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Windows absolute paths using forward slashes (e.g., C:/path) or containing drive letters won't be skipped by the current check, potentially leading to unnecessary cache directory lookups. Checking for : (which is never present in valid HF repo IDs) is a robust way to skip Windows absolute paths.

Suggested change
if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id:
continue
if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id or ":" in repo_id:
continue

Comment on lines +199 to +202
if state is None:
if on_heartbeat is not None:
on_heartbeat(f"Downloading ({transport} transport)...")
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If get_hf_download_state returns None due to transient issues, last_change is not updated. If this state persists longer than stall_timeout, the watchdog will immediately trigger a false stall as soon as the state becomes measurable again. Resetting last_change = now during unmeasurable periods avoids this.

Suggested change
if state is None:
if on_heartbeat is not None:
on_heartbeat(f"Downloading ({transport} transport)...")
continue
if state is None:
last_change = now
if on_heartbeat is not None:
on_heartbeat(f"Downloading ({transport} transport)...")
continue

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +426 to +429
result = result_queue.get_nowait()
except queue.Empty:
result = None
finally:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Under heavy CPU/disk load, there can be a slight delay between the child process exiting and the OS flushing the queue's pipe. get_nowait() will fail instantly if the pipe is temporarily empty, whereas a small timeout (e.g., 1.0s) provides robustness against scheduling delays.

Suggested change
result = result_queue.get_nowait()
except queue.Empty:
result = None
finally:
try:
result = result_queue.get(timeout = 1.0)
except queue.Empty:
result = None

Comment thread tests/test_hf_xet_fallback.py Outdated

def test_get_state_skips_local_paths(hf_cache):
# Filesystem paths are not HF repo IDs and must be ignored without error.
assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x"]) == (0, False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the test assertion to verify that Windows absolute paths using forward slashes (e.g., c:/x) are also correctly skipped.

Suggested change
assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x"]) == (0, False)
assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x", "c:/x"]) == (0, False)

…d sys import

- The no-progress watchdog and the HTTP-retry cache purge now honor a caller
  supplied snapshot cache_dir, not just HF_HUB_CACHE, so a stall under a custom
  cache is detected and the partial is purged before the HTTP fallback. Threaded
  through hf_cache_root / iter_active_repo_cache_dirs / has_active_incomplete_blobs
  (hf_cache_state) and get_hf_download_state / start_watchdog / _default_prepare_for_http
  / _run_download_attempt (hf_xet_fallback).
- Remove the unused 'import sys' (ruff F401).
- Add a regression test that a stall under a custom cache_dir is watched and cleaned.
… on file path

- Critical: the HTTP retry could still use Xet. The spawn child re-imports the
  (heavy) unsloth_zoo package -- importing huggingface_hub, which reads
  HF_HUB_DISABLE_XET into a module constant at import time -- BEFORE the child body
  ran, so a child-side os.environ assignment landed too late. Now set the transport
  env in the parent (under a lock) around proc.start() so the child inherits it from
  creation; the child still sets it defensively. Tests assert the child inherits
  HF_HUB_DISABLE_XET=1 on the HTTP retry and is left untouched on the Xet attempt.
- hf_hub_download_with_xet_fallback now accepts cache_dir (symmetric with the
  snapshot variant); threaded into try_to_load_from_cache and the child download.
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8af1b5debd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


logger = logging.getLogger(__name__)

_CTX = mp.get_context("spawn")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid forcing spawn for library downloads

When this helper is called from the normal unguarded top-level model-loading scripts that FastModel.from_pretrained will use, the spawn context re-imports the user's __main__ module in the child before _download_child_entry runs. That re-executes the same top-level download/model load and typically hits multiprocessing's bootstrap RuntimeError or exits without putting a queue result, so the wrapper reports a download error instead of downloading. Use an entry point that does not import the caller's main module, or avoid forced spawn where safe.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +503 to +504
# The generic purge is cache_dir-aware; an injected (Studio) hook owns
# its own cache accounting and keeps the (repo_type, repo_id) signature.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort when HTTP prep cannot remove unsafe partials

If _default_prepare_for_http or an injected prep hook fails after a Xet stall, this handler only logs at debug level and still starts the HTTP retry. In the locked-file/permission-error cases where the .incomplete sparse partial is still present, the retry can resume over exactly the unsafe partial the preceding comment says can silently corrupt the blob, leaving a bad model in cache instead of failing safely.

Useful? React with 👍 / 👎.

Comment on lines +191 to +193
return (blobs_dir.is_dir() and _blob_dir_is_partial(blobs_dir)) or (
_repo_dir_has_broken_snapshot_symlinks(repo_dir)
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't treat stale snapshot gaps as active downloads

In a cache that already has a broken snapshot symlink from a previous interrupted or evicted download but no *.incomplete file, this still reports has_incomplete=True. The watchdog is supposed to reset while no active partial exists, but this state starts the stall timer before the new attempt has written anything and can kill/retry otherwise healthy downloads that spend stall_timeout on metadata or lock waits.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +127 to +130
for blob in blobs_dir.iterdir():
if blob.is_file() and blob.name.endswith(INCOMPLETE_SUFFIX):
try:
blob.unlink()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit HTTP cleanup to this stalled attempt

When two files from the same repo are downloading concurrently, a stall in one wrapper call enters the HTTP retry and this loop unlinks every *.incomplete blob in the repo cache, including a sibling file that is still actively making progress. That can make the other download fail or lose its partial even though it did not stall, so the cleanup needs to be scoped or locked to the attempt being retried.

Useful? React with 👍 / 👎.

- prepare_for_http: continue past a locked/permission-denied partial
  (catch OSError, not just FileNotFoundError) so one bad blob does not
  abort cleanup of the rest.
- get_hf_download_state: also skip drive-letter (":") repo ids so Windows
  absolute paths like C:/models are not mistaken for Hub ids.
- watchdog: reset the no-progress timer on an unmeasurable tick so a long
  unmeasurable gap cannot trip a false stall the instant state is readable.
- result drain: get(timeout=1.0) instead of get_nowait() so a child that
  exits microseconds before its queue feeder flushes is not misreported as
  'exited without a result'.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6266d0811d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +656 to +659
return _download_with_xet_fallback(
repo_id = repo_id,
label = repo_id,
kind = "snapshot",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve local model paths before Hub fallback

When this warm-up helper receives a filesystem model path (for example /mnt/model), the local_files_only probe above raises HFValidationError and this fall-through sends the path into the child snapshot_download, which rejects local paths instead of letting a normal from_pretrained('/mnt/model') proceed. Since the docstring says this is the from_pretrained cache warm-up path, guard or return existing local directories before entering the Hub download loop.

Useful? React with 👍 / 👎.

Comment on lines +111 to +112
out = re.sub(r"hf_[A-Za-z0-9]{8,}", "***", out)
out = re.sub(r"([Bb]earer\s+)[A-Za-z0-9._\-]+", r"\1***", out)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Redact signed URL query credentials

When running outside Studio, child exceptions fall back to this default scrubber, but these patterns only cover hf_ tokens and Bearer headers. If an HTTP/Xet exception includes the resolved presigned blob URL (for example a 403/timeout message with ...?X-Amz-Signature=...), the signed query string is returned to the parent and can be logged with the raised RuntimeError, despite the child-entry contract saying signed URLs are not exposed; please strip query strings or known signing parameters here as well.

Useful? React with 👍 / 👎.

cache_dir = cache_dir,
allow_patterns = allow_patterns,
ignore_patterns = ignore_patterns,
local_files_only = True,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve Hub freshness checks for moving revisions

When revision is omitted or is a branch/tag like main, this local_files_only fast path returns any cached snapshot without the normal Hub ref/etag refresh. After a model is updated upstream, snapshot_download_with_xet_fallback can keep serving the stale cached snapshot indefinitely instead of matching snapshot_download/from_pretrained semantics; restrict this no-network shortcut to immutable commit hashes or offline mode, or perform the refresh in the child.

Useful? React with 👍 / 👎.

Comment on lines +424 to +425
os.environ.update(child_env)
proc.start()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid leaking temporary Xet env to sibling downloads

During an HTTP retry, this mutates process-global os.environ to HF_HUB_DISABLE_XET=1 while spawning the child. The lock only serializes writers in this helper; a concurrent download can call xet_force_disabled() or import huggingface_hub during this window and incorrectly force/cache HTTP for its own first attempt, so multi-model loads can lose Xet because a sibling stalled. Use a spawn bootstrap that does not expose the flag process-wide, or guard all reads/imports with the same lock.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_cache_state.py Outdated
target = target_dir_name(repo_type, repo_id)
try:
for entry in root.iterdir():
if entry.name.lower() == target:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match Hub cache dirs case-sensitively

Hugging Face accepts uppercase repo IDs and its cache folder naming preserves case (for example models--Foo--Bar and models--foo--bar are distinct), but this lowercased comparison makes both directories match either repo. In a cache containing repos that differ only by case, the watchdog and HTTP cleanup can count or unlink the other repo's active partials, causing false stalls or cross-repo download disruption; compare against the exact repo_cache_dir_name instead.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +595 to +596
if isinstance(cached, str) and os.path.exists(cached):
return cached

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor cancellation before cached fast paths

If cancel_event is already set and this file is cached, the helper returns the cached path here before reaching _download_with_xet_fallback's cancellation check, contradicting its documented RuntimeError("Cancelled") behavior. In cancellation-driven model loads this can continue into expensive local loading after the caller has already aborted; check the event before the cache short-circuit (and similarly before the snapshot fast path).

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

1 similar comment
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6266d0811d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


logger = logging.getLogger(__name__)

_CTX = mp.get_context("spawn")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid spawn re-importing unguarded training scripts

When this helper is used from FastModel.from_pretrained in a regular Python training script that calls it at module top level, the spawn context re-imports and executes the user's __main__ in the child before _download_child_entry runs. That re-enters the download during multiprocessing bootstrapping and Python aborts with the usual safe-import RuntimeError, so the child exits without warming the cache or reaching the fallback. Please launch a dedicated helper subprocess that only imports this module, or avoid spawn on POSIX for this path.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +131 to +134
except OSError:
# A locked / permission-denied blob (common on Windows)
# must not abort cleanup of the rest of the partials.
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort HTTP fallback when cleanup cannot remove partials

When a stalled .incomplete is locked or permission-denied, this except leaves it in place and reports success to the caller. The fallback then launches HTTP against the same cache even though this function documents that HTTP resume over a sparse Xet/hf_transfer partial can corrupt the blob, so Windows locks or restricted caches can turn a cleanup failure into a corrupted model file instead of a safe failure.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +127 to +130
for blob in blobs_dir.iterdir():
if blob.is_file() and blob.name.endswith(INCOMPLETE_SUFFIX):
try:
blob.unlink()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit cleanup to the stalled attempt

When two files from the same repo are downloaded concurrently and only one child stalls, this repo-wide loop removes every *.incomplete, including a sibling file that another live child is still writing. On Unix that can unlink the active writer's target and on Windows it collides with the locked-file path above, so the single-file fallback is not independent under parallel shard downloads; cleanup needs to target only the stalled attempt's partials or coordinate at repo scope.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +109 to +110
if hf_token:
out = out.replace(hf_token, "***")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle boolean Hugging Face tokens when scrubbing

Hugging Face download APIs accept token=True to use the saved token, and callers commonly pass that through from from_pretrained. If the child then raises a deterministic error, this path calls str.replace with a bool, causing a TypeError inside the exception handler before result_queue.put; the parent then reports that the process exited without a result instead of the real 401/404/network error. Guard the replacement with isinstance(hf_token, str).

Useful? React with 👍 / 👎.

Comment on lines +421 to +425
with _SPAWN_ENV_LOCK:
saved_env = {k: os.environ.get(k) for k in child_env}
try:
os.environ.update(child_env)
proc.start()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep temporary Xet env out of other threads

During an HTTP retry this mutates process-wide os.environ while the child is started, but other threads are not protected by _SPAWN_ENV_LOCK; for example, another concurrent helper can call xet_force_disabled() during this window and incorrectly decide to skip Xet and run HTTP cleanup for an unrelated download. Pass the environment to an isolated helper process or keep the transport decision and spawn under a synchronization scheme that other callers also use.

Useful? React with 👍 / 👎.

…ce_download (review 3)

- spawn from notebook / stdin / python -c / unguarded script: multiprocessing
  'spawn' re-opens __main__.__file__, which is missing or '<stdin>' there, so
  proc.start() raised before the child ran and the download never happened.
  Point __main__ at this importable module just for the spawn (serialized under
  the existing lock) and restore it. Verified from a real <stdin> context: the
  child now runs and returns the expected HF error instead of failing to spawn.
- _default_prepare_for_http only deletes from an exact-case cache dir, or a
  single unambiguous case-insensitive match, so preparing HTTP for Org/Repo no
  longer purges a case-colliding org/repo on a case-sensitive filesystem.
- HTTP prep now also clears broken snapshot symlinks, which the incomplete-state
  detector counts as active; otherwise the HTTP retry inherited stale state and
  re-tripped the watchdog.
- snapshot/file fallbacks accept and forward force_download, skipping the warm
  cache short-circuit so force_download=True re-fetches in the killable child.
- Added regression tests for all four (30 passed, 1 skipped).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2619c825f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".



def repo_cache_dir_name(repo_type: str, repo_id: str) -> str:
return f"{repo_type}s--{repo_id.replace('/', '--')}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize repo_type=None before cache scans

When callers forward Hugging Face's default repo_type=None for a model download, the child hf_hub_download/snapshot_download normalizes it to "model" and writes under models--..., but the watchdog and HTTP-prep cache probes use this formatter first and look for Nones--... instead. In that scenario an actual stalled model download never exposes has_incomplete=True, so the Xet child is not killed and the HTTP fallback this helper is meant to provide does not run; normalize None to "model" before building cache directory names.

Useful? React with 👍 / 👎.

Comment on lines +219 to +222
for blob in blobs_dir.iterdir():
try:
if blob.is_file():
total += blob_bytes_present(blob)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope single-file progress to the watched file

When two hf_hub_download_with_xet_fallback calls download different files from the same repo, this total includes every blob in the repo cache. A stalled child sees a sibling's growing .incomplete as progress and keeps resetting its stall timer, so it will not retry over HTTP after stall_timeout until all other repo downloads stop; large parallel shard downloads can therefore hang on the stuck shard much longer than configured.

Useful? React with 👍 / 👎.

cache_dir = cache_dir,
allow_patterns = allow_patterns,
ignore_patterns = ignore_patterns,
local_files_only = True,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not treat partial snapshots as warm caches

When a previous interrupted download left a snapshot directory for this revision with only some files linked and no active .incomplete, Hugging Face's local_files_only path returns the snapshot folder as soon as it exists and does not verify that every allow_patterns file is present. This fast path then skips the supervised child entirely, so the later model load either fails on missing files or downloads them in-process, defeating the Xet stall fallback for exactly the partial-cache case it is meant to recover.

Useful? React with 👍 / 👎.

…safe (review 4)

- repo_type=None (Hugging Face's default model repo) resolved to a bogus
  Nones--<id> cache dir, so get_hf_download_state/the watchdog missed the real
  models--<id> partial and could let a stalled Xet child run forever. Normalize
  repo_type to 'model' in repo_cache_dir_name and at the public entrypoints.
- The destructive HTTP-prep path was case-collision safe but the read/watchdog
  path still yielded every case-insensitive match, so a stale partial in a
  colliding org/repo could make the watchdog kill an active Org/Repo download.
  Move the exact-or-unambiguous guard into iter_active_repo_cache_dirs so both
  paths share one rule, and drop the now-redundant _destructive_repo_cache_dirs.
- Added regression tests for both (32 passed, 1 skipped).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc0fddb108

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import time
from typing import Any, Callable, Optional

from unsloth_zoo.hf_cache_state import (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the fallback importable without full Unsloth init

This new module is meant to be shared with Studio/standalone callers, but a normal import unsloth_zoo.hf_xet_fallback executes unsloth_zoo/__init__.py before this submodule can load; that pre-existing init raises at unsloth_zoo/__init__.py:200-202 when the full unsloth package is not installed and also performs heavy Torch/GPU setup otherwise. The tests only avoid this by loading the file via spec_from_file_location, so CPU/Studio workers that import the documented package path can fail before any fallback code runs; expose this through a lightweight import path or make the package init skip heavy checks for these cache/download submodules.

Useful? React with 👍 / 👎.

Comment on lines +403 to +405
if proc.is_alive():
_signal_group(getattr(signal, "SIGKILL", signal.SIGTERM))
proc.join(timeout = 5.0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Kill the group even after the wrapper exits

When the Xet transfer has spawned a helper in the child's process group and the Python wrapper exits on SIGTERM before that helper does, this guard skips the SIGKILL pass because proc.is_alive() is already false. The surviving helper can keep the stalled writer alive while the parent starts HTTP cleanup/retry, so the fallback should still send the post-grace SIGKILL to the process group (or otherwise reap the group) even if the leader has exited.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
return

if on_heartbeat is not None:
on_heartbeat(f"Downloading ({transport} transport)...")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate status hook failures from the watchdog

If a caller-supplied on_status/heartbeat hook raises here (for example because a Studio client disconnected), the exception terminates the daemon watchdog thread before it can set the stalled event. In that environment a genuinely hung Xet child can remain supervised only by the parent poll loop and never trigger the HTTP retry, so nonessential status callbacks should be wrapped/logged instead of being allowed to kill stall detection.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_cache_state.py Outdated
Comment on lines +65 to +66
if cache_dir is not None:
root = Path(cache_dir)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize cache_dir before scanning it

When callers pass a custom cache_dir that needs user expansion or resolution, such as ~/hf-cache, Hugging Face expands and resolves that path before writing, but this cache-state code scans the literal Path(cache_dir). In that scenario the child writes the expanded cache while the watchdog reports (0, False) and HTTP prep misses the partial, so a stalled Xet download under that custom cache never triggers or cleans up the fallback.

Useful? React with 👍 / 👎.

…(review 4 P2)

- _terminate_process_group always sends the post-grace SIGKILL to the whole
  process group, even when the Python leader already exited on SIGTERM, so a
  surviving Xet helper cannot keep the stalled writer alive during HTTP cleanup.
- Watchdog status/heartbeat callbacks are wrapped: a raising on_status (e.g. a
  disconnected Studio client) no longer kills the daemon thread and stops stall
  detection for a genuinely hung child.
- hf_cache_root expands ~ in a custom cache_dir (as huggingface_hub does on
  write), so a stall under e.g. ~/hf-cache is still seen and cleaned.
- Added regression tests (34 passed, 1 skipped).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27c2795880

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +618 to +619
if on_status is not None:
on_status(f"{label}: Xet stalled, retrying over HTTP")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard the retry status callback

When a stalled Xet attempt reaches the HTTP fallback, this direct on_status(...) call can raise (for example if a Studio client disconnected) and abort the retry path before disable_xet is set for the next attempt. The heartbeat callback is now isolated with _safe_status, but this remaining direct call still lets a nonessential status hook turn an otherwise recoverable Xet stall into a failed download.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
def hf_hub_download_with_xet_fallback(
repo_id: str,
filename: str,
token: Optional[str],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Default the optional token argument

When callers use this single-file helper like Hugging Face's hf_hub_download for a public repo and omit token, the current signature raises TypeError before it can hit the cache or start the supervised download. The snapshot helper already defaults token=None; doing the same here keeps unauthenticated single-file downloads from requiring a dummy third argument.

Useful? React with 👍 / 👎.

The Xet fallback runs the download in a spawn child -- a fresh interpreter that
re-imports unsloth_zoo.hf_xet_fallback and so re-runs unsloth_zoo/__init__.py,
which on any non-MLX host imports torch + transformers and does device init the
child never uses (~5-7s per download child). The cache/download submodules only
need stdlib + huggingface_hub.

Reuse the existing MLX lightweight-import path: __init__ now also honors an opt-in
UNSLOTH_ZOO_DISABLE_GPU_INIT=1 (off by default, so normal CUDA/CPU runs are
byte-for-byte unchanged), and the parent sets it in the child's environment only
around spawning the download child (restored immediately, like the transport env).
The unconditional HF cache redirect still runs, so the child writes the same cache.
MLX stub injection is now gated on actual MLX, not the generic skip flag.

Verified: normal import still runs full init (DEVICE_TYPE=cuda, torch+transformers
imported); with the flag the helper imports with torch/transformers absent; a real
download through the spawn child completes with zero heavy-init banners (child stays
light) and the parent env is restored. 35 passed, 1 skipped.
@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e93e669ee6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +480 to +484
if main_module is not None:
main_file = getattr(main_module, "__file__", None)
if not main_file or str(main_file).startswith("<"):
saved_main_file = main_file
main_module.__file__ = __file__

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid importing caller scripts in spawn children

When this wrapper is invoked from a regular Python script at module top level, __main__.__file__ is a real path, so this branch leaves it unchanged. The spawn child then imports that user script as __mp_main__, re-executes the top-level download/from_pretrained call during multiprocessing bootstrap, and exits with the standard “start a new process before bootstrapping” error; the parent then reports exited ... without a result instead of downloading. This breaks the common unguarded script pattern the comment says this workaround should handle, so the helper should avoid re-opening the caller script even when __file__ is a normal path.

Useful? React with 👍 / 👎.

cache_dir = cache_dir,
allow_patterns = allow_patterns,
ignore_patterns = ignore_patterns,
local_files_only = True,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Verify cached snapshots before the fast path

When a previous snapshot download was interrupted after Hub wrote refs/<branch> and created snapshots/<sha>, snapshot_download(..., local_files_only=True) returns that directory without checking that all allowed files are present. This fast path therefore skips the killable child even while the cache is partial or has stale incomplete state, so from_pretrained can proceed with missing model files instead of completing the download; validate the snapshot or active incomplete state before returning here.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/__init__.py
# spawning that child, never for a training/inference process. The
# unconditional HF cache redirect above still runs, so the child writes to the
# same cache as the parent.
_SKIP_GPU_INIT = os.environ.get("UNSLOTH_ZOO_DISABLE_GPU_INIT", "0") == "1"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow the shared downloader to import standalone

When Studio or a lightweight utility imports unsloth_zoo.hf_xet_fallback directly in an environment without the unsloth package, package import still leaves _SKIP_GPU_INIT false unless the caller already set this child-only env var, so the later find_spec("unsloth") guard raises before the new shared module can load. That makes the advertised standalone/Studio import path unusable unless consumers know a private pre-import knob; select the lightweight path for these cache/download modules without requiring that external setup.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +624 to +625
if on_status is not None:
on_status(f"{label}: Xet stalled, retrying over HTTP")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate stall status callback failures

When a transfer stalls, this direct on_status call runs outside _safe_status. If a UI/status callback raises at that moment (for example a disconnected Studio client, which the watchdog heartbeat path already tolerates), the exception escapes before disable_xet = True, so the HTTP retry is skipped and a recoverable Xet stall fails; wrap this notification with the same safe callback handling used for heartbeats.

Useful? React with 👍 / 👎.

Comment on lines +155 to +156
blob.unlink()
except OSError:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not resume HTTP when cleanup failed

When a stalled Xet partial cannot be removed (for example a locked file on Windows or a permission error), this branch suppresses the unlink failure and the caller still proceeds to the HTTP retry. The surrounding comments note that resuming HTTP over a sparse Xet/hf_transfer partial can silently corrupt the blob, so the retry should not continue unless the unsafe partial was actually cleared.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +128 to +129
if hf_token:
out = out.replace(hf_token, "***")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle boolean HF tokens while scrubbing errors

Hugging Face callers commonly pass token=True to use the configured token; if such a download fails in the child, this scrubber treats the boolean as a replacement string and raises TypeError, so the child exits without putting the real auth/network error on the queue and the parent reports a misleading “exited without a result”. Guard this replacement to string tokens only.

Useful? React with 👍 / 👎.

…ety, robust callbacks/tokens

P1:
- Spawn child now always repoints __main__.__file__ at this side-effect-free
  module (not only for stdin/notebook callers). An unguarded top-level script
  (python script.py with a bare from_pretrained) previously had its file
  re-imported as __mp_main__ in the child, re-running the download and failing
  with the multiprocessing bootstrapping error; now the child imports this module
  instead and downloads. Verified end-to-end with a real unguarded script.
- snapshot fast path no longer short-circuits a cached-but-incomplete snapshot
  (an interrupted download leaving .incomplete blobs / broken symlinks): it checks
  has_active_incomplete_blobs and completes the download in the killable child
  rather than letting from_pretrained load missing files.

P2:
- Xet to HTTP retry status callback wrapped in _safe_status so a raising on_status
  (disconnected client) cannot abort the recoverable retry.
- hf_hub_download_with_xet_fallback defaults token to None (parity with hf_hub_download).
- If an unsafe partial cannot be cleared before HTTP, force a clean re-download
  instead of an unsafe resume over a sparse partial.
- _default_scrub_secrets tolerates token=True (no TypeError in the child scrubber).
- Docstring: state the package imports with unsloth installed; the spawn child uses
  the UNSLOTH_ZOO_DISABLE_GPU_INIT lightweight path.

41 passed, 1 skipped.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2300e5580

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

repo_type = repo_type,
token = token,
revision = params.get("revision"),
cache_dir = params.get("cache_dir"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize snapshot cache_dir before spawning

When snapshot_download_with_xet_fallback is called with a cache_dir containing ~, the watchdog and HTTP cleanup scan Path(cache_dir).expanduser() via hf_cache_root, but the child passes the original value to snapshot_download here. In the supported huggingface_hub code path, snapshot_download builds its storage folder from that string without expanding it, so a stalled snapshot download writes under the literal ~/... path while the parent watches $HOME/... and never trips the Xet fallback; normalize the cache_dir once and use the same value for both the child and cache-state probes.

Useful? React with 👍 / 👎.

The snapshot fast path and the incomplete-state detector only inspected the
newest snapshot by mtime. A caller requesting an older revision whose snapshot
had a dangling symlink (an interrupted download) while a newer revision was
clean would read the repo as healthy and load with missing files.

- hf_cache_state.py: add snapshot_dir_has_broken_symlinks(snapshot_dir) and make
  the broken-symlink detector iterate every snapshot, not just the latest.
- hf_xet_fallback.py: validate the exact returned revision dir in the
  local_files_only fast path; clear broken symlinks across all snapshots during
  HTTP prep; write the forced-stall test partial under the caller's cache_dir
  and the repo_type-correct dir name so the watchdog actually sees it.
- tests: add regression coverage for the per-snapshot primitive, broken older
  snapshot detection, fast-path rejection of a broken requested revision, and
  all-snapshot symlink cleanup.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a04a983576

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/hf_cache_state.py Outdated
Comment on lines +223 to +224
if len(entries) <= 1:
return entries

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject folded-only cache matches

When downloading Org/Repo on a case-sensitive cache that only has models--org--repo from another repo/call, this branch returns that sole folded match because len(entries) <= 1. The watchdog/HTTP prep for Org/Repo then counts or deletes the lowercase repo's .incomplete partial before HF has created the exact models--Org--Repo directory, so a stale case-colliding cache can still disrupt the wrong download. Fresh evidence in the current patch is that the new exact-match helper still accepts a folded-only match here; it should require entry.name == target for Hub cache dirs.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +784 to +787
if (
not snapshot_dir_has_broken_symlinks(Path(cached_dir))
and not has_active_incomplete_blobs(repo_type, repo_id, cache_dir = cache_dir)
):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify absent snapshot files before returning cache

When an interrupted snapshot wrote refs/<rev> and the snapshot directory but died before creating a pointer/symlink for one allowed file, snapshot_download(..., local_files_only=True) returns the snapshot folder; these checks pass because there is no dangling symlink and no remaining .incomplete. This still short-circuits to a partial snapshot, so later from_pretrained can fail or download in-process instead of under the fallback. Fresh evidence is that the new guard only validates broken symlinks/partials, not missing expected files.

Useful? React with 👍 / 👎.

Comment on lines +506 to +508
if main_module is not None:
saved_main_file = getattr(main_module, "__file__", _UNSET)
main_module.__file__ = __file__

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear spec before spawn

When the caller is launched with python -m train or any __main__ that has __spec__.name, multiprocessing's spawn preparation uses that module name in preference to __file__. This assignment therefore does not repoint the child; it re-imports the user's module as __mp_main__, re-running top-level from_pretrained and hitting the bootstrapping RuntimeError instead of reaching _download_child_entry. Clear or replace __spec__ as well, or avoid spawn importing __main__.

Useful? React with 👍 / 👎.

Comment on lines +734 to +738
def snapshot_download_with_xet_fallback(
repo_id: str,
*,
revision: Optional[str] = None,
token: Optional[str] = None,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor local-only warmups

When a from_pretrained caller sets local_files_only=True (common for offline runs), this wrapper has no way to receive or forward that setting. If the snapshot is not fully cached, it falls through to _download_with_xet_fallback and starts a network-capable child download instead of matching Hugging Face/Transformers local-only semantics, or a caller that forwards the kwarg gets TypeError; add and honor local_files_only before spawning.

Useful? React with 👍 / 👎.

Comment on lines +704 to +706
cached = try_to_load_from_cache(
repo_id, filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expand cache_dir before probing file cache

When cache_dir is passed as ~/hf-cache, a finalized file already present in the expanded cache is missed here because try_to_load_from_cache joins the string directly, while the child hf_hub_download expands/resolves it before writing. Warm single-file loads then spawn a child and may perform Hub metadata/network work instead of the documented no-child/no-network cache short-circuit; normalize the cache dir before this probe and reuse the same value for the attempt.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen

Copy link
Copy Markdown
Member Author

One of the four is real and fixed in 61a629a; the other three I did not take (the license one per maintainer direction).

Fixed:

  • Retry transient LocalEntryNotFoundError cases (P2): huggingface_hub raises LocalEntryNotFoundError both for a genuine offline / uncached miss AND as its wrapper around a transient HEAD connection error / timeout for an uncached file ("... Please check your connection and try again"). The classifier matched the class NAME against the deterministic set before the transient-hint check, so the transient connection wrapper (which a raw ConnectionError with the same message would retry) was surfaced immediately instead of taking the HTTP retry. _is_retryable_download_error now retries a LocalEntryNotFoundError only when its message carries a transient hint (connection / timeout / reset / 5xx ...); a true offline miss still falls through to the deterministic set, stays non-retryable, and keeps its reconstructed type across the spawn boundary. Watchdog / spawn / retry-loop mechanism untouched.

Not taken:

  • Require all declared Diffusers component weights (P2): a static "weight-bearing component" heuristic false-rejects a complete download whenever a component is disabled at the repo level (safety_checker: [null, null]) or optional -- I verified it rejects a real StableDiffusionPipeline that legitimately omits the safety_checker. POST-download stays lenient by design: snapshot_download already guarantees manifest completeness on the normal path, and "at least one declared component weight present" already separates a real pipeline warm from a config-only / checkpoint-only / undeclared-leftover stale cache.
  • Keep broad globs from satisfying root model warms (P2): the loaders warm a bare model with allow_patterns=None (routed through the canonical-scoped _root_model_has_weight), never a broad ['*.safetensors'] glob. A broad glob's load intent is ambiguous (adapter / subfolder / diffusers are all valid), so requiring a canonical root weight would false-reject those legitimate broad-glob downloads.
  • Align package license metadata (P2): the AGPL-3.0-only headers on the new files are intentional and the header already documents the split from the LGPL package; reconciling the package-level license expression is a maintainer decision, out of scope for this PR.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61a629a050

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1774 to +1778
cached = try_to_load_from_cache(
repo_id, probe_filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir
)
if isinstance(cached, str) and os.path.exists(cached):
return cached

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't skip online checks for mutable file revisions

When revision is omitted or is a branch/tag, try_to_load_from_cache only resolves the locally cached ref and this returns without contacting the Hub. Unlike hf_hub_download, this bypasses the online HEAD/ETag check for mutable refs, so after main or a tag moves callers can keep receiving a stale file instead of refreshing it under the supervised child; restrict this fast path to immutable commit hashes or perform the ref check in the child.

Useful? React with 👍 / 👎.

Comment on lines +1261 to +1263
return all(
any(candidate in present for candidate in candidates) for candidates in groups.values()
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat exact allow patterns as filters after download

When allow_patterns enumerates optional exact files, such as ['tokenizer.json', 'tokenizer_config.json'] for a repo that only publishes one of them, snapshot_download succeeds because the allow list filters the repo contents rather than requiring every pattern to exist. This all(...) turns that valid subset into an incomplete result, so the wrapper retries over HTTP and then raises DownloadStallError even though both transports produced the complete filtered snapshot.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

Neither of these two is a bug I'm taking; reasoning below.

  • Don't skip online checks for mutable file revisions (P2): this is the single-file twin of the mutable-revision report on the snapshot path from the previous review, and I'm not taking it for the same reason. The try_to_load_from_cache short-circuit (like the snapshot local_files_only=True fast path) is an intentional offline-capable warm-cache resolution that avoids a per-load child spawn / network round-trip for an already-finalized blob. Gating it on immutable commit hashes would force a child spawn -- and, offline, a regression -- for the common mutable-ref (main/tag) case even when the file is already warm, which the design deliberately avoids. A moved main/tag is resolved by the caller's own hf_hub_download / from_pretrained (which performs the online HEAD), matching Hugging Face's own local-cache resolution semantics; the wrapper's job is to make the killable-child download available on a miss, not to re-implement ref freshness on the warm path.
  • Treat exact allow patterns as filters after download (P2): correct in isolation -- _requested_exact_files_present_grouped requires each distinct exact-named file, so a multi-file exact list for a repo that publishes a subset would report incomplete. But it does not fire for any warm the loaders generate: every allow list they build includes a glob in the aux patterns (*.py, *.tiktoken, additional_chat_templates/*.jinja), and the check short-circuits to True as soon as any pattern is a glob (I verified both: the all-exact subset returns incomplete, the same list plus a glob returns complete). For a general all-exact list, requiring each named file is the intended guard against a silent-stale config-only snapshot: a requested file that was never fetched leaves no dangling symlink, so this all-present check is its only guard, and relaxing it to a pure filter would let that stale case through. Interchangeable weights are already grouped (the either-format pair needs any one). So this is deliberate all-present semantics for an exact-named subset, not a path the loaders can hit.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61a629a050

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +542 to +543
if "pytorch_model.bin" in root_files and _format_kept("pytorch_model.bin"):
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require safetensors before skipping supervised download

When a previous cache contains only pytorch_model.bin for a repo that also publishes model.safetensors, this returns True and lets snapshot_download_with_xet_fallback hand the cached snapshot back without running the killable download child. A subsequent default Transformers load probes model.safetensors before falling back to .bin, so it can download the missing preferred file through the normal in-process Hub path and reintroduce the unkillable Xet hang this fallback is meant to avoid. Since the local cache alone cannot prove the safetensors file is absent remotely, a bin-only snapshot should not satisfy the conservative fast path.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
rels.append(rel)
except OSError:
return False
return bool(_filter_paths(rels, None, ignore_patterns))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require declared Diffusers components before accepting

When snapshot_download returns a stale Diffusers snapshot after a transient Hub error, this accepts the result as soon as any declared component weight is present. For example, if model_index.json declares weight-bearing components like unet and vae but the cache only contains unet/diffusion_pytorch_model.safetensors, _download_result_usable returns true and DiffusionPipeline.from_pretrained can then fetch the missing component weights in-process, potentially through the unkillable Xet path the fallback is trying to avoid. The post-download check should reject partial declared component sets instead of treating one component as proof that the pipeline is warm.

Useful? React with 👍 / 👎.

continue
try:
if entry.is_file():
weights.append(entry.relative_to(snapshot_dir).as_posix())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject noncanonical weights for broad patterned warms

When callers use broad allow patterns such as ['*'] or ['*.safetensors'] to warm a default model, a stale snapshot containing only a selected noncanonical weight (for example adapter_model.safetensors, consolidated.safetensors, or a GGUF file) is appended here and then makes _download_result_usable return true. A subsequent default from_pretrained does not read those files as the base model weight and can fetch the missing canonical model.safetensors/pytorch_model.bin in-process over Xet, so broad patterned requests need the same canonical-weight guard used by the unpatterned path.

Useful? React with 👍 / 👎.

…fetensors is preferred

A default transformers load probes model.safetensors BEFORE pytorch_model.bin, but the
pre-download fast path treated a cache holding only pytorch_model.bin as complete and
handed it back without the killable child. For a repo that ALSO publishes model.safetensors
(which the local cache cannot rule out), the in-process load then fetches the preferred
model.safetensors over un-killable Xet -- the exact hang this fallback exists to prevent.

The strict pre gate (_cache_can_skip_download) now passes prefer_safetensors=True to
snapshot_dir_is_complete: a .bin weight satisfies the fast path only when safetensors is
IGNORED (use_safetensors=False reads .bin), otherwise a bin-only cache defers to the child,
which does the authoritative Hub check. A safetensors cache, a dual-format cache, and a
use_safetensors=False bin cache all still fast-path, so the common load is unaffected.

The flag is pre-only. The POST shard check (_has_incomplete_canonical_root_shards ->
snapshot_dir_is_complete) leaves it False, so a finished bin-only download -- a genuinely
bin-only repo -- is still accepted and never false-rejected into a DownloadStallError
(POST stays lenient; a bin-only cache is not positive breakage evidence there).

Adds a regression test (184 -> 185). Verified against the safety-invariant fuzz (0
false-accepts) and the stall->HTTP e2e sim.

Not taken (same review):
- Require ALL declared Diffusers components: a static weight-bearing-component heuristic
  false-rejects a complete download when a component is disabled at the repo level
  (safety_checker: [null, null]) or optional; POST stays lenient and one declared
  component weight already separates a real warm from a config-only / undeclared-leftover
  stale cache.
- Reject noncanonical weights for broad patterned warms: a base load passes
  allow_patterns=None (routed through the canonical-scoped _root_model_has_weight, which
  this change makes safetensors-aware), never a broad ['*'] / ['*.safetensors'] glob; a
  broad glob's intent is ambiguous (adapter / subfolder / diffusers), so requiring a
  canonical root weight would false-reject those legitimate broad-glob downloads.
@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks for the review. Addressed in 26bd3d56.

Fixed (finding 1, real strict-PRE hang): bin-only cache when safetensors is preferred.
A default transformers load probes model.safetensors before pytorch_model.bin, but the pre-download fast path treated a cache holding only pytorch_model.bin as complete and loaded in-process. For a repo that also publishes model.safetensors (which the local cache cannot rule out), that in-process load then fetches the preferred safetensors over un-killable Xet, exactly the hang this fallback exists to prevent. The strict pre gate (_cache_can_skip_download) now passes prefer_safetensors=True to snapshot_dir_is_complete: a .bin weight satisfies the fast path only when safetensors is ignored (use_safetensors=False), otherwise a bin-only cache defers to the killable child, which does the authoritative Hub check. A safetensors cache, a dual-format cache, and a use_safetensors=False bin cache all still fast-path, so the common load is unaffected. The flag is pre-only: the POST shard check leaves it False, so a genuinely bin-only repo's finished download is still accepted and never false-rejected into a DownloadStallError. Added a regression test (185 pass); safety-invariant fuzz stays at 0 false-accepts and the stall to HTTP e2e sim stays green.

Not taken (finding 2): require all declared Diffusers components.
A static weight-bearing-component heuristic false-rejects a genuinely complete download: a component can be disabled at the repo level (safety_checker: [null, null] ships no weights) or be optional, and requiring its weights fails a download that correctly omitted them. The POST path stays lenient by design, and one declared component weight already separates a real warm from a config-only or undeclared-leftover stale cache.

Not taken (finding 3): reject noncanonical weights for broad patterned warms.
A base load passes allow_patterns=None (routed through the canonical-scoped _root_model_has_weight, which this change makes safetensors-aware), never a broad ['*'] / ['*.safetensors'] glob. A broad glob's intent is ambiguous (adapter / subfolder / diffusers), so requiring a canonical root weight there would false-reject those legitimate broad-glob downloads. No loader reaches the gate with a broad glob for a base load.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26bd3d56f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
rels.append(rel)
except OSError:
return False
return bool(_filter_paths(rels, None, ignore_patterns))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require missing Diffusers components before accepting cache

When model_index.json declares multiple weight-bearing components, this accepts the snapshot as soon as any declared component weight is present. For example, a stale cache from a prior vae/*-only download with unet also declared but absent passes _download_result_usable, so the later DiffusionPipeline.from_pretrained still has to fetch the missing unet weights in-process over Xet, bypassing the stall fallback this code is meant to enforce. Please verify the declared weight-bearing components needed by the pipeline, not just that one declared component has a weight.

Useful? React with 👍 / 👎.

# a numbered shard of the read (preferred) format with NO index in its directory: the load cannot
# enumerate the set and would fetch the index + remaining shards over Xet.
preferred = "safetensors" if "safetensors" in fmts else "bin"
if preferred not in index_fmts.get(dir_rel, set()):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match shard indexes to the selected shard set

This check keys index_fmts only by directory and format, so any safetensors index in the same directory can mask a selected lone shard even when it belongs to a different weight. For example, with allow_patterns=['adapter_model*'], a cache containing adapter_model-00001-of-00002.safetensors but no adapter_model.safetensors.index.json is accepted if an unrelated safetensors index is present, and the subsequent load will fetch the missing adapter index/shards in-process over Xet instead of using the fallback.

Useful? React with 👍 / 👎.

Comment on lines +1243 to +1244
if not allow or any(not isinstance(p, str) or _has_glob(p) for p in allow):
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require exact files even when allow_patterns has globs

When an allow list mixes a glob with exact required files, this returns early and never verifies the exact names. A common adapter warm like allow_patterns=['adapter_model*', 'adapter_config.json'] is accepted with only adapter_model.safetensors on disk, so the later PEFT load has to fetch the missing adapter_config.json in-process instead of going through the watched fallback.

Useful? React with 👍 / 👎.

continue
except OSError:
return False
return bool(_filter_paths(weights, allow_patterns, ignore_patterns))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Distinguish root model globs from adapter weights

For a root-level glob such as allow_patterns=['*.safetensors'], this treats any selected loadable weight as satisfying the default model load. If a stale cache contains only adapter_model.safetensors, _download_result_usable returns true even though from_pretrained will still probe model.safetensors / pytorch_model.bin and fetch the missing base weight in-process over Xet; root-model glob requests need to require a canonical root weight rather than any selected weight.

Useful? React with 👍 / 👎.

# No psutil / /proc: fall back to following only newly-created (post-baseline) partials.
owned = {name: n for name, n in sizes.items() if name not in baseline}
return (sum(owned.values()), len(owned) > 0)
return get_hf_download_state(repo_ids, repo_type = repo_type, cache_dir = cache_dir)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope snapshot stall progress to the child process

Snapshot attempts fall back to the repo-wide cache measurement here, so a concurrent same-repo download that is still writing any other blob keeps changing current_size and resets the stalled snapshot child's timer. In that environment the child with its own unchanged .incomplete can hang indefinitely without switching to HTTP until the sibling stops; the snapshot watchdog should use the same child-owned partial scoping as single-file downloads, or otherwise exclude sibling progress.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks for the pass. Checked all five against the actual loader allow-lists (maybe_prefetch_hf_snapshot) and real HF layouts; none holds up, so no change on 26bd3d56. Details, with the empirical result each rests on:

1. Require missing Diffusers components (POST) - not taken.
model_index.json declares weightless components right alongside weight-bearing ones. A standard StableDiffusionPipeline declares unet/vae/text_encoder (weights) and scheduler/tokenizer/feature_extractor (configs only) and safety_checker: [null, null] (disabled). Requiring a weight per declared component false-rejects that complete pipeline on all four config-only components, and there is no static way to tell a required weight-bearing component from a declared-but-weightless one without instantiating the pipeline class. POST is the lenient silent-stale backstop (the child's snapshot_download already did the authoritative manifest compare); one declared component weight already separates a real warm from a config-only / undeclared-leftover stale cache. The vae-only-vs-unet example is two different downloads: a vae/*-scoped request accepting a vae-only result is correct, and a later full-pipeline load runs its own pre-download gate.

2. Match shard indexes to the selected shard set - not taken (not reachable).
The only same-directory index in a real layout is the root model index (model.safetensors.index.json / pytorch_model.bin.index.json), and it is already excluded from index_fmts (the _ROOT_MODEL_SHARD_INDEX_RE skip), so a lone adapter_model-00001-of-00002.safetensors without its index is flagged incomplete. Reproduced: with a root model index co-resident, _selected_shard_index_incomplete returns True. Masking only occurs if a differently-based non-root index (foobar.safetensors.index.json) sits in the same dir as the lone shard, which no real repo ships. An adapter/component dir holds exactly one weight base, so there is no second index to mask it.

3. Require exact files even when allow_patterns has globs - not taken.
The loader allow-list is a best-effort superset, not a manifest: the adapter warm is ["adapter_config.json", "adapter_model*", ...ROOT_AUX...], where ROOT_AUX lists ~25 optional aux names (generation_config.json, added_tokens.json, merges.txt, spiece.model, vocab.txt, ...) most repos never ship. Reproduced: requiring the exact names present false-rejects a complete adapter cache on 11 absent optional files, turning a good download into a DownloadStallError. That is why the exact-file check stays lenient once a glob is present. The shared helper is generic (also used by Studio) and cannot encode "adapter_config.json is mandatory but generation_config.json is optional"; the weight the load reads (the real stall risk) is already verified by _selected_readable_weight_complete.

4. Distinguish root model globs from adapter weights - not taken (not reachable).
No loader emits a bare ['*.safetensors'] for a base-model load. A bare from_pretrained passes allow_patterns=None (routed through the canonical-scoped root-weight check); only gguf / tokenizer / adapter / subfolder warms are patterned, and each is scoped (adapter_model*, <subfolder>/*, the gguf path). A broad root glob whose intent is "the base model" is never generated, so requiring a canonical root weight there would only false-reject legitimate broad-glob warms without protecting any real load.

5. Scope snapshot stall progress to the child process - not taken.
This is the documented dual-mode choice: a snapshot pull is every blob of the repo, so the repo-wide measurement tracks that one pull's progress; single-file downloads scope to the child's own partial precisely because a repo can hold unrelated siblings. The failure described needs two concurrent snapshot pulls of the same repo in separate processes, each running this fallback - uncommon, and HF's per-repo cache locks largely serialize same-repo pulls. Switching snapshots to child-pid fd scoping risks the opposite and worse failure: if any in-progress blob is briefly not among the child's open fds, a genuinely-progressing snapshot reads as stalled and a working snapshot_download gets killed and retried over HTTP. The repo-wide measure reliably observes progress across hf_xet's internal workers, which is the common single-process case the watchdog must never misfire on.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26bd3d56f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
rels: list = []
try:
for entry in snapshot_dir.rglob("*"):
if not _is_default_load_weight_file(entry.name):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude variant component weights from plain diffusers checks

When a previous variant='fp16' diffusers download left only files such as unet/diffusion_pytorch_model.fp16.safetensors, a later plain warmup with variant=None accepts that snapshot because this check only verifies the suffix. _download_result_usable can then return true even though the plain pipeline load needs the non-variant component weight, so the subsequent in-process load may fetch the missing diffusion_pytorch_model.safetensors through Xet instead of retrying over HTTP.

Useful? React with 👍 / 👎.

Comment on lines +1275 to +1279
if not _is_loadable_weight_file(entry.name):
continue
try:
if entry.is_file():
weights.append(entry.relative_to(snapshot_dir).as_posix())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore training checkpoints for patterned model warms

When a patterned model request is broad enough to include training checkpoints, such as allow_patterns=['*'] or ['*.bin'], this loop treats checkpoint-500/pytorch_model.bin as the selected readable model weight. Those checkpoint subtrees are not read by the subsequent base from_pretrained, so a stale checkpoint-only snapshot can pass _download_result_usable and then fetch the missing root weights in-process via Xet instead of retrying over HTTP.

Useful? React with 👍 / 👎.

Comment on lines +788 to +792
base = index_name.split(marker, 1)[0]
if not base:
return None
prefix = f"{dir_rel}/" if dir_rel else ""
return f"{prefix}{base}-00001-of-00002.{ext}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve variant tokens in malformed-index probes

For variant shard indices like diffusion_pytorch_model.safetensors.index.fp16.json, this builds a probe without the fp16 token. With an allow pattern such as unet/*fp16*, a truncated or malformed variant index is treated as out of scope; if one matching shard is present, the snapshot can be accepted even though the loader will read the broken index or missing shards in-process rather than retrying over HTTP.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/hf_xet_fallback.py Outdated
Comment on lines +1485 to +1486
if _patterns_are_exact_names(allow_patterns):
return False # an exact-named subset defers to the exact-file presence check

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate exact sharded weight requests

When allow_patterns is an exact list, this returns before checking shard indices. An exact request that includes a sharded weight index plus only some of the listed shards can therefore pass _download_result_usable as soon as the named files exist, even though the loader will read the index and fetch the missing shards in-process through Xet instead of retrying the incomplete snapshot over HTTP.

Useful? React with 👍 / 👎.

…oad check

_has_diffusers_component_weight counted ANY loadable weight in a declared component subfolder,
including a variant one (unet/diffusion_pytorch_model.fp16.safetensors). So a PLAIN diffusers
warm (variant=None) whose silently-stale snapshot_download returned a cache left by a prior
variant='fp16' pull passed _download_result_usable, and the in-process DiffusionPipeline load --
which reads the NON-variant diffusion_pytorch_model.safetensors -- then fetched the missing plain
weight over un-killable Xet, the hang this fallback exists to prevent.

The plain component-presence check now counts only CANONICAL (non-variant) component weights via
_CANONICAL_COMPONENT_WEIGHT_RE (a base with no intermediate dotted token, single or numbered shard,
safetensors/bin). This mirrors the checks that were already variant-aware: the root check
(_CANONICAL_ROOT_MODEL_WEIGHT_RE excludes model.fp16.safetensors) and the plain component SHARD
regex in _diffusers_component_shards_incomplete. A complete plain pipeline, a pipeline shipping both
plain and variant weights, and the variant='fp16' warm of the same cache are all still accepted --
the variant analog (_has_diffusers_component_variant_weight) is untouched, so no complete download is
false-rejected into a DownloadStallError.

Adds a regression test (185 -> 186). Safety-invariant fuzz stays at 0 false-accepts; stall->HTTP
e2e sim stays green.

Not taken (same review):
- Ignore training checkpoints for patterned model warms (allow=['*'] / ['*.bin']): not
  loader-reachable -- a base load passes allow_patterns=None (root-scoped, checkpoint subtrees
  already excluded), never a broad ['*'] / ['*.bin']; the patterned warms are gguf / adapter /
  subfolder-scoped.
- Preserve variant tokens in malformed-index probes (allow=['unet/*fp16*']): not loader-reachable
  -- a variant diffusers warm passes allow=None or the subfolder ['unet/*'] glob, both of which
  already flag a malformed variant component index (the probe survives an allow=None or 'unet/*'
  filter); no loader emits a variant-infix glob.
- Validate exact sharded weight requests: not loader-reachable -- every loader warm carries a glob
  (*.py, ...), so _patterns_are_exact_names is never true for a loader; a direct caller that lists an
  index but only some of its shards is honoring exactly the files it named.
@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks for the pass. Checked all four; one was real and is fixed in 21fcb558, the other three are not reachable from any loader.

Fixed (finding 1): variant component weights leaking into the plain Diffusers check.
_has_diffusers_component_weight counted any loadable weight in a declared component subfolder, including a variant one (unet/diffusion_pytorch_model.fp16.safetensors). So a plain warm (variant=None) whose silently-stale snapshot_download returned a cache left by a prior variant='fp16' pull passed _download_result_usable, and the in-process DiffusionPipeline load (which reads the non-variant diffusion_pytorch_model.safetensors) then fetched the missing plain weight over un-killable Xet. The plain presence check now counts only canonical (non-variant) component weights via _CANONICAL_COMPONENT_WEIGHT_RE, mirroring the checks that were already variant-aware: the root check (_CANONICAL_ROOT_MODEL_WEIGHT_RE excludes model.fp16.safetensors) and the plain component shard regex. Reproduced: the variant-only cache is now rejected for a plain load, while a complete plain pipeline, a pipeline shipping both plain and variant, and the variant='fp16' warm of the same cache all stay accepted (no false-reject). Regression test added (186 pass); safety-invariant fuzz stays at 0 false-accepts.

Not taken (finding 2): ignore training checkpoints for patterned model warms.
Reachability requires a base-model request with allow_patterns=['*'] or ['*.bin']. No loader emits that: a bare from_pretrained passes allow_patterns=None (routed through the root-scoped canonical check, which already excludes checkpoint-N/ subtrees), and the only patterned warms are gguf / adapter / subfolder-scoped. A broad glob that pulls checkpoint weights for a base load is never generated, so _has_selected_weight counting a checkpoint weight is not on any real load path.

Not taken (finding 3): preserve variant tokens in malformed-index probes.
The miss needs allow_patterns=['unet/*fp16*'] (a variant-infix glob). No loader emits one: a variant Diffusers warm passes allow_patterns=None or the subfolder glob ['unet/*'], and both already flag a malformed variant component index. Reproduced: with a malformed unet/diffusion_pytorch_model.safetensors.index.fp16.json, the allow=None path (_diffusers_component_shards_incomplete) filters the probe against None and the ['unet/*'] path matches the variant-stripped probe against unet/*, so both return incomplete. Only a hand-crafted *fp16* allow slips through, which no load path constructs.

Not taken (finding 4): validate exact sharded weight requests.
The exact fast-path (_patterns_are_exact_names -> skip the shard check) is only hit by a pure exact allow list. Every loader warm carries a glob (*.py, additional_chat_templates/*.jinja, adapter_model*, ...), so _patterns_are_exact_names is never true for a loader-generated request. A direct API caller that explicitly lists a shard index plus only some of its shards is requesting exactly those files -- snapshot_download fetches only what is named, and honoring that exact scope is correct rather than a bug.

@danielhanchen

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21fcb5580d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

repo_id, probe_filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir
)
if isinstance(cached, str) and os.path.exists(cached):
return cached

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Refresh mutable file revisions before returning cache

When revision is omitted or is a branch/tag, try_to_load_from_cache only resolves the local cached ref and does not perform the HEAD/etag check that hf_hub_download normally does. Returning here means a previously cached main file is reused forever after the Hub repo moves, unless the caller knows to pass force_download=True, so users can silently load stale weights/configs.

Useful? React with 👍 / 👎.

ignore_patterns = ignore_patterns,
variant = variant,
):
return cached_dir

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Refresh mutable snapshots before taking the fast path

For the common revision=None/main case, this local_files_only=True probe validates only the cached snapshot currently recorded in refs/main; it never checks whether the remote branch now points at a newer commit as snapshot_download normally would. A complete old snapshot therefore bypasses the child download and users keep loading stale model contents until they force a download.

Useful? React with 👍 / 👎.

logical = _exact_weight_logical(base)
groups.setdefault((parent, logical), []).append(rel)
return all(
any(candidate in present for candidate in candidates) for candidates in groups.values()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow unmatched exact snapshot filters

When allow_patterns is an exact list, Hugging Face treats those names as filters, not required files: an unmatched pattern is fine if the repo simply does not contain that optional file. This check instead requires every exact non-weight-equivalent name to be present after a successful child download, so a request such as ['tokenizer.json', 'tokenizer.model'] against a repo that only has one tokenizer file is retried over HTTP and then fails with DownloadStallError despite the snapshot being valid.

Useful? React with 👍 / 👎.

Comment thread unsloth_zoo/__init__.py
# spawning that child, never for a training/inference process. The
# unconditional HF cache redirect above still runs, so the child writes to the
# same cache as the parent.
_SKIP_GPU_INIT = os.environ.get("UNSLOTH_ZOO_DISABLE_GPU_INIT", "0") == "1"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make shared fallback importable without Unsloth

When Studio or another lightweight caller imports unsloth_zoo.hf_xet_fallback in a normal process, this opt-in gate leaves _SKIP_GPU_INIT false unless the caller already knew to set UNSLOTH_ZOO_DISABLE_GPU_INIT=1, so package import continues into the existing torch/unsloth requirements and can raise before the new shared fallback module is available. Since the new module is intended to be imported outside the training process, it needs a default import-light path for that context rather than only for spawned children.

Useful? React with 👍 / 👎.

continue
except OSError:
return False
return bool(_filter_paths(weights, allow_patterns, ignore_patterns))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require base weights for broad patterned model warms

When a model snapshot is checked with a broad allow_patterns such as ['*.safetensors'], this accepts any selected loadable weight as proof the model warm is usable. A stale cache containing only adapter_model.safetensors therefore passes _download_result_usable even though a default base-model load reads model.safetensors/pytorch_model.bin and will fetch the missing base weight in-process over Xet after the protective child returns.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks. Checked all five against the loader paths and the module's actual consumers; none needs a change on 21fcb558.

Not taken (1 & 2): refresh mutable file/snapshot revisions before the fast path.
This is the intentional offline-capable warm short-circuit, and it is skipped under force_download=True (the escape hatch, alongside pinning a revision/SHA). The fast path is a pre-warm, not the loader: after it returns, the normal from_pretrained still runs and governs freshness exactly as it does today, so a moved main is handled by the load's own resolution, not silently pinned by the prefetch. The whole point of the short-circuit is to avoid spawning the killable child (and to stay offline-capable) when the cached ref is already complete; forcing a HEAD/etag check on every warm cache would defeat both and reintroduce the per-load Xet exposure this PR exists to bound. Same behavior HF itself gives a warm cache.

Not taken (3): allow unmatched exact snapshot filters.
Reproduced: this only bites a pure exact list such as ['tokenizer.json', 'tokenizer.model'], and no loader emits one. Every prefetch allow-list carries a glob (*.py, additional_chat_templates/*.jinja, adapter_model*, ...), so _requested_exact_files_present_grouped bails (returns accept) before requiring any name; the tokenizer-only warm with a repo shipping just tokenizer.json is accepted, not retried. The other real consumer (Studio) uses its own cache-state fork and does not call this. A direct caller passing a pure-alternative exact list is the only way to reach the strict branch, and none exists today; if one ever does, the fix belongs there rather than loosening the guard that catches a stale cache genuinely missing a named file.

Not taken (4): make the shared fallback importable without Unsloth.
The import-light path is deliberately opt-in (UNSLOTH_ZOO_DISABLE_GPU_INIT=1, off by default) so a normal training/inference import unsloth_zoo is byte-for-byte unchanged and still runs the required torch/allocator/device init. The download child sets that flag (that is exactly its purpose); Studio imports its own fork, not this module. Defaulting to the light path would regress the primary consumer (the training process) by skipping init it needs. There is no lightweight caller that both imports this module and cannot set the flag, so the current default is correct.

Not taken (5): require base weights for broad patterned model warms.
Not loader-reachable: a base-model load passes allow_patterns=None (routed through the canonical root-weight check), never a broad ['*.safetensors']. The patterned warms are gguf / adapter / subfolder-scoped, none of which is a base-model glob, so _has_selected_weight is never asked to prove a base model from a broad glob. This is the same case raised (and declined) in the prior two reviews.

…atchdog, and honor exact variant shard requests

Three fixes from a consolidated 15-reviewer pass (12x reviewer.py personas + 3 forked reviewers).

1. from_tf / from_flax POST false-reject (HIGH, loader-reachable). A base load with
   from_tf=True / from_flax=True prefetches with ignore=['*.safetensors','*.bin',...] and reads
   tf_model.h5 / flax_model.msgpack. _root_model_has_weight recognized a root weight only via
   _CANONICAL_ROOT_MODEL_WEIGHT_RE (safetensors/bin), so a COMPLETE h5/msgpack download read as
   "no weight" -> POST rejected it -> HTTP retry -> same h5-only result -> DownloadStallError on a
   working load. It now also counts a canonical TF/Flax root weight
   (_CANONICAL_ROOT_TF_FLAX_WEIGHT_RE), gated on _pytorch_root_weight_formats_ignored (BOTH
   model.safetensors and pytorch_model.bin dropped by the ignore filter -- the from_tf/from_flax
   signature). A normal load keeps a PyTorch format so the gate is False and behavior is unchanged;
   a stray leftover h5 under a normal load still does not count.

2. Single-file watchdog resumed-partial hang without pid ownership (Med, Windows/macOS without
   psutil). When a child_pid is supplied but its open files cannot be inspected (no psutil AND no
   /proc), _measure followed only post-baseline partials; a RESUMED partial reuses a baseline
   blob-hash name, so a frozen Xet resume was excluded forever and never tripped the stall -- the
   exact hang the fallback exists to prevent. That branch now falls back to the repo-wide measure
   (as snapshots use), so a resumed partial is watched. The child_pid=None path (no pid at all)
   keeps the post-baseline name filter unchanged; the precise open-fds path (psutil / /proc) is
   untouched.

3. Exact-named variant shard requests false-rejected (Low). The plain branch skipped shard-set
   completeness for an exact-name allow list, but the variant branch did not, so
   allow=['model.fp16-00001-of-00002.safetensors'] + variant='fp16' -- a satisfied exact request --
   was rejected as an incomplete shard set. The _patterns_are_exact_names escape now sits above both
   branches, so an exact-named subset (variant or plain) defers to the exact-file presence check.

Adds 3 regression tests (186 -> 189). Safety-invariant fuzz stays at 0 false-accepts (PRE stays
conservative -- these are all POST leniency / watchdog fixes); stall->HTTP e2e sim stays green.

Not taken (same review pass):
- Injected prepare_for_http_fn missing owned_incomplete_blobs (flagged by several personas): by
  design and documented at the call site -- an injected (Studio) hook owns its own marker-based cache
  accounting and keeps the plain (repo_type, repo_id) signature; the default hook gets the owned set.
- Missing HF kwargs (local_dir / max_workers / etag_timeout): the helper is a cache-based prefetch
  wrapper, not a full snapshot_download drop-in; local_dir bypasses the cache the completeness gate
  reasons about. Current callers pass only what the prefetch needs.
- token dropped on the in-process cache probes: both are local_files_only=True (cache-only, no
  network), so no auth is needed there.
- License metadata (LGPL vs AGPL): out of scope per maintainer.
- Snapshot repo-wide stall masked by a concurrent same-repo pull in a separate process: documented
  tradeoff; per-pid snapshot scoping would risk false-stalling a progressing snapshot.
…t-checkpoint shard sets, harden the process-group kill, and drop dead code

Four fixes from a second consolidated review pass (12x reviewer.py personas + 3 forked reviewers) on 930d027.

1. Sharded TF/Flax POST false-accept -> hang (from the prior from_tf/from_flax fix). TF/Flax
   weights can be SHARDED (transformers TF2_WEIGHTS_INDEX_NAME=tf_model.h5.index.json,
   FLAX_WEIGHTS_INDEX_NAME=flax_model.msgpack.index.json). _CANONICAL_ROOT_TF_FLAX_WEIGHT_RE matched
   the shard form, so a LONE shard read as a present weight and _download_result_usable accepted an
   INCOMPLETE sharded set -> the in-process load then fetched the missing shard over un-killable Xet.
   The regex is now single-file only (tf_model.h5 / flax_model.msgpack); a sharded set is judged
   through its index via _weight_shard_index_complete (every listed shard present). Complete sharded
   and single-file downloads are accepted; an incomplete shard set or a lone shard without its index
   is retried over HTTP.

2. Explicit-checkpoint shard set accepted incomplete -> hang. _selected_shard_index_incomplete
   skipped EVERY training-checkpoint subtree, but subfolder=checkpoint-N (allow=['checkpoint-N/*'])
   is an explicit load of that checkpoint. A lone checkpoint-7/model-00001-of-00002.safetensors with
   no index was accepted, then the load fetched the missing index/shards over Xet. The skip is now
   conditional (_request_scopes_into_dir): a checkpoint the request explicitly targets is
   completeness-checked, while a leftover checkpoint a base/adapter/other-subfolder warm does not
   read is still skipped (no false-reject).

3. _terminate_process_group could signal the WRONG process group. The SIGTERM path called
   os.killpg(pid) whenever pid was non-null; before the child runs os.setsid() its pid is not yet a
   pgid, and a recycled pid could collide with an unrelated group. It now signals the group only when
   os.getpgid(pid) == pid (the child is confirmed its own leader), else falls back to a single-process
   signal -- matching the caution the post-reap SIGKILL branch already documents.

4. Removed dead _has_any_weight (defined, never called; the readable-weight checks supersede it) and
   the two docstrings that named it.

Adds 2 regression tests (189 -> 191). Safety-invariant fuzz stays at 0 false-accepts; stall->HTTP
e2e sim (which exercises the kill path) stays green.

Not taken (same review pass):
- Require ALL declared Diffusers components in POST: a static heuristic false-rejects a complete
  pipeline that declares weightless components (scheduler/tokenizer/feature_extractor,
  safety_checker:[null,null]); POST stays lenient and one declared component weight already
  separates a real warm from a config-only stale cache.
- Injected prepare_for_http_fn missing owned_incomplete_blobs (5 personas): by design and documented
  at the call site -- an injected (Studio) hook owns its own marker-based cache accounting and keeps
  the plain (repo_type, repo_id) signature.
- Transient-error / crashed-child HTTP retry uses the coarse mtime guard rather than an owned-partial
  set: the stall path captures owned partials from the child's open fds, but a crashed/errored child
  has no fds to inspect, so the coarse guard (spares actively-progressing siblings) is the only
  option there -- inherent, not an asymmetry that can be closed.
- Broad ['*.safetensors'] base-model warm accepts an adapter-only cache: not loader-reachable (a base
  load passes allow_patterns=None, never a broad glob).
- Missing HF kwargs (local_dir / max_workers / etag_timeout): the helper is a cache-based prefetch
  wrapper, not a full snapshot_download drop-in; local_dir bypasses the cache the completeness gate
  reasons about. Current callers pass only what the prefetch needs.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2089bfd95

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +832 to +836
if want_variant is None:
shard_file_re = re.compile(r"^[^.]+-\d{5}-of-\d{5}\.(?:safetensors|bin)$")
else:
v = re.escape(want_variant)
shard_file_re = re.compile(rf"^[^.]+\.{v}-\d{{5}}-of-\d{{5}}\.(?:safetensors|bin)$")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check TF/Flax shard sets for patterned warms

For patterned from_tf/from_flax warms such as allow_patterns=['*.h5'] or ['*.msgpack'], this helper only recognizes safetensors/bin shard files and indices. A stale snapshot containing only tf_model-00001-of-00002.h5 (or an *.h5.index.json missing a shard) therefore passes _download_result_usable, and the later from_pretrained(..., from_tf=True) / from_flax=True load still has to fetch the index or remaining shard in-process over Xet instead of taking the supervised HTTP retry path. Please include the TF/Flax shard/index names in the selected-shard completeness check, not just the unpatterned root check.

Useful? React with 👍 / 👎.

Comment on lines +1165 to +1166
if index_path.is_file() and _weight_shard_index_complete(index_path):
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor ignores for TF/Flax shard indices

When the request excludes one of the TF/Flax formats, the single-file path above applies ignore_patterns, but this sharded-index path accepts any complete TF/Flax index regardless of whether that format was ignored. For example, a from_tf warm that ignores *.msgpack can be satisfied by a stale complete flax_model.msgpack.index.json while tf_model.h5 is absent, so the later from_pretrained(..., from_tf=True) fetches the missing TF weights in-process over Xet instead of using the supervised retry. Apply the same format filter to these index checks before accepting them.

Useful? React with 👍 / 👎.

Comment on lines +411 to +415
"adapter_model.safetensors",
"adapter_model.bin",
"consolidated.safetensors",
"consolidated.00.pth",
"diffusion_pytorch_model.safetensors",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Classify dashed adapter/component shard globs as weights

The probe set omits dashed adapter and component shard names, so filters like allow_patterns=['adapter_model-*', 'adapter_model.safetensors.index.json'] or ['unet/diffusion_pytorch_model-*', ...] are classified as weightless. In those patterned shard warms, _download_result_usable skips the readable-weight and shard-completeness checks entirely, so an index-only or config-only stale snapshot can be accepted and the later PEFT/Diffusers load fetches the missing shards in-process over Xet. Add representative dashed adapter/component shard probes so these requests stay weight-bearing.

Useful? React with 👍 / 👎.

Comment on lines +1546 to +1548
if not is_diffusers and _request_selects_canonical_root_shards(
allow_patterns, ignore_patterns
) and _has_incomplete_canonical_root_shards(snapshot_dir, ignore_patterns = ignore_patterns):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope root shard rejects to the allowed format

For patterned root warms, this only checks that the request selects some canonical root shard and then runs the root-shard completeness check without allow_patterns. With allow_patterns=['pytorch_model*'], a successful bin-only snapshot containing pytorch_model.bin is rejected if an unrelated stale model.safetensors.index.json is missing shards, even though that safetensors set is outside the filter and a bin load will not read it; after the HTTP retry this becomes a DownloadStallError for a valid filtered snapshot. Thread the allow scope into the root shard check or choose the selected format before failing.

Useful? React with 👍 / 👎.

Comment on lines +1042 to +1043
_CANONICAL_COMPONENT_WEIGHT_RE = re.compile(
r"^[^.]+(?:-\d{5}-of-\d{5})?\.(?:safetensors|bin)$"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject noncanonical weights in Diffusers components

This regex accepts any no-dot safetensors/bin basename inside a declared Diffusers component, including files such as unet/adapter_model.safetensors. A stale cache with model_index.json declaring unet and only that adapter sidecar present passes _download_result_usable, but DiffusionPipeline will still look for the component's canonical weight such as diffusion_pytorch_model.safetensors/model.safetensors and can fetch it in-process over Xet. Restrict declared component evidence to the weight basenames Diffusers/Transformers components actually load.

Useful? React with 👍 / 👎.

…n CI

Two fixes from a third consolidated review pass (12x reviewer.py personas + 3 forked reviewers) on b2089bf.

1. Nested checkpoint subfolder POST false-accept -> hang. _request_scopes_into_dir matched the
   checkpoint dir only against the FIRST path segment, so a nested subfolder=foo/checkpoint-7
   (allow=['foo/checkpoint-7/*']) did not register as an explicit checkpoint load: the checkpoint
   subtree was skipped, a lone incomplete shard there was accepted, and the in-process load then
   fetched the missing index/shards over un-killable Xet -- the same hang the flat-checkpoint fix
   closed, one directory deeper. The scope check now matches the target dir against ALL literal
   leading segments (up to the first glob), so an explicitly-loaded checkpoint at any depth is
   completeness-checked while a leftover checkpoint the request does not target is still skipped.

2. test_hf_xet_fallback.py now runs in CI. The 191-test fallback suite was only reached by
   `pytest --collect-only` (continue-on-error), so it could regress without failing CI. It is now in
   the behavioral HARD GATE alongside the other CPU-pure gated files, so a future change that breaks
   the watchdog / cache-completeness / Xet->HTTP fallback behavior blocks the merge.

Adds a nested-checkpoint regression assertion. Full suite 191 passed; safety-invariant fuzz stays at
0 false-accepts.

Known low-reachability edge (documented, not fixed): a from_tf/from_flax load with a non-root
subfolder AND a SHARDED TF/Flax weight AND a silent-stale transient can POST-accept an incomplete
shard set -- the sharded TF/Flax completeness check landed on the allow=None root path, and the
patterned-subfolder path's index detection is safetensors/bin-only. The conjunction (from_tf is
already niche + subfolder + sharding + a transient-error stale return) is essentially unreachable,
and PRE still defers safely; extending the generic patterned-shard machinery to h5/msgpack is not
worth its surface for that edge.

Not taken (same review pass):
- Retryable-error / crashed-child HTTP retry uses the coarse mtime guard, not an owned-partial set:
  the stall path captures owned partials from the LIVE child's open fds, but a child that returned a
  retryable error / crashed has already exited, so there are no fds to inspect -- the coarse guard
  (which spares an actively-progressing sibling by mtime) is the only option there. Inherent, not a
  closable asymmetry.
- Injected prepare_for_http_fn missing owned_incomplete_blobs: by design and documented at the call
  site -- a Studio hook owns its own marker-based cache accounting.
- Broad ['*.safetensors'] / ['model*','adapter_model*'] accepting one selected weight: not
  loader-reachable (a base load passes allow=None; no loader requests base+adapter together).
- Exact ['config.json','tokenizer.json'] over-requiring optional files: not loader-reachable (every
  loader allow-list carries a glob, so the exact-file check is never the deciding gate).
- import unsloth_zoo.hf_xet_fallback not import-light by default: intentional opt-in
  (UNSLOTH_ZOO_DISABLE_GPU_INIT=1); the download child sets it, Studio imports its own fork.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb80cf9a16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +456 to +457
if base.endswith(_WEIGHT_FILE_SUFFIXES):
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat trainer state files as weightless filters

When an exact allow_patterns entry names a trainer artifact such as optimizer.pt, training_args.bin, or rng_state_0.pth, this branch marks the request as weight-bearing solely because the basename has a weight-like suffix. The later readable-weight check excludes those same artifacts via _is_loadable_weight_file, so a successful filtered snapshot containing the requested state file is classified as incomplete and retried/failed with DownloadStallError. Apply the same non-weight basename/prefix exclusion here before returning true for exact suffix matches.

Useful? React with 👍 / 👎.

Comment on lines +407 to +410
"model.fp16.safetensors",
"model-00001-of-00002.safetensors",
"pytorch_model.bin",
"pytorch_model-00001-of-00002.bin",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Classify pytorch variant globs as weights

For variant-specific bin filters such as allow_patterns=['pytorch_model.fp16*'], this probe set has no representative pytorch_model.fp16.bin / sharded variant name, so _pattern_can_select_weight classifies the request as weightless. A stale config-only snapshot is then accepted even with variant='fp16', and the later from_pretrained(..., variant='fp16', use_safetensors=False) load can fetch the missing pytorch_model.fp16.bin in-process over Xet; add PyTorch-bin variant probes alongside the safetensors variant probe.

Useful? React with 👍 / 👎.

_active_incomplete_blob_sizes(repo_type, repo_id, params.get("cache_dir"))
)
owned = current - baseline_partials
params["_owned_incomplete_blobs"] = owned

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't scope HTTP cleanup to an empty owned set

When a snapshot watchdog fires before the child has opened any .incomplete file, such as a stale dangling symlink from an older snapshot making has_active_incomplete_blobs true during a slow metadata call, _child_open_incomplete_blobs returns an empty set and this stores it as the owned cleanup scope. _default_prepare_for_http treats that empty set as authoritative and skips every stale blob/link, leaving the same broken symlink to retrip the HTTP attempt; use None for the no-owned-partials case or avoid child-owned scoping for snapshot-wide stale state.

Useful? React with 👍 / 👎.

Comment on lines +904 to +908
for shard_rels in by_fmt.get("safetensors") or by_fmt.get("bin") or []:
for shard in shard_rels:
try:
if not (snapshot_dir / shard).exists():
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Let single subfolder weights beat stale indices

When a patterned subfolder warm selects both encoder/model.safetensors and a stale encoder/model.safetensors.index.json, this loop rejects the snapshot as soon as any shard listed by the index is absent. Transformers checks the single safetensors file before the index in that same directory, so this valid warmed subfolder is retried and can end in DownloadStallError; mirror the root-weight precedence here and ignore a selected index when the corresponding single weight of the preferred format is present.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb80cf9a16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if base.endswith(_WEIGHT_FILE_SUFFIXES):
return True
if any(ch in base for ch in _GLOB_CHARS):
return any(fnmatch.fnmatchcase(probe, base) for probe in _WEIGHT_PATTERN_PROBES)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat directory-prefix globs as weight-bearing

When allow_patterns uses a directory-prefix glob without a slash, such as ['checkpoint-*'] or ['unet*'], Hugging Face's filter selects files below matching directories, but this check only compares the final basename to weight-file probes and returns False. That makes request_can_include_weights treat these weight-bearing snapshot requests as weightless, so a stale config-only subset can be accepted and the later checkpoint/component load may fetch the missing weights in-process instead of through the fallback.

Useful? React with 👍 / 👎.

target = Path(os.readlink(link))
if not target.is_absolute():
target = link.parent / target
incomplete_partner = target.with_name(target.name + INCOMPLETE_SUFFIX)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Spare unique incomplete partners for live symlinks

In supported newer huggingface_hub versions, the temporary partial for blob <etag> is named like <etag>.<uuid>.incomplete, not exactly <etag>.incomplete. When default HTTP prep runs without an owned-partial set, a concurrent same-repo sibling can have a dangling snapshot symlink to <etag> plus a fresh unique partial, but this exact lookup misses the active partner and unlinks the sibling's live symlink instead of sparing it, leaving that sibling snapshot vulnerable to being returned with the pointer removed.

Useful? React with 👍 / 👎.

"retrying with HF_HUB_DISABLE_XET=1", label
)
_safe_status(on_status, f"{label}: download crashed, retrying over HTTP")
disable_xet = True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear crashed snapshot partials before HTTP retry

When a snapshot child crashes shortly after creating a partial, this retry path flips to HTTP without recording any child-owned partials. The next loop's HTTP prep treats the just-written .incomplete as a fresh active sibling and spares it, so the snapshot watchdog still sees a constant repo-wide incomplete blob and can declare the HTTP retry stalled during metadata or lock wait even though the crashed child's partial should have been removed before retrying.

Useful? React with 👍 / 👎.

The shared Xet->HTTP fallback does not require a version bump; keep the base version.
Condense docstrings and inline comments across hf_xet_fallback.py, hf_cache_state.py, the __init__.py opt-in block, and the test suite: drop restating-the-obvious comments, collapse multi-line blocks, cut small-helper docstrings to one line. Comments/docstrings only, no code changed (AST-verified). Preserves the load-bearing rationale tersely (STRICT pre-download / LENIENT post-download, the un-killable Xet hang reason, and the scoping / format / variant / from_tf / diffusers / checkpoint / redaction notes). Full suite 191 passed; safety fuzz 0 false-accepts.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Consolidated review pass (12x reviewer.py personas + 3 forked reviewers). 8 of 12
reviewers independently reproduced a post-download false-accept:
_has_diffusers_component_weight returned True as soon as ANY declared component
held a weight, so a stale partial pipeline (model_index.json declaring unet + vae
with only unet present, or vae present with its config but no weight) was
accepted -- then the in-process DiffusionPipeline load fetched the missing
component over un-killable Xet, the exact hang this fallback prevents. Diffusion
is a real loader consumer (diffusion.py warms with allow_patterns=None), so it is
reachable.

An unpatterned diffusers warm now requires that every DECLARED ACTIVE component
(a [library, class] spec with both non-null) is materialised as a non-empty
subfolder AND that each model-style component (one carrying config.json) holds a
readable weight of the read format. A [null, null] disabled component (e.g.
safety_checker) and weightless components (scheduler / tokenizer /
feature_extractor, which carry *_config.json, not config.json) are not required,
so a complete or safety-checker-disabled pipeline is not false-rejected into a
DownloadStallError. A variant load accepts a component's canonical weight as
diffusers' per-component fallback, so a mixed variant/canonical pipeline still
passes. A malformed / empty model_index.json fails OPEN to the prior lenient
any-weight check.

Adds a regression over component-absent, config-only, complete-with-null-safety-
checker, variant-absent, and mixed-variant cases. Full suite 192 passed; safety-
invariant fuzz stays at 0 false-accepts.

Not taken (re-verified, no new evidence):
- Injected prepare_for_http_fn dropping owned_incomplete_blobs: by design; a
  Studio hook owns its marker-based cache accounting (documented at the call site).
- Retryable / crashed-child HTTP prep using the coarse mtime guard: an exited
  child has no open fds to scope owned partials; inherent, not closable.
- Missing hf_hub kwargs / non-import-light default: a purpose-built prefetch
  helper (local_dir breaks the cache model) and an intentional opt-in
  (UNSLOTH_ZOO_DISABLE_GPU_INIT), not a general hf_hub_download replacement.
- Patterned TF/Flax sharded-subfolder edge: documented low-reachability; PRE
  never accepts tf/flax, so it can only ever cause a safe retry.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Follow-up review pass (12x reviewer.py personas + 3 forks) on the diffusers
component-completeness gate. A residual false-accept remained: an active component
dir that is non-empty but carries NEITHER a top-level config.json NOR a readable
weight NOR a weightless sidecar (only a stray file or a nested training-checkpoint)
passed the presence rule and skipped the config-gated weight rule, so a stale
partial mid-download component was accepted and the in-process pipeline load then
fetched its weight over un-killable Xet.

A component now satisfies the warm only if it holds a readable weight of the read
format, OR is WEIGHTLESS-shaped (a scheduler / tokenizer / feature_extractor
sidecar: a *_config.json / tokenizer file and no config.json). A dir with neither
is an incomplete model component and is retried over HTTP. Complete pipelines
(including SDXL dual text encoders, sharded components, fp16 mixed / canonical
fallback) still accept; the safety-invariant fuzz stays at 0 false-accepts and the
suite is 193 passed.

Not changed (re-verified this pass): the variant floor requiring at least one
variant weight is correct -- diffusers itself raises on a zero-variant repo, so a
canonical-only request for a variant is not a load diffusers would complete. The
recurring reviewer.py "stale-branch revert" flag is a harness artifact (GitHub
reports the PR mergeable / clean).
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant