Add shared Xet to HTTP stall fallback (hf_xet_fallback, hf_cache_state)#829
Add shared Xet to HTTP stall fallback (hf_xet_fallback, hf_cache_state)#829danielhanchen wants to merge 82 commits into
Conversation
Hugging Face Xet downloads can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Unsloth Studio already recovers from this; this lands the same logic in unsloth_zoo so both Unsloth main and Studio can share one implementation. - hf_cache_state.py: sparse-aware HF cache primitives (st_blocks based byte accounting, active .incomplete detection) over the active HF_HUB_CACHE. - hf_xet_fallback.py: a 180s no-progress watchdog plus a spawn-child download that keeps Xet primary and falls back to plain HTTP exactly once on a stall. hf_hub_download_with_xet_fallback handles a single file; the new snapshot_download_with_xet_fallback warms a whole repo in a killable child (the entrypoint Unsloth from_pretrained uses before its in-process load), with an in-process local_files_only fast path for a warm cache. Honors an UNSLOTH_DISABLE_XET knob and injectable prepare-for-http / scrub hooks so Studio can pass its marker-aware cache management. Both files are AGPL-3.0 (per-file), matching the Studio source. Tests: CPU-only, no network, no real subprocess (the download seam is faked).
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request introduces sparse-aware Hugging Face cache primitives and a robust fallback mechanism from Xet to HTTP downloads on no-progress stalls, accompanied by a comprehensive test suite. The feedback focuses on enhancing the reliability and robustness of the implementation. Key recommendations include removing a redundant import check, catching broader OSError exceptions during file cleanup to handle permission issues, improving Windows path detection, resetting the watchdog timer during unmeasurable states to prevent false stalls, and adding a small timeout when draining the process queue to avoid race conditions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| from huggingface_hub import constants as hf_constants | ||
| except ImportError: | ||
| return None |
There was a problem hiding this comment.
The try...except ImportError block for huggingface_hub is redundant because huggingface_hub is a required dependency of the project. Importing it directly is cleaner and adheres to the general rules.
from huggingface_hub import constants as hf_constantsReferences
- Avoid using try...except ImportError for libraries that are required dependencies of the project, as the check is redundant.
| try: | ||
| blob.unlink() | ||
| except FileNotFoundError: | ||
| continue |
There was a problem hiding this comment.
If blob.unlink() raises a PermissionError (which is common on Windows or when files are locked), the outer except Exception block will catch it and abort the entire cleanup loop. This leaves other incomplete files untouched. Catching OSError (which covers both FileNotFoundError and PermissionError) inside the loop ensures maximum cleanup effort.
| try: | |
| blob.unlink() | |
| except FileNotFoundError: | |
| continue | |
| try: | |
| blob.unlink() | |
| except OSError: | |
| continue |
| if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id: | ||
| continue |
There was a problem hiding this comment.
Windows absolute paths using forward slashes (e.g., C:/path) or containing drive letters won't be skipped by the current check, potentially leading to unnecessary cache directory lookups. Checking for : (which is never present in valid HF repo IDs) is a robust way to skip Windows absolute paths.
| if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id: | |
| continue | |
| if not repo_id or repo_id.startswith(("/", ".", "~")) or "\\" in repo_id or ":" in repo_id: | |
| continue |
| if state is None: | ||
| if on_heartbeat is not None: | ||
| on_heartbeat(f"Downloading ({transport} transport)...") | ||
| continue |
There was a problem hiding this comment.
If get_hf_download_state returns None due to transient issues, last_change is not updated. If this state persists longer than stall_timeout, the watchdog will immediately trigger a false stall as soon as the state becomes measurable again. Resetting last_change = now during unmeasurable periods avoids this.
| if state is None: | |
| if on_heartbeat is not None: | |
| on_heartbeat(f"Downloading ({transport} transport)...") | |
| continue | |
| if state is None: | |
| last_change = now | |
| if on_heartbeat is not None: | |
| on_heartbeat(f"Downloading ({transport} transport)...") | |
| continue |
| result = result_queue.get_nowait() | ||
| except queue.Empty: | ||
| result = None | ||
| finally: |
There was a problem hiding this comment.
Under heavy CPU/disk load, there can be a slight delay between the child process exiting and the OS flushing the queue's pipe. get_nowait() will fail instantly if the pipe is temporarily empty, whereas a small timeout (e.g., 1.0s) provides robustness against scheduling delays.
| result = result_queue.get_nowait() | |
| except queue.Empty: | |
| result = None | |
| finally: | |
| try: | |
| result = result_queue.get(timeout = 1.0) | |
| except queue.Empty: | |
| result = None |
|
|
||
| def test_get_state_skips_local_paths(hf_cache): | ||
| # Filesystem paths are not HF repo IDs and must be ignored without error. | ||
| assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x"]) == (0, False) |
There was a problem hiding this comment.
Update the test assertion to verify that Windows absolute paths using forward slashes (e.g., c:/x) are also correctly skipped.
| assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x"]) == (0, False) | |
| assert xf.get_hf_download_state(["/abs/path", "./rel", "~user", "c:\\x", "c:/x"]) == (0, False) |
…d sys import - The no-progress watchdog and the HTTP-retry cache purge now honor a caller supplied snapshot cache_dir, not just HF_HUB_CACHE, so a stall under a custom cache is detected and the partial is purged before the HTTP fallback. Threaded through hf_cache_root / iter_active_repo_cache_dirs / has_active_incomplete_blobs (hf_cache_state) and get_hf_download_state / start_watchdog / _default_prepare_for_http / _run_download_attempt (hf_xet_fallback). - Remove the unused 'import sys' (ruff F401). - Add a regression test that a stall under a custom cache_dir is watched and cleaned.
… on file path - Critical: the HTTP retry could still use Xet. The spawn child re-imports the (heavy) unsloth_zoo package -- importing huggingface_hub, which reads HF_HUB_DISABLE_XET into a module constant at import time -- BEFORE the child body ran, so a child-side os.environ assignment landed too late. Now set the transport env in the parent (under a lock) around proc.start() so the child inherits it from creation; the child still sets it defensively. Tests assert the child inherits HF_HUB_DISABLE_XET=1 on the HTTP retry and is left untouched on the Xet attempt. - hf_hub_download_with_xet_fallback now accepts cache_dir (symmetric with the snapshot variant); threaded into try_to_load_from_cache and the child download.
|
@codex review |
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8af1b5debd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| _CTX = mp.get_context("spawn") |
There was a problem hiding this comment.
Avoid forcing spawn for library downloads
When this helper is called from the normal unguarded top-level model-loading scripts that FastModel.from_pretrained will use, the spawn context re-imports the user's __main__ module in the child before _download_child_entry runs. That re-executes the same top-level download/model load and typically hits multiprocessing's bootstrap RuntimeError or exits without putting a queue result, so the wrapper reports a download error instead of downloading. Use an entry point that does not import the caller's main module, or avoid forced spawn where safe.
Useful? React with 👍 / 👎.
| # The generic purge is cache_dir-aware; an injected (Studio) hook owns | ||
| # its own cache accounting and keeps the (repo_type, repo_id) signature. |
There was a problem hiding this comment.
Abort when HTTP prep cannot remove unsafe partials
If _default_prepare_for_http or an injected prep hook fails after a Xet stall, this handler only logs at debug level and still starts the HTTP retry. In the locked-file/permission-error cases where the .incomplete sparse partial is still present, the retry can resume over exactly the unsafe partial the preceding comment says can silently corrupt the blob, leaving a bad model in cache instead of failing safely.
Useful? React with 👍 / 👎.
| return (blobs_dir.is_dir() and _blob_dir_is_partial(blobs_dir)) or ( | ||
| _repo_dir_has_broken_snapshot_symlinks(repo_dir) | ||
| ) |
There was a problem hiding this comment.
Don't treat stale snapshot gaps as active downloads
In a cache that already has a broken snapshot symlink from a previous interrupted or evicted download but no *.incomplete file, this still reports has_incomplete=True. The watchdog is supposed to reset while no active partial exists, but this state starts the stall timer before the new attempt has written anything and can kill/retry otherwise healthy downloads that spend stall_timeout on metadata or lock waits.
Useful? React with 👍 / 👎.
| for blob in blobs_dir.iterdir(): | ||
| if blob.is_file() and blob.name.endswith(INCOMPLETE_SUFFIX): | ||
| try: | ||
| blob.unlink() |
There was a problem hiding this comment.
Limit HTTP cleanup to this stalled attempt
When two files from the same repo are downloading concurrently, a stall in one wrapper call enters the HTTP retry and this loop unlinks every *.incomplete blob in the repo cache, including a sibling file that is still actively making progress. That can make the other download fail or lose its partial even though it did not stall, so the cleanup needs to be scoped or locked to the attempt being retried.
Useful? React with 👍 / 👎.
- prepare_for_http: continue past a locked/permission-denied partial
(catch OSError, not just FileNotFoundError) so one bad blob does not
abort cleanup of the rest.
- get_hf_download_state: also skip drive-letter (":") repo ids so Windows
absolute paths like C:/models are not mistaken for Hub ids.
- watchdog: reset the no-progress timer on an unmeasurable tick so a long
unmeasurable gap cannot trip a false stall the instant state is readable.
- result drain: get(timeout=1.0) instead of get_nowait() so a child that
exits microseconds before its queue feeder flushes is not misreported as
'exited without a result'.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6266d0811d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return _download_with_xet_fallback( | ||
| repo_id = repo_id, | ||
| label = repo_id, | ||
| kind = "snapshot", |
There was a problem hiding this comment.
Preserve local model paths before Hub fallback
When this warm-up helper receives a filesystem model path (for example /mnt/model), the local_files_only probe above raises HFValidationError and this fall-through sends the path into the child snapshot_download, which rejects local paths instead of letting a normal from_pretrained('/mnt/model') proceed. Since the docstring says this is the from_pretrained cache warm-up path, guard or return existing local directories before entering the Hub download loop.
Useful? React with 👍 / 👎.
| out = re.sub(r"hf_[A-Za-z0-9]{8,}", "***", out) | ||
| out = re.sub(r"([Bb]earer\s+)[A-Za-z0-9._\-]+", r"\1***", out) |
There was a problem hiding this comment.
Redact signed URL query credentials
When running outside Studio, child exceptions fall back to this default scrubber, but these patterns only cover hf_ tokens and Bearer headers. If an HTTP/Xet exception includes the resolved presigned blob URL (for example a 403/timeout message with ...?X-Amz-Signature=...), the signed query string is returned to the parent and can be logged with the raised RuntimeError, despite the child-entry contract saying signed URLs are not exposed; please strip query strings or known signing parameters here as well.
Useful? React with 👍 / 👎.
| cache_dir = cache_dir, | ||
| allow_patterns = allow_patterns, | ||
| ignore_patterns = ignore_patterns, | ||
| local_files_only = True, |
There was a problem hiding this comment.
Preserve Hub freshness checks for moving revisions
When revision is omitted or is a branch/tag like main, this local_files_only fast path returns any cached snapshot without the normal Hub ref/etag refresh. After a model is updated upstream, snapshot_download_with_xet_fallback can keep serving the stale cached snapshot indefinitely instead of matching snapshot_download/from_pretrained semantics; restrict this no-network shortcut to immutable commit hashes or offline mode, or perform the refresh in the child.
Useful? React with 👍 / 👎.
| os.environ.update(child_env) | ||
| proc.start() |
There was a problem hiding this comment.
Avoid leaking temporary Xet env to sibling downloads
During an HTTP retry, this mutates process-global os.environ to HF_HUB_DISABLE_XET=1 while spawning the child. The lock only serializes writers in this helper; a concurrent download can call xet_force_disabled() or import huggingface_hub during this window and incorrectly force/cache HTTP for its own first attempt, so multi-model loads can lose Xet because a sibling stalled. Use a spawn bootstrap that does not expose the flag process-wide, or guard all reads/imports with the same lock.
Useful? React with 👍 / 👎.
| target = target_dir_name(repo_type, repo_id) | ||
| try: | ||
| for entry in root.iterdir(): | ||
| if entry.name.lower() == target: |
There was a problem hiding this comment.
Match Hub cache dirs case-sensitively
Hugging Face accepts uppercase repo IDs and its cache folder naming preserves case (for example models--Foo--Bar and models--foo--bar are distinct), but this lowercased comparison makes both directories match either repo. In a cache containing repos that differ only by case, the watchdog and HTTP cleanup can count or unlink the other repo's active partials, causing false stalls or cross-repo download disruption; compare against the exact repo_cache_dir_name instead.
Useful? React with 👍 / 👎.
| if isinstance(cached, str) and os.path.exists(cached): | ||
| return cached |
There was a problem hiding this comment.
Honor cancellation before cached fast paths
If cancel_event is already set and this file is cached, the helper returns the cached path here before reaching _download_with_xet_fallback's cancellation check, contradicting its documented RuntimeError("Cancelled") behavior. In cancellation-driven model loads this can continue into expensive local loading after the caller has already aborted; check the event before the cache short-circuit (and similarly before the snapshot fast path).
Useful? React with 👍 / 👎.
|
@codex review |
1 similar comment
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6266d0811d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| _CTX = mp.get_context("spawn") |
There was a problem hiding this comment.
Avoid spawn re-importing unguarded training scripts
When this helper is used from FastModel.from_pretrained in a regular Python training script that calls it at module top level, the spawn context re-imports and executes the user's __main__ in the child before _download_child_entry runs. That re-enters the download during multiprocessing bootstrapping and Python aborts with the usual safe-import RuntimeError, so the child exits without warming the cache or reaching the fallback. Please launch a dedicated helper subprocess that only imports this module, or avoid spawn on POSIX for this path.
Useful? React with 👍 / 👎.
| except OSError: | ||
| # A locked / permission-denied blob (common on Windows) | ||
| # must not abort cleanup of the rest of the partials. | ||
| continue |
There was a problem hiding this comment.
Abort HTTP fallback when cleanup cannot remove partials
When a stalled .incomplete is locked or permission-denied, this except leaves it in place and reports success to the caller. The fallback then launches HTTP against the same cache even though this function documents that HTTP resume over a sparse Xet/hf_transfer partial can corrupt the blob, so Windows locks or restricted caches can turn a cleanup failure into a corrupted model file instead of a safe failure.
Useful? React with 👍 / 👎.
| for blob in blobs_dir.iterdir(): | ||
| if blob.is_file() and blob.name.endswith(INCOMPLETE_SUFFIX): | ||
| try: | ||
| blob.unlink() |
There was a problem hiding this comment.
Limit cleanup to the stalled attempt
When two files from the same repo are downloaded concurrently and only one child stalls, this repo-wide loop removes every *.incomplete, including a sibling file that another live child is still writing. On Unix that can unlink the active writer's target and on Windows it collides with the locked-file path above, so the single-file fallback is not independent under parallel shard downloads; cleanup needs to target only the stalled attempt's partials or coordinate at repo scope.
Useful? React with 👍 / 👎.
| if hf_token: | ||
| out = out.replace(hf_token, "***") |
There was a problem hiding this comment.
Handle boolean Hugging Face tokens when scrubbing
Hugging Face download APIs accept token=True to use the saved token, and callers commonly pass that through from from_pretrained. If the child then raises a deterministic error, this path calls str.replace with a bool, causing a TypeError inside the exception handler before result_queue.put; the parent then reports that the process exited without a result instead of the real 401/404/network error. Guard the replacement with isinstance(hf_token, str).
Useful? React with 👍 / 👎.
| with _SPAWN_ENV_LOCK: | ||
| saved_env = {k: os.environ.get(k) for k in child_env} | ||
| try: | ||
| os.environ.update(child_env) | ||
| proc.start() |
There was a problem hiding this comment.
Keep temporary Xet env out of other threads
During an HTTP retry this mutates process-wide os.environ while the child is started, but other threads are not protected by _SPAWN_ENV_LOCK; for example, another concurrent helper can call xet_force_disabled() during this window and incorrectly decide to skip Xet and run HTTP cleanup for an unrelated download. Pass the environment to an isolated helper process or keep the transport decision and spawn under a synchronization scheme that other callers also use.
Useful? React with 👍 / 👎.
…ce_download (review 3) - spawn from notebook / stdin / python -c / unguarded script: multiprocessing 'spawn' re-opens __main__.__file__, which is missing or '<stdin>' there, so proc.start() raised before the child ran and the download never happened. Point __main__ at this importable module just for the spawn (serialized under the existing lock) and restore it. Verified from a real <stdin> context: the child now runs and returns the expected HF error instead of failing to spawn. - _default_prepare_for_http only deletes from an exact-case cache dir, or a single unambiguous case-insensitive match, so preparing HTTP for Org/Repo no longer purges a case-colliding org/repo on a case-sensitive filesystem. - HTTP prep now also clears broken snapshot symlinks, which the incomplete-state detector counts as active; otherwise the HTTP retry inherited stale state and re-tripped the watchdog. - snapshot/file fallbacks accept and forward force_download, skipping the warm cache short-circuit so force_download=True re-fetches in the killable child. - Added regression tests for all four (30 passed, 1 skipped).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2619c825f8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
|
|
||
| def repo_cache_dir_name(repo_type: str, repo_id: str) -> str: | ||
| return f"{repo_type}s--{repo_id.replace('/', '--')}" |
There was a problem hiding this comment.
Normalize repo_type=None before cache scans
When callers forward Hugging Face's default repo_type=None for a model download, the child hf_hub_download/snapshot_download normalizes it to "model" and writes under models--..., but the watchdog and HTTP-prep cache probes use this formatter first and look for Nones--... instead. In that scenario an actual stalled model download never exposes has_incomplete=True, so the Xet child is not killed and the HTTP fallback this helper is meant to provide does not run; normalize None to "model" before building cache directory names.
Useful? React with 👍 / 👎.
| for blob in blobs_dir.iterdir(): | ||
| try: | ||
| if blob.is_file(): | ||
| total += blob_bytes_present(blob) |
There was a problem hiding this comment.
Scope single-file progress to the watched file
When two hf_hub_download_with_xet_fallback calls download different files from the same repo, this total includes every blob in the repo cache. A stalled child sees a sibling's growing .incomplete as progress and keeps resetting its stall timer, so it will not retry over HTTP after stall_timeout until all other repo downloads stop; large parallel shard downloads can therefore hang on the stuck shard much longer than configured.
Useful? React with 👍 / 👎.
| cache_dir = cache_dir, | ||
| allow_patterns = allow_patterns, | ||
| ignore_patterns = ignore_patterns, | ||
| local_files_only = True, |
There was a problem hiding this comment.
Do not treat partial snapshots as warm caches
When a previous interrupted download left a snapshot directory for this revision with only some files linked and no active .incomplete, Hugging Face's local_files_only path returns the snapshot folder as soon as it exists and does not verify that every allow_patterns file is present. This fast path then skips the supervised child entirely, so the later model load either fails on missing files or downloads them in-process, defeating the Xet stall fallback for exactly the partial-cache case it is meant to recover.
Useful? React with 👍 / 👎.
…safe (review 4) - repo_type=None (Hugging Face's default model repo) resolved to a bogus Nones--<id> cache dir, so get_hf_download_state/the watchdog missed the real models--<id> partial and could let a stalled Xet child run forever. Normalize repo_type to 'model' in repo_cache_dir_name and at the public entrypoints. - The destructive HTTP-prep path was case-collision safe but the read/watchdog path still yielded every case-insensitive match, so a stale partial in a colliding org/repo could make the watchdog kill an active Org/Repo download. Move the exact-or-unambiguous guard into iter_active_repo_cache_dirs so both paths share one rule, and drop the now-redundant _destructive_repo_cache_dirs. - Added regression tests for both (32 passed, 1 skipped).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dc0fddb108
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| import time | ||
| from typing import Any, Callable, Optional | ||
|
|
||
| from unsloth_zoo.hf_cache_state import ( |
There was a problem hiding this comment.
Keep the fallback importable without full Unsloth init
This new module is meant to be shared with Studio/standalone callers, but a normal import unsloth_zoo.hf_xet_fallback executes unsloth_zoo/__init__.py before this submodule can load; that pre-existing init raises at unsloth_zoo/__init__.py:200-202 when the full unsloth package is not installed and also performs heavy Torch/GPU setup otherwise. The tests only avoid this by loading the file via spec_from_file_location, so CPU/Studio workers that import the documented package path can fail before any fallback code runs; expose this through a lightweight import path or make the package init skip heavy checks for these cache/download submodules.
Useful? React with 👍 / 👎.
| if proc.is_alive(): | ||
| _signal_group(getattr(signal, "SIGKILL", signal.SIGTERM)) | ||
| proc.join(timeout = 5.0) |
There was a problem hiding this comment.
Kill the group even after the wrapper exits
When the Xet transfer has spawned a helper in the child's process group and the Python wrapper exits on SIGTERM before that helper does, this guard skips the SIGKILL pass because proc.is_alive() is already false. The surviving helper can keep the stalled writer alive while the parent starts HTTP cleanup/retry, so the fallback should still send the post-grace SIGKILL to the process group (or otherwise reap the group) even if the leader has exited.
Useful? React with 👍 / 👎.
| return | ||
|
|
||
| if on_heartbeat is not None: | ||
| on_heartbeat(f"Downloading ({transport} transport)...") |
There was a problem hiding this comment.
Isolate status hook failures from the watchdog
If a caller-supplied on_status/heartbeat hook raises here (for example because a Studio client disconnected), the exception terminates the daemon watchdog thread before it can set the stalled event. In that environment a genuinely hung Xet child can remain supervised only by the parent poll loop and never trigger the HTTP retry, so nonessential status callbacks should be wrapped/logged instead of being allowed to kill stall detection.
Useful? React with 👍 / 👎.
| if cache_dir is not None: | ||
| root = Path(cache_dir) |
There was a problem hiding this comment.
Normalize cache_dir before scanning it
When callers pass a custom cache_dir that needs user expansion or resolution, such as ~/hf-cache, Hugging Face expands and resolves that path before writing, but this cache-state code scans the literal Path(cache_dir). In that scenario the child writes the expanded cache while the watchdog reports (0, False) and HTTP prep misses the partial, so a stalled Xet download under that custom cache never triggers or cleans up the fallback.
Useful? React with 👍 / 👎.
…(review 4 P2) - _terminate_process_group always sends the post-grace SIGKILL to the whole process group, even when the Python leader already exited on SIGTERM, so a surviving Xet helper cannot keep the stalled writer alive during HTTP cleanup. - Watchdog status/heartbeat callbacks are wrapped: a raising on_status (e.g. a disconnected Studio client) no longer kills the daemon thread and stops stall detection for a genuinely hung child. - hf_cache_root expands ~ in a custom cache_dir (as huggingface_hub does on write), so a stall under e.g. ~/hf-cache is still seen and cleaned. - Added regression tests (34 passed, 1 skipped).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27c2795880
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if on_status is not None: | ||
| on_status(f"{label}: Xet stalled, retrying over HTTP") |
There was a problem hiding this comment.
Guard the retry status callback
When a stalled Xet attempt reaches the HTTP fallback, this direct on_status(...) call can raise (for example if a Studio client disconnected) and abort the retry path before disable_xet is set for the next attempt. The heartbeat callback is now isolated with _safe_status, but this remaining direct call still lets a nonessential status hook turn an otherwise recoverable Xet stall into a failed download.
Useful? React with 👍 / 👎.
| def hf_hub_download_with_xet_fallback( | ||
| repo_id: str, | ||
| filename: str, | ||
| token: Optional[str], |
There was a problem hiding this comment.
Default the optional token argument
When callers use this single-file helper like Hugging Face's hf_hub_download for a public repo and omit token, the current signature raises TypeError before it can hit the cache or start the supervised download. The snapshot helper already defaults token=None; doing the same here keeps unauthenticated single-file downloads from requiring a dummy third argument.
Useful? React with 👍 / 👎.
The Xet fallback runs the download in a spawn child -- a fresh interpreter that re-imports unsloth_zoo.hf_xet_fallback and so re-runs unsloth_zoo/__init__.py, which on any non-MLX host imports torch + transformers and does device init the child never uses (~5-7s per download child). The cache/download submodules only need stdlib + huggingface_hub. Reuse the existing MLX lightweight-import path: __init__ now also honors an opt-in UNSLOTH_ZOO_DISABLE_GPU_INIT=1 (off by default, so normal CUDA/CPU runs are byte-for-byte unchanged), and the parent sets it in the child's environment only around spawning the download child (restored immediately, like the transport env). The unconditional HF cache redirect still runs, so the child writes the same cache. MLX stub injection is now gated on actual MLX, not the generic skip flag. Verified: normal import still runs full init (DEVICE_TYPE=cuda, torch+transformers imported); with the flag the helper imports with torch/transformers absent; a real download through the spawn child completes with zero heavy-init banners (child stays light) and the parent env is restored. 35 passed, 1 skipped.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e93e669ee6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if main_module is not None: | ||
| main_file = getattr(main_module, "__file__", None) | ||
| if not main_file or str(main_file).startswith("<"): | ||
| saved_main_file = main_file | ||
| main_module.__file__ = __file__ |
There was a problem hiding this comment.
Avoid importing caller scripts in spawn children
When this wrapper is invoked from a regular Python script at module top level, __main__.__file__ is a real path, so this branch leaves it unchanged. The spawn child then imports that user script as __mp_main__, re-executes the top-level download/from_pretrained call during multiprocessing bootstrap, and exits with the standard “start a new process before bootstrapping” error; the parent then reports exited ... without a result instead of downloading. This breaks the common unguarded script pattern the comment says this workaround should handle, so the helper should avoid re-opening the caller script even when __file__ is a normal path.
Useful? React with 👍 / 👎.
| cache_dir = cache_dir, | ||
| allow_patterns = allow_patterns, | ||
| ignore_patterns = ignore_patterns, | ||
| local_files_only = True, |
There was a problem hiding this comment.
Verify cached snapshots before the fast path
When a previous snapshot download was interrupted after Hub wrote refs/<branch> and created snapshots/<sha>, snapshot_download(..., local_files_only=True) returns that directory without checking that all allowed files are present. This fast path therefore skips the killable child even while the cache is partial or has stale incomplete state, so from_pretrained can proceed with missing model files instead of completing the download; validate the snapshot or active incomplete state before returning here.
Useful? React with 👍 / 👎.
| # spawning that child, never for a training/inference process. The | ||
| # unconditional HF cache redirect above still runs, so the child writes to the | ||
| # same cache as the parent. | ||
| _SKIP_GPU_INIT = os.environ.get("UNSLOTH_ZOO_DISABLE_GPU_INIT", "0") == "1" |
There was a problem hiding this comment.
Allow the shared downloader to import standalone
When Studio or a lightweight utility imports unsloth_zoo.hf_xet_fallback directly in an environment without the unsloth package, package import still leaves _SKIP_GPU_INIT false unless the caller already set this child-only env var, so the later find_spec("unsloth") guard raises before the new shared module can load. That makes the advertised standalone/Studio import path unusable unless consumers know a private pre-import knob; select the lightweight path for these cache/download modules without requiring that external setup.
Useful? React with 👍 / 👎.
| if on_status is not None: | ||
| on_status(f"{label}: Xet stalled, retrying over HTTP") |
There was a problem hiding this comment.
Isolate stall status callback failures
When a transfer stalls, this direct on_status call runs outside _safe_status. If a UI/status callback raises at that moment (for example a disconnected Studio client, which the watchdog heartbeat path already tolerates), the exception escapes before disable_xet = True, so the HTTP retry is skipped and a recoverable Xet stall fails; wrap this notification with the same safe callback handling used for heartbeats.
Useful? React with 👍 / 👎.
| blob.unlink() | ||
| except OSError: |
There was a problem hiding this comment.
Do not resume HTTP when cleanup failed
When a stalled Xet partial cannot be removed (for example a locked file on Windows or a permission error), this branch suppresses the unlink failure and the caller still proceeds to the HTTP retry. The surrounding comments note that resuming HTTP over a sparse Xet/hf_transfer partial can silently corrupt the blob, so the retry should not continue unless the unsafe partial was actually cleared.
Useful? React with 👍 / 👎.
| if hf_token: | ||
| out = out.replace(hf_token, "***") |
There was a problem hiding this comment.
Handle boolean HF tokens while scrubbing errors
Hugging Face callers commonly pass token=True to use the configured token; if such a download fails in the child, this scrubber treats the boolean as a replacement string and raises TypeError, so the child exits without putting the real auth/network error on the queue and the parent reports a misleading “exited without a result”. Guard this replacement to string tokens only.
Useful? React with 👍 / 👎.
…ety, robust callbacks/tokens P1: - Spawn child now always repoints __main__.__file__ at this side-effect-free module (not only for stdin/notebook callers). An unguarded top-level script (python script.py with a bare from_pretrained) previously had its file re-imported as __mp_main__ in the child, re-running the download and failing with the multiprocessing bootstrapping error; now the child imports this module instead and downloads. Verified end-to-end with a real unguarded script. - snapshot fast path no longer short-circuits a cached-but-incomplete snapshot (an interrupted download leaving .incomplete blobs / broken symlinks): it checks has_active_incomplete_blobs and completes the download in the killable child rather than letting from_pretrained load missing files. P2: - Xet to HTTP retry status callback wrapped in _safe_status so a raising on_status (disconnected client) cannot abort the recoverable retry. - hf_hub_download_with_xet_fallback defaults token to None (parity with hf_hub_download). - If an unsafe partial cannot be cleared before HTTP, force a clean re-download instead of an unsafe resume over a sparse partial. - _default_scrub_secrets tolerates token=True (no TypeError in the child scrubber). - Docstring: state the package imports with unsloth installed; the spawn child uses the UNSLOTH_ZOO_DISABLE_GPU_INIT lightweight path. 41 passed, 1 skipped.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2300e5580
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| repo_type = repo_type, | ||
| token = token, | ||
| revision = params.get("revision"), | ||
| cache_dir = params.get("cache_dir"), |
There was a problem hiding this comment.
Normalize snapshot cache_dir before spawning
When snapshot_download_with_xet_fallback is called with a cache_dir containing ~, the watchdog and HTTP cleanup scan Path(cache_dir).expanduser() via hf_cache_root, but the child passes the original value to snapshot_download here. In the supported huggingface_hub code path, snapshot_download builds its storage folder from that string without expanding it, so a stalled snapshot download writes under the literal ~/... path while the parent watches $HOME/... and never trips the Xet fallback; normalize the cache_dir once and use the same value for both the child and cache-state probes.
Useful? React with 👍 / 👎.
The snapshot fast path and the incomplete-state detector only inspected the newest snapshot by mtime. A caller requesting an older revision whose snapshot had a dangling symlink (an interrupted download) while a newer revision was clean would read the repo as healthy and load with missing files. - hf_cache_state.py: add snapshot_dir_has_broken_symlinks(snapshot_dir) and make the broken-symlink detector iterate every snapshot, not just the latest. - hf_xet_fallback.py: validate the exact returned revision dir in the local_files_only fast path; clear broken symlinks across all snapshots during HTTP prep; write the forced-stall test partial under the caller's cache_dir and the repo_type-correct dir name so the watchdog actually sees it. - tests: add regression coverage for the per-snapshot primitive, broken older snapshot detection, fast-path rejection of a broken requested revision, and all-snapshot symlink cleanup.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a04a983576
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if len(entries) <= 1: | ||
| return entries |
There was a problem hiding this comment.
Reject folded-only cache matches
When downloading Org/Repo on a case-sensitive cache that only has models--org--repo from another repo/call, this branch returns that sole folded match because len(entries) <= 1. The watchdog/HTTP prep for Org/Repo then counts or deletes the lowercase repo's .incomplete partial before HF has created the exact models--Org--Repo directory, so a stale case-colliding cache can still disrupt the wrong download. Fresh evidence in the current patch is that the new exact-match helper still accepts a folded-only match here; it should require entry.name == target for Hub cache dirs.
Useful? React with 👍 / 👎.
| if ( | ||
| not snapshot_dir_has_broken_symlinks(Path(cached_dir)) | ||
| and not has_active_incomplete_blobs(repo_type, repo_id, cache_dir = cache_dir) | ||
| ): |
There was a problem hiding this comment.
Verify absent snapshot files before returning cache
When an interrupted snapshot wrote refs/<rev> and the snapshot directory but died before creating a pointer/symlink for one allowed file, snapshot_download(..., local_files_only=True) returns the snapshot folder; these checks pass because there is no dangling symlink and no remaining .incomplete. This still short-circuits to a partial snapshot, so later from_pretrained can fail or download in-process instead of under the fallback. Fresh evidence is that the new guard only validates broken symlinks/partials, not missing expected files.
Useful? React with 👍 / 👎.
| if main_module is not None: | ||
| saved_main_file = getattr(main_module, "__file__", _UNSET) | ||
| main_module.__file__ = __file__ |
There was a problem hiding this comment.
When the caller is launched with python -m train or any __main__ that has __spec__.name, multiprocessing's spawn preparation uses that module name in preference to __file__. This assignment therefore does not repoint the child; it re-imports the user's module as __mp_main__, re-running top-level from_pretrained and hitting the bootstrapping RuntimeError instead of reaching _download_child_entry. Clear or replace __spec__ as well, or avoid spawn importing __main__.
Useful? React with 👍 / 👎.
| def snapshot_download_with_xet_fallback( | ||
| repo_id: str, | ||
| *, | ||
| revision: Optional[str] = None, | ||
| token: Optional[str] = None, |
There was a problem hiding this comment.
When a from_pretrained caller sets local_files_only=True (common for offline runs), this wrapper has no way to receive or forward that setting. If the snapshot is not fully cached, it falls through to _download_with_xet_fallback and starts a network-capable child download instead of matching Hugging Face/Transformers local-only semantics, or a caller that forwards the kwarg gets TypeError; add and honor local_files_only before spawning.
Useful? React with 👍 / 👎.
| cached = try_to_load_from_cache( | ||
| repo_id, filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir | ||
| ) |
There was a problem hiding this comment.
Expand cache_dir before probing file cache
When cache_dir is passed as ~/hf-cache, a finalized file already present in the expanded cache is missed here because try_to_load_from_cache joins the string directly, while the child hf_hub_download expands/resolves it before writing. Warm single-file loads then spawn a child and may perform Hub metadata/network work instead of the documented no-child/no-network cache short-circuit; normalize the cache dir before this probe and reuse the same value for the attempt.
Useful? React with 👍 / 👎.
|
/gemini review |
|
One of the four is real and fixed in 61a629a; the other three I did not take (the license one per maintainer direction). Fixed:
Not taken:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61a629a050
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cached = try_to_load_from_cache( | ||
| repo_id, probe_filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir | ||
| ) | ||
| if isinstance(cached, str) and os.path.exists(cached): | ||
| return cached |
There was a problem hiding this comment.
Don't skip online checks for mutable file revisions
When revision is omitted or is a branch/tag, try_to_load_from_cache only resolves the locally cached ref and this returns without contacting the Hub. Unlike hf_hub_download, this bypasses the online HEAD/ETag check for mutable refs, so after main or a tag moves callers can keep receiving a stale file instead of refreshing it under the supervised child; restrict this fast path to immutable commit hashes or perform the ref check in the child.
Useful? React with 👍 / 👎.
| return all( | ||
| any(candidate in present for candidate in candidates) for candidates in groups.values() | ||
| ) |
There was a problem hiding this comment.
Treat exact allow patterns as filters after download
When allow_patterns enumerates optional exact files, such as ['tokenizer.json', 'tokenizer_config.json'] for a repo that only publishes one of them, snapshot_download succeeds because the allow list filters the repo contents rather than requiring every pattern to exist. This all(...) turns that valid subset into an incomplete result, so the wrapper retries over HTTP and then raises DownloadStallError even though both transports produced the complete filtered snapshot.
Useful? React with 👍 / 👎.
|
Neither of these two is a bug I'm taking; reasoning below.
|
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61a629a050
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if "pytorch_model.bin" in root_files and _format_kept("pytorch_model.bin"): | ||
| return True |
There was a problem hiding this comment.
Require safetensors before skipping supervised download
When a previous cache contains only pytorch_model.bin for a repo that also publishes model.safetensors, this returns True and lets snapshot_download_with_xet_fallback hand the cached snapshot back without running the killable download child. A subsequent default Transformers load probes model.safetensors before falling back to .bin, so it can download the missing preferred file through the normal in-process Hub path and reintroduce the unkillable Xet hang this fallback is meant to avoid. Since the local cache alone cannot prove the safetensors file is absent remotely, a bin-only snapshot should not satisfy the conservative fast path.
Useful? React with 👍 / 👎.
| rels.append(rel) | ||
| except OSError: | ||
| return False | ||
| return bool(_filter_paths(rels, None, ignore_patterns)) |
There was a problem hiding this comment.
Require declared Diffusers components before accepting
When snapshot_download returns a stale Diffusers snapshot after a transient Hub error, this accepts the result as soon as any declared component weight is present. For example, if model_index.json declares weight-bearing components like unet and vae but the cache only contains unet/diffusion_pytorch_model.safetensors, _download_result_usable returns true and DiffusionPipeline.from_pretrained can then fetch the missing component weights in-process, potentially through the unkillable Xet path the fallback is trying to avoid. The post-download check should reject partial declared component sets instead of treating one component as proof that the pipeline is warm.
Useful? React with 👍 / 👎.
| continue | ||
| try: | ||
| if entry.is_file(): | ||
| weights.append(entry.relative_to(snapshot_dir).as_posix()) |
There was a problem hiding this comment.
Reject noncanonical weights for broad patterned warms
When callers use broad allow patterns such as ['*'] or ['*.safetensors'] to warm a default model, a stale snapshot containing only a selected noncanonical weight (for example adapter_model.safetensors, consolidated.safetensors, or a GGUF file) is appended here and then makes _download_result_usable return true. A subsequent default from_pretrained does not read those files as the base model weight and can fetch the missing canonical model.safetensors/pytorch_model.bin in-process over Xet, so broad patterned requests need the same canonical-weight guard used by the unpatterned path.
Useful? React with 👍 / 👎.
…fetensors is preferred A default transformers load probes model.safetensors BEFORE pytorch_model.bin, but the pre-download fast path treated a cache holding only pytorch_model.bin as complete and handed it back without the killable child. For a repo that ALSO publishes model.safetensors (which the local cache cannot rule out), the in-process load then fetches the preferred model.safetensors over un-killable Xet -- the exact hang this fallback exists to prevent. The strict pre gate (_cache_can_skip_download) now passes prefer_safetensors=True to snapshot_dir_is_complete: a .bin weight satisfies the fast path only when safetensors is IGNORED (use_safetensors=False reads .bin), otherwise a bin-only cache defers to the child, which does the authoritative Hub check. A safetensors cache, a dual-format cache, and a use_safetensors=False bin cache all still fast-path, so the common load is unaffected. The flag is pre-only. The POST shard check (_has_incomplete_canonical_root_shards -> snapshot_dir_is_complete) leaves it False, so a finished bin-only download -- a genuinely bin-only repo -- is still accepted and never false-rejected into a DownloadStallError (POST stays lenient; a bin-only cache is not positive breakage evidence there). Adds a regression test (184 -> 185). Verified against the safety-invariant fuzz (0 false-accepts) and the stall->HTTP e2e sim. Not taken (same review): - Require ALL declared Diffusers components: a static weight-bearing-component heuristic false-rejects a complete download when a component is disabled at the repo level (safety_checker: [null, null]) or optional; POST stays lenient and one declared component weight already separates a real warm from a config-only / undeclared-leftover stale cache. - Reject noncanonical weights for broad patterned warms: a base load passes allow_patterns=None (routed through the canonical-scoped _root_model_has_weight, which this change makes safetensors-aware), never a broad ['*'] / ['*.safetensors'] glob; a broad glob's intent is ambiguous (adapter / subfolder / diffusers), so requiring a canonical root weight would false-reject those legitimate broad-glob downloads.
|
Thanks for the review. Addressed in Fixed (finding 1, real strict-PRE hang): bin-only cache when safetensors is preferred. Not taken (finding 2): require all declared Diffusers components. Not taken (finding 3): reject noncanonical weights for broad patterned warms. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26bd3d56f9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| rels.append(rel) | ||
| except OSError: | ||
| return False | ||
| return bool(_filter_paths(rels, None, ignore_patterns)) |
There was a problem hiding this comment.
Require missing Diffusers components before accepting cache
When model_index.json declares multiple weight-bearing components, this accepts the snapshot as soon as any declared component weight is present. For example, a stale cache from a prior vae/*-only download with unet also declared but absent passes _download_result_usable, so the later DiffusionPipeline.from_pretrained still has to fetch the missing unet weights in-process over Xet, bypassing the stall fallback this code is meant to enforce. Please verify the declared weight-bearing components needed by the pipeline, not just that one declared component has a weight.
Useful? React with 👍 / 👎.
| # a numbered shard of the read (preferred) format with NO index in its directory: the load cannot | ||
| # enumerate the set and would fetch the index + remaining shards over Xet. | ||
| preferred = "safetensors" if "safetensors" in fmts else "bin" | ||
| if preferred not in index_fmts.get(dir_rel, set()): |
There was a problem hiding this comment.
Match shard indexes to the selected shard set
This check keys index_fmts only by directory and format, so any safetensors index in the same directory can mask a selected lone shard even when it belongs to a different weight. For example, with allow_patterns=['adapter_model*'], a cache containing adapter_model-00001-of-00002.safetensors but no adapter_model.safetensors.index.json is accepted if an unrelated safetensors index is present, and the subsequent load will fetch the missing adapter index/shards in-process over Xet instead of using the fallback.
Useful? React with 👍 / 👎.
| if not allow or any(not isinstance(p, str) or _has_glob(p) for p in allow): | ||
| return True |
There was a problem hiding this comment.
Require exact files even when allow_patterns has globs
When an allow list mixes a glob with exact required files, this returns early and never verifies the exact names. A common adapter warm like allow_patterns=['adapter_model*', 'adapter_config.json'] is accepted with only adapter_model.safetensors on disk, so the later PEFT load has to fetch the missing adapter_config.json in-process instead of going through the watched fallback.
Useful? React with 👍 / 👎.
| continue | ||
| except OSError: | ||
| return False | ||
| return bool(_filter_paths(weights, allow_patterns, ignore_patterns)) |
There was a problem hiding this comment.
Distinguish root model globs from adapter weights
For a root-level glob such as allow_patterns=['*.safetensors'], this treats any selected loadable weight as satisfying the default model load. If a stale cache contains only adapter_model.safetensors, _download_result_usable returns true even though from_pretrained will still probe model.safetensors / pytorch_model.bin and fetch the missing base weight in-process over Xet; root-model glob requests need to require a canonical root weight rather than any selected weight.
Useful? React with 👍 / 👎.
| # No psutil / /proc: fall back to following only newly-created (post-baseline) partials. | ||
| owned = {name: n for name, n in sizes.items() if name not in baseline} | ||
| return (sum(owned.values()), len(owned) > 0) | ||
| return get_hf_download_state(repo_ids, repo_type = repo_type, cache_dir = cache_dir) |
There was a problem hiding this comment.
Scope snapshot stall progress to the child process
Snapshot attempts fall back to the repo-wide cache measurement here, so a concurrent same-repo download that is still writing any other blob keeps changing current_size and resets the stalled snapshot child's timer. In that environment the child with its own unchanged .incomplete can hang indefinitely without switching to HTTP until the sibling stops; the snapshot watchdog should use the same child-owned partial scoping as single-file downloads, or otherwise exclude sibling progress.
Useful? React with 👍 / 👎.
|
Thanks for the pass. Checked all five against the actual loader allow-lists ( 1. Require missing Diffusers components (POST) - not taken. 2. Match shard indexes to the selected shard set - not taken (not reachable). 3. Require exact files even when allow_patterns has globs - not taken. 4. Distinguish root model globs from adapter weights - not taken (not reachable). 5. Scope snapshot stall progress to the child process - not taken. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26bd3d56f9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| rels: list = [] | ||
| try: | ||
| for entry in snapshot_dir.rglob("*"): | ||
| if not _is_default_load_weight_file(entry.name): |
There was a problem hiding this comment.
Exclude variant component weights from plain diffusers checks
When a previous variant='fp16' diffusers download left only files such as unet/diffusion_pytorch_model.fp16.safetensors, a later plain warmup with variant=None accepts that snapshot because this check only verifies the suffix. _download_result_usable can then return true even though the plain pipeline load needs the non-variant component weight, so the subsequent in-process load may fetch the missing diffusion_pytorch_model.safetensors through Xet instead of retrying over HTTP.
Useful? React with 👍 / 👎.
| if not _is_loadable_weight_file(entry.name): | ||
| continue | ||
| try: | ||
| if entry.is_file(): | ||
| weights.append(entry.relative_to(snapshot_dir).as_posix()) |
There was a problem hiding this comment.
Ignore training checkpoints for patterned model warms
When a patterned model request is broad enough to include training checkpoints, such as allow_patterns=['*'] or ['*.bin'], this loop treats checkpoint-500/pytorch_model.bin as the selected readable model weight. Those checkpoint subtrees are not read by the subsequent base from_pretrained, so a stale checkpoint-only snapshot can pass _download_result_usable and then fetch the missing root weights in-process via Xet instead of retrying over HTTP.
Useful? React with 👍 / 👎.
| base = index_name.split(marker, 1)[0] | ||
| if not base: | ||
| return None | ||
| prefix = f"{dir_rel}/" if dir_rel else "" | ||
| return f"{prefix}{base}-00001-of-00002.{ext}" |
There was a problem hiding this comment.
Preserve variant tokens in malformed-index probes
For variant shard indices like diffusion_pytorch_model.safetensors.index.fp16.json, this builds a probe without the fp16 token. With an allow pattern such as unet/*fp16*, a truncated or malformed variant index is treated as out of scope; if one matching shard is present, the snapshot can be accepted even though the loader will read the broken index or missing shards in-process rather than retrying over HTTP.
Useful? React with 👍 / 👎.
| if _patterns_are_exact_names(allow_patterns): | ||
| return False # an exact-named subset defers to the exact-file presence check |
There was a problem hiding this comment.
Validate exact sharded weight requests
When allow_patterns is an exact list, this returns before checking shard indices. An exact request that includes a sharded weight index plus only some of the listed shards can therefore pass _download_result_usable as soon as the named files exist, even though the loader will read the index and fetch the missing shards in-process through Xet instead of retrying the incomplete snapshot over HTTP.
Useful? React with 👍 / 👎.
…oad check _has_diffusers_component_weight counted ANY loadable weight in a declared component subfolder, including a variant one (unet/diffusion_pytorch_model.fp16.safetensors). So a PLAIN diffusers warm (variant=None) whose silently-stale snapshot_download returned a cache left by a prior variant='fp16' pull passed _download_result_usable, and the in-process DiffusionPipeline load -- which reads the NON-variant diffusion_pytorch_model.safetensors -- then fetched the missing plain weight over un-killable Xet, the hang this fallback exists to prevent. The plain component-presence check now counts only CANONICAL (non-variant) component weights via _CANONICAL_COMPONENT_WEIGHT_RE (a base with no intermediate dotted token, single or numbered shard, safetensors/bin). This mirrors the checks that were already variant-aware: the root check (_CANONICAL_ROOT_MODEL_WEIGHT_RE excludes model.fp16.safetensors) and the plain component SHARD regex in _diffusers_component_shards_incomplete. A complete plain pipeline, a pipeline shipping both plain and variant weights, and the variant='fp16' warm of the same cache are all still accepted -- the variant analog (_has_diffusers_component_variant_weight) is untouched, so no complete download is false-rejected into a DownloadStallError. Adds a regression test (185 -> 186). Safety-invariant fuzz stays at 0 false-accepts; stall->HTTP e2e sim stays green. Not taken (same review): - Ignore training checkpoints for patterned model warms (allow=['*'] / ['*.bin']): not loader-reachable -- a base load passes allow_patterns=None (root-scoped, checkpoint subtrees already excluded), never a broad ['*'] / ['*.bin']; the patterned warms are gguf / adapter / subfolder-scoped. - Preserve variant tokens in malformed-index probes (allow=['unet/*fp16*']): not loader-reachable -- a variant diffusers warm passes allow=None or the subfolder ['unet/*'] glob, both of which already flag a malformed variant component index (the probe survives an allow=None or 'unet/*' filter); no loader emits a variant-infix glob. - Validate exact sharded weight requests: not loader-reachable -- every loader warm carries a glob (*.py, ...), so _patterns_are_exact_names is never true for a loader; a direct caller that lists an index but only some of its shards is honoring exactly the files it named.
|
Thanks for the pass. Checked all four; one was real and is fixed in Fixed (finding 1): variant component weights leaking into the plain Diffusers check. Not taken (finding 2): ignore training checkpoints for patterned model warms. Not taken (finding 3): preserve variant tokens in malformed-index probes. Not taken (finding 4): validate exact sharded weight requests. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21fcb5580d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| repo_id, probe_filename, repo_type = repo_type, revision = revision, cache_dir = cache_dir | ||
| ) | ||
| if isinstance(cached, str) and os.path.exists(cached): | ||
| return cached |
There was a problem hiding this comment.
Refresh mutable file revisions before returning cache
When revision is omitted or is a branch/tag, try_to_load_from_cache only resolves the local cached ref and does not perform the HEAD/etag check that hf_hub_download normally does. Returning here means a previously cached main file is reused forever after the Hub repo moves, unless the caller knows to pass force_download=True, so users can silently load stale weights/configs.
Useful? React with 👍 / 👎.
| ignore_patterns = ignore_patterns, | ||
| variant = variant, | ||
| ): | ||
| return cached_dir |
There was a problem hiding this comment.
Refresh mutable snapshots before taking the fast path
For the common revision=None/main case, this local_files_only=True probe validates only the cached snapshot currently recorded in refs/main; it never checks whether the remote branch now points at a newer commit as snapshot_download normally would. A complete old snapshot therefore bypasses the child download and users keep loading stale model contents until they force a download.
Useful? React with 👍 / 👎.
| logical = _exact_weight_logical(base) | ||
| groups.setdefault((parent, logical), []).append(rel) | ||
| return all( | ||
| any(candidate in present for candidate in candidates) for candidates in groups.values() |
There was a problem hiding this comment.
Allow unmatched exact snapshot filters
When allow_patterns is an exact list, Hugging Face treats those names as filters, not required files: an unmatched pattern is fine if the repo simply does not contain that optional file. This check instead requires every exact non-weight-equivalent name to be present after a successful child download, so a request such as ['tokenizer.json', 'tokenizer.model'] against a repo that only has one tokenizer file is retried over HTTP and then fails with DownloadStallError despite the snapshot being valid.
Useful? React with 👍 / 👎.
| # spawning that child, never for a training/inference process. The | ||
| # unconditional HF cache redirect above still runs, so the child writes to the | ||
| # same cache as the parent. | ||
| _SKIP_GPU_INIT = os.environ.get("UNSLOTH_ZOO_DISABLE_GPU_INIT", "0") == "1" |
There was a problem hiding this comment.
Make shared fallback importable without Unsloth
When Studio or another lightweight caller imports unsloth_zoo.hf_xet_fallback in a normal process, this opt-in gate leaves _SKIP_GPU_INIT false unless the caller already knew to set UNSLOTH_ZOO_DISABLE_GPU_INIT=1, so package import continues into the existing torch/unsloth requirements and can raise before the new shared fallback module is available. Since the new module is intended to be imported outside the training process, it needs a default import-light path for that context rather than only for spawned children.
Useful? React with 👍 / 👎.
| continue | ||
| except OSError: | ||
| return False | ||
| return bool(_filter_paths(weights, allow_patterns, ignore_patterns)) |
There was a problem hiding this comment.
Require base weights for broad patterned model warms
When a model snapshot is checked with a broad allow_patterns such as ['*.safetensors'], this accepts any selected loadable weight as proof the model warm is usable. A stale cache containing only adapter_model.safetensors therefore passes _download_result_usable even though a default base-model load reads model.safetensors/pytorch_model.bin and will fetch the missing base weight in-process over Xet after the protective child returns.
Useful? React with 👍 / 👎.
|
Thanks. Checked all five against the loader paths and the module's actual consumers; none needs a change on Not taken (1 & 2): refresh mutable file/snapshot revisions before the fast path. Not taken (3): allow unmatched exact snapshot filters. Not taken (4): make the shared fallback importable without Unsloth. Not taken (5): require base weights for broad patterned model warms. |
…atchdog, and honor exact variant shard requests Three fixes from a consolidated 15-reviewer pass (12x reviewer.py personas + 3 forked reviewers). 1. from_tf / from_flax POST false-reject (HIGH, loader-reachable). A base load with from_tf=True / from_flax=True prefetches with ignore=['*.safetensors','*.bin',...] and reads tf_model.h5 / flax_model.msgpack. _root_model_has_weight recognized a root weight only via _CANONICAL_ROOT_MODEL_WEIGHT_RE (safetensors/bin), so a COMPLETE h5/msgpack download read as "no weight" -> POST rejected it -> HTTP retry -> same h5-only result -> DownloadStallError on a working load. It now also counts a canonical TF/Flax root weight (_CANONICAL_ROOT_TF_FLAX_WEIGHT_RE), gated on _pytorch_root_weight_formats_ignored (BOTH model.safetensors and pytorch_model.bin dropped by the ignore filter -- the from_tf/from_flax signature). A normal load keeps a PyTorch format so the gate is False and behavior is unchanged; a stray leftover h5 under a normal load still does not count. 2. Single-file watchdog resumed-partial hang without pid ownership (Med, Windows/macOS without psutil). When a child_pid is supplied but its open files cannot be inspected (no psutil AND no /proc), _measure followed only post-baseline partials; a RESUMED partial reuses a baseline blob-hash name, so a frozen Xet resume was excluded forever and never tripped the stall -- the exact hang the fallback exists to prevent. That branch now falls back to the repo-wide measure (as snapshots use), so a resumed partial is watched. The child_pid=None path (no pid at all) keeps the post-baseline name filter unchanged; the precise open-fds path (psutil / /proc) is untouched. 3. Exact-named variant shard requests false-rejected (Low). The plain branch skipped shard-set completeness for an exact-name allow list, but the variant branch did not, so allow=['model.fp16-00001-of-00002.safetensors'] + variant='fp16' -- a satisfied exact request -- was rejected as an incomplete shard set. The _patterns_are_exact_names escape now sits above both branches, so an exact-named subset (variant or plain) defers to the exact-file presence check. Adds 3 regression tests (186 -> 189). Safety-invariant fuzz stays at 0 false-accepts (PRE stays conservative -- these are all POST leniency / watchdog fixes); stall->HTTP e2e sim stays green. Not taken (same review pass): - Injected prepare_for_http_fn missing owned_incomplete_blobs (flagged by several personas): by design and documented at the call site -- an injected (Studio) hook owns its own marker-based cache accounting and keeps the plain (repo_type, repo_id) signature; the default hook gets the owned set. - Missing HF kwargs (local_dir / max_workers / etag_timeout): the helper is a cache-based prefetch wrapper, not a full snapshot_download drop-in; local_dir bypasses the cache the completeness gate reasons about. Current callers pass only what the prefetch needs. - token dropped on the in-process cache probes: both are local_files_only=True (cache-only, no network), so no auth is needed there. - License metadata (LGPL vs AGPL): out of scope per maintainer. - Snapshot repo-wide stall masked by a concurrent same-repo pull in a separate process: documented tradeoff; per-pid snapshot scoping would risk false-stalling a progressing snapshot.
…t-checkpoint shard sets, harden the process-group kill, and drop dead code Four fixes from a second consolidated review pass (12x reviewer.py personas + 3 forked reviewers) on 930d027. 1. Sharded TF/Flax POST false-accept -> hang (from the prior from_tf/from_flax fix). TF/Flax weights can be SHARDED (transformers TF2_WEIGHTS_INDEX_NAME=tf_model.h5.index.json, FLAX_WEIGHTS_INDEX_NAME=flax_model.msgpack.index.json). _CANONICAL_ROOT_TF_FLAX_WEIGHT_RE matched the shard form, so a LONE shard read as a present weight and _download_result_usable accepted an INCOMPLETE sharded set -> the in-process load then fetched the missing shard over un-killable Xet. The regex is now single-file only (tf_model.h5 / flax_model.msgpack); a sharded set is judged through its index via _weight_shard_index_complete (every listed shard present). Complete sharded and single-file downloads are accepted; an incomplete shard set or a lone shard without its index is retried over HTTP. 2. Explicit-checkpoint shard set accepted incomplete -> hang. _selected_shard_index_incomplete skipped EVERY training-checkpoint subtree, but subfolder=checkpoint-N (allow=['checkpoint-N/*']) is an explicit load of that checkpoint. A lone checkpoint-7/model-00001-of-00002.safetensors with no index was accepted, then the load fetched the missing index/shards over Xet. The skip is now conditional (_request_scopes_into_dir): a checkpoint the request explicitly targets is completeness-checked, while a leftover checkpoint a base/adapter/other-subfolder warm does not read is still skipped (no false-reject). 3. _terminate_process_group could signal the WRONG process group. The SIGTERM path called os.killpg(pid) whenever pid was non-null; before the child runs os.setsid() its pid is not yet a pgid, and a recycled pid could collide with an unrelated group. It now signals the group only when os.getpgid(pid) == pid (the child is confirmed its own leader), else falls back to a single-process signal -- matching the caution the post-reap SIGKILL branch already documents. 4. Removed dead _has_any_weight (defined, never called; the readable-weight checks supersede it) and the two docstrings that named it. Adds 2 regression tests (189 -> 191). Safety-invariant fuzz stays at 0 false-accepts; stall->HTTP e2e sim (which exercises the kill path) stays green. Not taken (same review pass): - Require ALL declared Diffusers components in POST: a static heuristic false-rejects a complete pipeline that declares weightless components (scheduler/tokenizer/feature_extractor, safety_checker:[null,null]); POST stays lenient and one declared component weight already separates a real warm from a config-only stale cache. - Injected prepare_for_http_fn missing owned_incomplete_blobs (5 personas): by design and documented at the call site -- an injected (Studio) hook owns its own marker-based cache accounting and keeps the plain (repo_type, repo_id) signature. - Transient-error / crashed-child HTTP retry uses the coarse mtime guard rather than an owned-partial set: the stall path captures owned partials from the child's open fds, but a crashed/errored child has no fds to inspect, so the coarse guard (spares actively-progressing siblings) is the only option there -- inherent, not an asymmetry that can be closed. - Broad ['*.safetensors'] base-model warm accepts an adapter-only cache: not loader-reachable (a base load passes allow_patterns=None, never a broad glob). - Missing HF kwargs (local_dir / max_workers / etag_timeout): the helper is a cache-based prefetch wrapper, not a full snapshot_download drop-in; local_dir bypasses the cache the completeness gate reasons about. Current callers pass only what the prefetch needs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2089bfd95
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if want_variant is None: | ||
| shard_file_re = re.compile(r"^[^.]+-\d{5}-of-\d{5}\.(?:safetensors|bin)$") | ||
| else: | ||
| v = re.escape(want_variant) | ||
| shard_file_re = re.compile(rf"^[^.]+\.{v}-\d{{5}}-of-\d{{5}}\.(?:safetensors|bin)$") |
There was a problem hiding this comment.
Check TF/Flax shard sets for patterned warms
For patterned from_tf/from_flax warms such as allow_patterns=['*.h5'] or ['*.msgpack'], this helper only recognizes safetensors/bin shard files and indices. A stale snapshot containing only tf_model-00001-of-00002.h5 (or an *.h5.index.json missing a shard) therefore passes _download_result_usable, and the later from_pretrained(..., from_tf=True) / from_flax=True load still has to fetch the index or remaining shard in-process over Xet instead of taking the supervised HTTP retry path. Please include the TF/Flax shard/index names in the selected-shard completeness check, not just the unpatterned root check.
Useful? React with 👍 / 👎.
| if index_path.is_file() and _weight_shard_index_complete(index_path): | ||
| return True |
There was a problem hiding this comment.
Honor ignores for TF/Flax shard indices
When the request excludes one of the TF/Flax formats, the single-file path above applies ignore_patterns, but this sharded-index path accepts any complete TF/Flax index regardless of whether that format was ignored. For example, a from_tf warm that ignores *.msgpack can be satisfied by a stale complete flax_model.msgpack.index.json while tf_model.h5 is absent, so the later from_pretrained(..., from_tf=True) fetches the missing TF weights in-process over Xet instead of using the supervised retry. Apply the same format filter to these index checks before accepting them.
Useful? React with 👍 / 👎.
| "adapter_model.safetensors", | ||
| "adapter_model.bin", | ||
| "consolidated.safetensors", | ||
| "consolidated.00.pth", | ||
| "diffusion_pytorch_model.safetensors", |
There was a problem hiding this comment.
Classify dashed adapter/component shard globs as weights
The probe set omits dashed adapter and component shard names, so filters like allow_patterns=['adapter_model-*', 'adapter_model.safetensors.index.json'] or ['unet/diffusion_pytorch_model-*', ...] are classified as weightless. In those patterned shard warms, _download_result_usable skips the readable-weight and shard-completeness checks entirely, so an index-only or config-only stale snapshot can be accepted and the later PEFT/Diffusers load fetches the missing shards in-process over Xet. Add representative dashed adapter/component shard probes so these requests stay weight-bearing.
Useful? React with 👍 / 👎.
| if not is_diffusers and _request_selects_canonical_root_shards( | ||
| allow_patterns, ignore_patterns | ||
| ) and _has_incomplete_canonical_root_shards(snapshot_dir, ignore_patterns = ignore_patterns): |
There was a problem hiding this comment.
Scope root shard rejects to the allowed format
For patterned root warms, this only checks that the request selects some canonical root shard and then runs the root-shard completeness check without allow_patterns. With allow_patterns=['pytorch_model*'], a successful bin-only snapshot containing pytorch_model.bin is rejected if an unrelated stale model.safetensors.index.json is missing shards, even though that safetensors set is outside the filter and a bin load will not read it; after the HTTP retry this becomes a DownloadStallError for a valid filtered snapshot. Thread the allow scope into the root shard check or choose the selected format before failing.
Useful? React with 👍 / 👎.
| _CANONICAL_COMPONENT_WEIGHT_RE = re.compile( | ||
| r"^[^.]+(?:-\d{5}-of-\d{5})?\.(?:safetensors|bin)$" |
There was a problem hiding this comment.
Reject noncanonical weights in Diffusers components
This regex accepts any no-dot safetensors/bin basename inside a declared Diffusers component, including files such as unet/adapter_model.safetensors. A stale cache with model_index.json declaring unet and only that adapter sidecar present passes _download_result_usable, but DiffusionPipeline will still look for the component's canonical weight such as diffusion_pytorch_model.safetensors/model.safetensors and can fetch it in-process over Xet. Restrict declared component evidence to the weight basenames Diffusers/Transformers components actually load.
Useful? React with 👍 / 👎.
…n CI Two fixes from a third consolidated review pass (12x reviewer.py personas + 3 forked reviewers) on b2089bf. 1. Nested checkpoint subfolder POST false-accept -> hang. _request_scopes_into_dir matched the checkpoint dir only against the FIRST path segment, so a nested subfolder=foo/checkpoint-7 (allow=['foo/checkpoint-7/*']) did not register as an explicit checkpoint load: the checkpoint subtree was skipped, a lone incomplete shard there was accepted, and the in-process load then fetched the missing index/shards over un-killable Xet -- the same hang the flat-checkpoint fix closed, one directory deeper. The scope check now matches the target dir against ALL literal leading segments (up to the first glob), so an explicitly-loaded checkpoint at any depth is completeness-checked while a leftover checkpoint the request does not target is still skipped. 2. test_hf_xet_fallback.py now runs in CI. The 191-test fallback suite was only reached by `pytest --collect-only` (continue-on-error), so it could regress without failing CI. It is now in the behavioral HARD GATE alongside the other CPU-pure gated files, so a future change that breaks the watchdog / cache-completeness / Xet->HTTP fallback behavior blocks the merge. Adds a nested-checkpoint regression assertion. Full suite 191 passed; safety-invariant fuzz stays at 0 false-accepts. Known low-reachability edge (documented, not fixed): a from_tf/from_flax load with a non-root subfolder AND a SHARDED TF/Flax weight AND a silent-stale transient can POST-accept an incomplete shard set -- the sharded TF/Flax completeness check landed on the allow=None root path, and the patterned-subfolder path's index detection is safetensors/bin-only. The conjunction (from_tf is already niche + subfolder + sharding + a transient-error stale return) is essentially unreachable, and PRE still defers safely; extending the generic patterned-shard machinery to h5/msgpack is not worth its surface for that edge. Not taken (same review pass): - Retryable-error / crashed-child HTTP retry uses the coarse mtime guard, not an owned-partial set: the stall path captures owned partials from the LIVE child's open fds, but a child that returned a retryable error / crashed has already exited, so there are no fds to inspect -- the coarse guard (which spares an actively-progressing sibling by mtime) is the only option there. Inherent, not a closable asymmetry. - Injected prepare_for_http_fn missing owned_incomplete_blobs: by design and documented at the call site -- a Studio hook owns its own marker-based cache accounting. - Broad ['*.safetensors'] / ['model*','adapter_model*'] accepting one selected weight: not loader-reachable (a base load passes allow=None; no loader requests base+adapter together). - Exact ['config.json','tokenizer.json'] over-requiring optional files: not loader-reachable (every loader allow-list carries a glob, so the exact-file check is never the deciding gate). - import unsloth_zoo.hf_xet_fallback not import-light by default: intentional opt-in (UNSLOTH_ZOO_DISABLE_GPU_INIT=1); the download child sets it, Studio imports its own fork.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb80cf9a16
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if base.endswith(_WEIGHT_FILE_SUFFIXES): | ||
| return True |
There was a problem hiding this comment.
Treat trainer state files as weightless filters
When an exact allow_patterns entry names a trainer artifact such as optimizer.pt, training_args.bin, or rng_state_0.pth, this branch marks the request as weight-bearing solely because the basename has a weight-like suffix. The later readable-weight check excludes those same artifacts via _is_loadable_weight_file, so a successful filtered snapshot containing the requested state file is classified as incomplete and retried/failed with DownloadStallError. Apply the same non-weight basename/prefix exclusion here before returning true for exact suffix matches.
Useful? React with 👍 / 👎.
| "model.fp16.safetensors", | ||
| "model-00001-of-00002.safetensors", | ||
| "pytorch_model.bin", | ||
| "pytorch_model-00001-of-00002.bin", |
There was a problem hiding this comment.
Classify pytorch variant globs as weights
For variant-specific bin filters such as allow_patterns=['pytorch_model.fp16*'], this probe set has no representative pytorch_model.fp16.bin / sharded variant name, so _pattern_can_select_weight classifies the request as weightless. A stale config-only snapshot is then accepted even with variant='fp16', and the later from_pretrained(..., variant='fp16', use_safetensors=False) load can fetch the missing pytorch_model.fp16.bin in-process over Xet; add PyTorch-bin variant probes alongside the safetensors variant probe.
Useful? React with 👍 / 👎.
| _active_incomplete_blob_sizes(repo_type, repo_id, params.get("cache_dir")) | ||
| ) | ||
| owned = current - baseline_partials | ||
| params["_owned_incomplete_blobs"] = owned |
There was a problem hiding this comment.
Don't scope HTTP cleanup to an empty owned set
When a snapshot watchdog fires before the child has opened any .incomplete file, such as a stale dangling symlink from an older snapshot making has_active_incomplete_blobs true during a slow metadata call, _child_open_incomplete_blobs returns an empty set and this stores it as the owned cleanup scope. _default_prepare_for_http treats that empty set as authoritative and skips every stale blob/link, leaving the same broken symlink to retrip the HTTP attempt; use None for the no-owned-partials case or avoid child-owned scoping for snapshot-wide stale state.
Useful? React with 👍 / 👎.
| for shard_rels in by_fmt.get("safetensors") or by_fmt.get("bin") or []: | ||
| for shard in shard_rels: | ||
| try: | ||
| if not (snapshot_dir / shard).exists(): | ||
| return True |
There was a problem hiding this comment.
Let single subfolder weights beat stale indices
When a patterned subfolder warm selects both encoder/model.safetensors and a stale encoder/model.safetensors.index.json, this loop rejects the snapshot as soon as any shard listed by the index is absent. Transformers checks the single safetensors file before the index in that same directory, so this valid warmed subfolder is retried and can end in DownloadStallError; mirror the root-weight precedence here and ignore a selected index when the corresponding single weight of the preferred format is present.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb80cf9a16
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if base.endswith(_WEIGHT_FILE_SUFFIXES): | ||
| return True | ||
| if any(ch in base for ch in _GLOB_CHARS): | ||
| return any(fnmatch.fnmatchcase(probe, base) for probe in _WEIGHT_PATTERN_PROBES) |
There was a problem hiding this comment.
Treat directory-prefix globs as weight-bearing
When allow_patterns uses a directory-prefix glob without a slash, such as ['checkpoint-*'] or ['unet*'], Hugging Face's filter selects files below matching directories, but this check only compares the final basename to weight-file probes and returns False. That makes request_can_include_weights treat these weight-bearing snapshot requests as weightless, so a stale config-only subset can be accepted and the later checkpoint/component load may fetch the missing weights in-process instead of through the fallback.
Useful? React with 👍 / 👎.
| target = Path(os.readlink(link)) | ||
| if not target.is_absolute(): | ||
| target = link.parent / target | ||
| incomplete_partner = target.with_name(target.name + INCOMPLETE_SUFFIX) |
There was a problem hiding this comment.
Spare unique incomplete partners for live symlinks
In supported newer huggingface_hub versions, the temporary partial for blob <etag> is named like <etag>.<uuid>.incomplete, not exactly <etag>.incomplete. When default HTTP prep runs without an owned-partial set, a concurrent same-repo sibling can have a dangling snapshot symlink to <etag> plus a fresh unique partial, but this exact lookup misses the active partner and unlinks the sibling's live symlink instead of sparing it, leaving that sibling snapshot vulnerable to being returned with the pointer removed.
Useful? React with 👍 / 👎.
| "retrying with HF_HUB_DISABLE_XET=1", label | ||
| ) | ||
| _safe_status(on_status, f"{label}: download crashed, retrying over HTTP") | ||
| disable_xet = True |
There was a problem hiding this comment.
Clear crashed snapshot partials before HTTP retry
When a snapshot child crashes shortly after creating a partial, this retry path flips to HTTP without recording any child-owned partials. The next loop's HTTP prep treats the just-written .incomplete as a fresh active sibling and spares it, so the snapshot watchdog still sees a constant repo-wide incomplete blob and can declare the HTTP retry stalled during metadata or lock wait even though the crashed child's partial should have been removed before retrying.
Useful? React with 👍 / 👎.
The shared Xet->HTTP fallback does not require a version bump; keep the base version.
Condense docstrings and inline comments across hf_xet_fallback.py, hf_cache_state.py, the __init__.py opt-in block, and the test suite: drop restating-the-obvious comments, collapse multi-line blocks, cut small-helper docstrings to one line. Comments/docstrings only, no code changed (AST-verified). Preserves the load-bearing rationale tersely (STRICT pre-download / LENIENT post-download, the un-killable Xet hang reason, and the scoping / format / variant / from_tf / diffusers / checkpoint / redaction notes). Full suite 191 passed; safety fuzz 0 false-accepts.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Consolidated review pass (12x reviewer.py personas + 3 forked reviewers). 8 of 12 reviewers independently reproduced a post-download false-accept: _has_diffusers_component_weight returned True as soon as ANY declared component held a weight, so a stale partial pipeline (model_index.json declaring unet + vae with only unet present, or vae present with its config but no weight) was accepted -- then the in-process DiffusionPipeline load fetched the missing component over un-killable Xet, the exact hang this fallback prevents. Diffusion is a real loader consumer (diffusion.py warms with allow_patterns=None), so it is reachable. An unpatterned diffusers warm now requires that every DECLARED ACTIVE component (a [library, class] spec with both non-null) is materialised as a non-empty subfolder AND that each model-style component (one carrying config.json) holds a readable weight of the read format. A [null, null] disabled component (e.g. safety_checker) and weightless components (scheduler / tokenizer / feature_extractor, which carry *_config.json, not config.json) are not required, so a complete or safety-checker-disabled pipeline is not false-rejected into a DownloadStallError. A variant load accepts a component's canonical weight as diffusers' per-component fallback, so a mixed variant/canonical pipeline still passes. A malformed / empty model_index.json fails OPEN to the prior lenient any-weight check. Adds a regression over component-absent, config-only, complete-with-null-safety- checker, variant-absent, and mixed-variant cases. Full suite 192 passed; safety- invariant fuzz stays at 0 false-accepts. Not taken (re-verified, no new evidence): - Injected prepare_for_http_fn dropping owned_incomplete_blobs: by design; a Studio hook owns its marker-based cache accounting (documented at the call site). - Retryable / crashed-child HTTP prep using the coarse mtime guard: an exited child has no open fds to scope owned partials; inherent, not closable. - Missing hf_hub kwargs / non-import-light default: a purpose-built prefetch helper (local_dir breaks the cache model) and an intentional opt-in (UNSLOTH_ZOO_DISABLE_GPU_INIT), not a general hf_hub_download replacement. - Patterned TF/Flax sharded-subfolder edge: documented low-reachability; PRE never accepts tf/flax, so it can only ever cause a safe retry.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Follow-up review pass (12x reviewer.py personas + 3 forks) on the diffusers component-completeness gate. A residual false-accept remained: an active component dir that is non-empty but carries NEITHER a top-level config.json NOR a readable weight NOR a weightless sidecar (only a stray file or a nested training-checkpoint) passed the presence rule and skipped the config-gated weight rule, so a stale partial mid-download component was accepted and the in-process pipeline load then fetched its weight over un-killable Xet. A component now satisfies the warm only if it holds a readable weight of the read format, OR is WEIGHTLESS-shaped (a scheduler / tokenizer / feature_extractor sidecar: a *_config.json / tokenizer file and no config.json). A dir with neither is an incomplete model component and is retried over HTTP. Complete pipelines (including SDXL dual text encoders, sharded components, fp16 mixed / canonical fallback) still accept; the safety-invariant fuzz stays at 0 false-accepts and the suite is 193 passed. Not changed (re-verified this pass): the variant floor requiring at least one variant weight is correct -- diffusers itself raises on a zero-variant repo, so a canonical-only request for a variant is not a load diffusers would complete. The recurring reviewer.py "stale-branch revert" flag is a harness artifact (GitHub reports the PR mergeable / clean).
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Summary
Hugging Face Xet (
hf_xet) is the fast default transport, but a Xet transfer can hang on a blob with no progress and no exception, and a blocked native Xet thread cannot be killed in-process. Unsloth Studio already recovers from this (a no-progress watchdog respawns the download over plain HTTP withHF_HUB_DISABLE_XET=1). This PR lands the same logic inunsloth_zooso it has a single home that both Unsloth main and Studio can share, rather than living only inside Studio.This is the foundation PR. A follow-up in the
unslothrepo wiresFastModel.from_pretrainedto it and re-points Studio's copy to import from here.What this adds
unsloth_zoo/hf_cache_state.py: the sparse-aware HF cache primitives. On-disk byte accounting prefersst_blocksso a partially written Xet /hf_transfer.incompleteis not mistaken for full-size progress, plus active.incompletedetection over the currentHF_HUB_CACHEroot.unsloth_zoo/hf_xet_fallback.py:start_watchdog: a 180s no-progress watchdog that fires only while a.incompleteis present and the on-disk size is unchanged, and resets when none is present (so model init is never misread as a stall).hf_hub_download_with_xet_fallback: a single-file download. A cached file short-circuits with no child; otherwise the download runs in aspawnchild (its own process group) supervised by the watchdog. On a stall it kills the child group, makes the partial safe for HTTP, and respawns once withHF_HUB_DISABLE_XET=1. Cancellation and deterministic errors (auth / missing / disk full) propagate without a fallback.snapshot_download_with_xet_fallback: the whole-repo variant. It warms a repo in a killable child (the entrypoint Unslothfrom_pretrainedwill use before its in-process load), with an in-processlocal_files_onlyfast path so a warm cache spawns nothing.UNSLOTH_DISABLE_XETknob (also honorsUNSLOTH_STABLE_DOWNLOADSandHF_HUB_DISABLE_XET) to force HTTP from the first attempt, plus injectableprepare_for_http_fn/ scrub hooks so Studio can pass its marker-aware cache management.Both new files are AGPL-3.0 (per-file SPDX), matching the Studio source they are lifted from; the rest of
unsloth_zoostays LGPL-3.0.Tests
tests/test_hf_xet_fallback.py: CPU-only, no network, no real subprocess (the per-attempt download seam is monkeypatched). Covers the watchdog state machine (constant vs growing vs absent.incomplete, sparse-aware bytes, fires at most once), the transport decision (cached short-circuit, cancel and deterministic errors take no fallback, exactly one Xet to HTTP retry, per-file independence), the injected prepare seam, theUNSLOTH_DISABLE_XETknob, the snapshot fast path and stall fallback, and theHF_HUB_DISABLE_XETprecondition in a fresh interpreter.Result: 22 passed, 1 skipped (the sparse-blocks assertion skips on filesystems that report
st_blocks == 0).