Studio: local diffusion image generation (CI validation)#88
Open
danielhanchen wants to merge 59 commits into
Open
Studio: local diffusion image generation (CI validation)#88danielhanchen wants to merge 59 commits into
danielhanchen wants to merge 59 commits into
Conversation
Backend - core/inference/diffusion.py: DiffusionBackend singleton that loads diffusion GGUFs from Hugging Face via diffusers.GGUFQuantizationConfig and runs them on the active CUDA / MPS / CPU device. Supports FLUX.2, FLUX.2 klein, FLUX.1, Qwen-Image, Stable Diffusion 3, and SDXL. - routes/inference.py: POST /api/inference/images/load, POST /api/inference/images/generate, POST /api/inference/images/unload, GET /api/inference/images/status mirroring the llama-server lifecycle. - models/inference.py: DiffusionLoadRequest, DiffusionGenerateRequest, DiffusionGenerateResponse pydantic schemas with prompt / step / size validation up front so callers get clear 422s rather than VAE crashes. - requirements/no-torch-runtime.txt: pin gguf alongside the existing diffusers entry so GGUFQuantizationConfig works out of the box. - tests/test_diffusion_backend.py + tests/test_diffusion_routes.py: 27 unit tests covering family detection, validation, lifecycle, and the full FastAPI round trip with the backend stubbed. No torch / diffusers / GPU required to run. Frontend - features/images/: standalone images-page.tsx with curated model picker (FLUX.2 klein 4B / 9B, FLUX.2 dev, FLUX.1 dev), HF token field, prompt + negative prompt, resolution presets, steps + guidance sliders, seed input, and a result gallery that renders base64 PNGs inline. - app/routes/images.tsx: lazy /images route wired into router.tsx. - components/app-sidebar.tsx: PaintBrush02Icon nav item between Recipes and Export, hidden in chat-only mode.
for more information, see https://pre-commit.ci
SectionCard requires an icon prop. Pass GpuIcon, PaintBrush02Icon, and SparklesIcon for the three sections so tsc -b stops failing on TS2741 'Property icon is missing'.
Backend - Fix FLUX.2 klein family default base_repo: black-forest-labs/FLUX.2-klein does not exist on the Hub. Point at the Apache 2.0 4B Base instead so the from_pretrained call works out of the box for ungated users. - Serialise concurrent load_model calls with a dedicated _load_lock so two /images/load requests cannot both reach pipeline_cls.from_pretrained at the same time (would double-spend VRAM and corrupt _pipe). - When the caller passes a full diffusers repo (no gguf_filename), use repo_id directly instead of silently substituting the family default. Closes the load-the-wrong-model regression flagged by review. - Drop negative_prompt from the pipeline call when the loaded pipeline does not accept it (FLUX.2 / FLUX.2 klein). Inspect __call__ via inspect.signature so we do not maintain a manual class list. - Best-effort unload the chat backend (llama-server) before a diffusion load so a 24 GB consumer GPU can swap between chat and diffusion without manual unload steps. Frontend - Replace the four curated entries with the actual filenames published on the Hub (lowercase flux-2-klein-Nb-Q4_K_S.gguf and flux2-dev*). - Add an explicit base_repo per curated entry so the backend never falls back to the family default for the curated picker. - Add the Apache 2.0 FLUX.2 klein base 4B entry so first-time users have an ungated, no-token-required default. - Hide the negative prompt field for FLUX.2 / FLUX.2 klein and show a small explanatory note instead. Tests - Add 6 new backend tests: base_repo override, full-repo (no GGUF) no-substitution, concurrent serialise race, signature-based kwarg filter, negative_prompt strip on FLUX.2, negative_prompt preserved on supporting pipelines. 33 tests passing.
for more information, see https://pre-commit.ci
- unload_model now takes _load_lock so it cannot race with an in-flight load_model and have the load thread overwrite cleared state after unload returned is_loaded=false. - Move stable-diffusion-xl out of _FAMILIES into _FULL_REPO_FAMILIES. SDXL uses a UNet (no transformer GGUF path is wired); listing it in the GGUF families panel was misleading. SDXL full-repo loads still work via family_override='stable-diffusion-xl'. - Result gallery now uses h-auto + object-contain so portrait / landscape outputs render at their true aspect ratio instead of being cropped into a square thumbnail.
_release_chat_backend_for_diffusion now unloads both the GGUF chat backend (llama-server) and the safetensors / HF chat backend (get_inference_backend) before a diffusion load. Mirror the behaviour on the chat-load side: both the Unsloth/transformers load path and the GGUF load path now unload the diffusion pipeline before claiming GPU memory. Closes the OOM-on-swap path flagged by reviewers in both directions.
- _smart_base_repo: pick 9B base for unsloth/FLUX.2-klein-9B-GGUF and -base- variants per the repo id, instead of always falling back to the 4B family default. - pipe_kwargs use_safetensors=True so diffusers refuses pickle .bin weights at load time (defends against compromised base_repo). - Release the previous pipeline BEFORE allocating the new one so peak VRAM stays at one model's worth instead of two on swap. - Reject empty gguf_filename when repo_id ends with -GGUF; the prior behavior tried from_pretrained on a GGUF-only repo and 500'd deep in diffusers with a confusing model-index error. - Status returns gguf_filename (basename) instead of gguf_path so the local cache path / username does not leak to authenticated Studio sessions. - requirements/no-torch-runtime.txt: pin diffusers>=0.37.0 so older installs cannot resolve a version without Flux2KleinPipeline. - Frontend curated distilled klein entries now point at the matching non-base diffusers repos (FLUX.2-klein-4B / -9B) per the published model cards. Update api.ts to mirror the renamed status field.
for more information, see https://pre-commit.ci
- routes/export.py load_checkpoint now unloads the diffusion pipeline alongside the existing inference + training unloads, so an export load after Images does not OOM the export subprocess. - Remove the 'sd3.5' alias from the stable-diffusion-3 family. SD3.5 needs its own family + base_repo (and its own smoke test); pairing it with the SD3 Medium base produced a misleading load.
_release_chat_backend_for_diffusion was importing get_inference_backend from core.inference.inference (the in-subprocess class) and calling unload_model() without the required model_name argument. The TypeError was swallowed and the active chat model stayed resident, defeating the chat-to-diffusion lifecycle handoff. Switch to the orchestrator's accessor at core.inference and pass active_model_name through, mirroring the GGUF chat-load path. Add a regression test that stubs both backends and verifies unload_model is called with the active model name.
for more information, see https://pre-commit.ci
- detect_family adds _FAMILY_EXCLUDE so 'stable-diffusion-3.5' no longer matches the SD3 Medium family and 'qwen-image-edit' no longer matches Qwen-Image. Both were misleading silent loads. - from_single_file now forwards config=<effective_base>, subfolder='transformer', and the HF token. Diffusers-format GGUFs (FLUX.2 klein, Qwen-Image, SD3) need the matching base config or the transformer load picks the wrong shapes; gated GGUFs need the token both for download and config read. - Move _release_chat_backend_for_diffusion + new _release_other_gpu_owners_for_diffusion to AFTER the GGUF download and pipeline class lookup so a typo or transient Hub error does not kill the user's currently-loaded chat model. Peak VRAM still stays at one model's worth because the releases run right before from_pretrained. - _release_other_gpu_owners_for_diffusion: shut down the export subprocess and any active training subprocess before a diffusion load. Symmetric with the export load path. - routes/training.py: unload diffusion before starting training so the new subprocess does not race FLUX/Qwen for VRAM. - routes/export.py: also unload the GGUF llama-server before export load (the existing inference-backend unload only covered the safetensors path).
for more information, see https://pre-commit.ci
…othai#5754 When a swap load fails after the previous pipeline is released, status() previously reported is_loaded=false on top of the OLD repo/family/base_repo metadata, which the frontend then rendered as a misleading 'still loaded: X' label. Clear all metadata atomically with the pipe drop so a failed swap reports a clean empty status plus last_error. Add regression test.
for more information, see https://pre-commit.ci
… precision for PR unslothai#5754 - DiffusionBackend.status() now takes _lock so frontend polling cannot observe a torn snapshot mid-swap. - Scrub hf_token / pipe_kwargs / single_file_kwargs from frame locals before logger.exception() so rich tracebacks and structlog formatters that render locals do not leak hf_... tokens into logs. - routes/models.py delete_cached_repo: refuse to delete the cache underlying a currently-loaded diffusion pipeline (both the GGUF repo and the matching diffusers base_repo). Symmetric with the existing chat-load + GGUF guard. - Frontend seed validation: reject non-integer and out-of-safe- integer-range inputs instead of silently rounding via Number(), which would otherwise send a different seed than what the user typed.
for more information, see https://pre-commit.ci
…e prompt (PR unslothai#5754) QwenImagePipeline and FluxPipeline treat guidance_scale as the distilled CFG factor and expose true_cfg_scale as the real classifier-free guidance knob. Negative prompts only steer the output when true_cfg_scale > 1, so forwarding only guidance_scale left Qwen-Image on the default true_cfg_scale=4.0 and the user's slider value silently ineffective for negative prompts. When the loaded pipeline accepts both negative_prompt and true_cfg_scale and the caller supplies a non-empty negative prompt, forward guidance_scale through both kwargs so the negative prompt actually steers generation. When no negative prompt is supplied, true_cfg_scale is left at the model default to avoid switching distilled CFG models into real-CFG mode (which would double inference cost and degrade quality). Adds two regression tests covering the forward-when-negative and skip-when-no-negative paths.
2329753 to
fb0a31b
Compare
…slothai#5754 The Studio backend's no-torch-runtime.txt is installed via pip --no-deps so the diffusion stack's transitive imports must be pinned explicitly. huggingface_hub's blob downloader (used by diffusers.GGUFQuantizationConfig and by every from_single_file call) imports requests + urllib3 + charset_normalizer at module load time; a fresh --no-deps install would 500 on the first /api/inference/images/load with PackageNotFoundError: 'requests'. Adds requests, urllib3, and charset_normalizer to the transitive-deps block next to the existing httpx chain.
| if not raw or raw == "[DONE]": | ||
| continue | ||
| try: | ||
| parsed = json.loads(raw) |
…nslothai#5754 Round 5 reviewer findings, mostly symmetric-lifecycle and input validation gaps the earlier rounds left open. Backend lifecycle (P1) * routes/training.py: training start now also unloads the GGUF llama-server subprocess; was previously only unloading the safetensors backend, so starting training while a GGUF chat model was loaded kept the subprocess pinned to VRAM. * routes/inference.py: new _raise_if_training_active helper. Both GGUF and standard chat loads, plus /api/inference/images/load, now refuse with HTTP 409 when training is active instead of silently stopping training to free VRAM. * core/inference/diffusion.py: _release_other_gpu_owners_for_ diffusion no longer stops active training. The route layer refuses the request first, so reaching the helper with training live would only happen from programmatic backend calls; better to surface OOM than terminate a long training run. * core/inference/diffusion.py: BF16 dtype is now gated on torch.cuda.is_bf16_supported. Pascal/Turing GPUs report is_available()=True but lack BF16 ALUs; FLUX kernels then fail inside from_pretrained. Falls back to FP16 instead of refusing. * core/inference/diffusion.py: GGUF transformer allocation and pipeline allocation now run AFTER releasing chat/export GPU owners; previously from_single_file ran first and could OOM before the intended VRAM handoff happened. * routes/models.py: /delete-cached now also blocks delete when diffusion is_loading=True (not just is_loaded); concurrent delete during hf_hub_download / from_single_file would have raced the rmtree. * routes/models.py: /delete-finetuned now also checks the diffusion backend before unlinking a Studio outputs/exports path. A user who exported a FLUX LoRA locally and loaded it via /images/load could previously rmtree the directory the diffusion backend was reading from. Backend correctness / safety (P2) * core/inference/diffusion.py: _FAMILY_EXCLUDE for qwen-image now also covers qwen_image_edit / qwenimageedit underscore spellings so '...qwen_image_edit-GGUF' no longer misdetects as Qwen-Image. * core/inference/diffusion.py: detect_family now scans _FULL_REPO_FAMILIES in addition to _FAMILIES, so SDXL repos (stabilityai/stable-diffusion-xl-base-1.0) are auto-detected instead of failing with 'Could not infer a diffusion family'. * core/inference/diffusion.py: generate_image now uses a separate _generate_lock for the pipeline forward instead of holding _lock for the whole call. status() polls and concurrent unload requests no longer block for the full minutes-long generation. * routes/models.py: diffusion delete guard now uses exact repo-id match instead of prefix match; previously loading 'org/model-v2' would block deleting unrelated cached 'org/model'. * models/inference.py: DiffusionLoadRequest now rejects ASCII control characters in repo_id / gguf_filename / base_repo / family via field_validator (closes log-injection surface from authenticated callers). Also caps lengths at 256 chars. * models/inference.py: DiffusionGenerateRequest seed is now bounded to the int64/uint64 range; previously a huge seed (e.g. 2**100) passed Pydantic then crashed inside torch.Generator.manual_seed with 'Overflow when unpacking long long'. Frontend (P2) * features/images/images-page.tsx: Custom HF repo panel now exposes a Pipeline family override dropdown; previously the backend supported it via DiffusionLoadRequest.family but the UI had no way to send it, so custom repos whose names did not contain a hard-coded substring failed to load. * features/images/images-page.tsx: handleLoad now re-fetches status on error. The backend clears its old pipeline before allocating the replacement; a failed swap previously left the UI showing 'Loaded:' with Generate enabled until manual refresh. Tests (10 new) * underscore qwen-image-edit exclusion + SDXL full-repo detection * BF16 fallback when is_bf16_supported() returns False * status() does not block while generate_image holds _generate_lock * route layer rejects control chars in repo_id * route layer rejects 2**100 seeds (uint64-max boundary accepted) * route layer happy-path with negative-prompt true_cfg_scale forwarding (Qwen/Flux) and skip-when-no-neg (distilled CFG)
for more information, see https://pre-commit.ci
… PR unslothai#5754 Round 6 reviewers identified several races between load / unload / generate and several fail-open delete guards. This commit closes them by widening the lock scope, publishing the pending load target through status(), and switching delete guards to fail-closed. Lifecycle (P1) * core/inference/diffusion.py: load_model now also takes _generate_lock. Previous behavior released and reallocated the pipeline while a generation forward was still iterating denoising steps, corrupting scheduler state and stacking VRAM. The forward only briefly touches _lock, so taking it on the load path does not introduce a deadlock. * core/inference/diffusion.py: unload_model now also takes _generate_lock. Without it, /images/unload returned is_loaded=False while a slow forward was still running, which let chat / training / export handoffs allocate VRAM on top of the still-resident pipeline. * core/inference/diffusion.py: previous pipeline release now happens BEFORE from_single_file / from_pretrained. Switching FLUX.2 klein 4B -> 9B on a 16-24 GB GPU was failing because the new transformer allocation overlapped the old pipe's residency. * core/inference/diffusion.py: failed pipeline from_pretrained now explicitly releases the just-loaded transformer; previously its weights stayed pinned to GPU until GC and made the next load more likely to OOM. Pending-target / delete guards (P1) * core/inference/diffusion.py: load_model now publishes _pending_repo_id / _pending_base_repo / _pending_gguf_filename under _lock at the start of the call (and refreshes _pending_base_repo when the smart-base / repo defaults resolve). status() exposes those as 'repo_id' / 'base_repo' / 'gguf_filename' during is_loading=True so delete guards can see the target before _repo_id is set on success. * routes/models.py /delete-cached + /delete-finetuned: diffusion status check now fails CLOSED (HTTP 503) when status() raises. Both guards previously logged and continued, which could let a delete proceed against a repo whose status was unverifiable. * routes/models.py: is_loading is also blocked on both guards so a mid-download / mid-from_pretrained rmtree is refused. Symmetric handoffs (P1) * routes/export.py: /load-checkpoint now refuses with HTTP 409 when training is active instead of calling stop_training(). Chat and /images/load did the same after round 5; export was the remaining asymmetry that would silently kill a long training run. * routes/training.py, routes/inference.py (GGUF and standard chat), routes/export.py: diffusion handoff now treats is_loading as is_loaded. The diffusion backend's unload waits on _load_lock + _generate_lock so an in-flight load completes first. Requirements (P1) * requirements/studio.txt: pin python-multipart explicitly. The Studio routes package's eager router imports include routes/datasets.py whose FastAPI UploadFile/File validation crashes with RuntimeError without it in fresh test envs. Frontend (P2) * features/images/api.ts + images-page.tsx: seed handling now accepts the full [-2^63, 2^64 - 1] range via BigInt. The previous safe-integer cap rejected valid uint64 seeds the backend accepts. A small stringify helper emits BigInts as JSON integers without touching the rest of the payload. Tests * test_diffusion_routes.py: load routes/inference.py via importlib.spec_from_file_location to avoid triggering routes/__init__.py (which would pull in training / datasets / data_recipe imports unrelated to diffusion tests). * test_diffusion_backend.py: status() during is_loading shows pending repo + base; unload waits for in-flight generation.
0c78013 to
04de106
Compare
for more information, see https://pre-commit.ci
acad7c6 to
b7207e3
Compare
Round 15 reviewer aggregate (logs/review_round15_aggregate.md): P1 fixes: - core/inference/llama_cpp.py publishes loading_model_identifier + loading_hf_variant AFTER acquiring _serial_load_lock; previously a queued second load could overwrite or clear the identifier currently in flight, breaking delete-safety and GPU handoff guards. - routes/models.py /delete-finetuned compares the pending llama load against loading_hf_variant (new), not the stale hf_variant from the previous loaded model. Without this, a Q4-loaded directory loading Q8 would still accept a Q8 delete. - core/inference/diffusion.py _release_other_gpu_owners_for_diffusion now also raises when training is active so direct backend callers cannot bypass the route layer's 409 guard. Mirrors the export-active check the same helper already enforces. - routes/models.py /delete-cached diffusion guard compares owned diffusion paths against the HF cache root for the target repo via _all_hf_cache_scans + _is_path_under. Without this, loading from a local models--owner--model/snapshots/<sha> path let the cache delete proceed while the snapshot was still mmap'd. - models/inference.py DiffusionLoadRequest refuses URL-embedded hf_xxxxx tokens in repo_id / base_repo at the API boundary, so the value never reaches self._repo_id and status() can never echo it back to other authenticated sessions. P2 fixes: - core/inference/diffusion.py status() routes UI-facing repo_id / base_repo through _display_repo_id, which collapses absolute local paths to the leaf name (delete guards still see the full path via active_*/pending_*). - routes/inference.py /images/load maps backend RuntimeError that reports an export/training conflict to HTTP 409 instead of 400. - core/inference/diffusion.py detect_family now uses token-boundary matching so owner/flux.20-model does not collide with flux.2. P3 fixes: - tests/test_diffusion_routes.py drops the partial routes.inference module from sys.modules if exec_module() raises, so the real ImportError surfaces instead of a misleading AttributeError on follow-up tests. Tests: - 5 new regression cases (display_repo_id, token-boundary family detection, training-active raise from backend helper, embedded HF token rejection). - All 72 diffusion backend + route tests pass.
b7207e3 to
4c75b61
Compare
for more information, see https://pre-commit.ci
| for holding ``_serial_load_lock`` and for publishing / | ||
| clearing ``_loading_model_identifier`` + ``_loading_hf_variant`` | ||
| in the surrounding try/finally.""" | ||
| if True: |
…lothai#5754 Round 15 split LlamaCppBackend.load_model into a thin wrapper that publishes _loading_model_identifier + _loading_hf_variant under _serial_load_lock and an inner _load_model_impl_locked body that actually launches llama-server. The pre-existing source-inspection regression tests inspected only load_model and broke because the flag literals and _wait_for_vram_settle call now live in the inner method: - tests/test_llama_cpp_no_context_shift.py test_no_context_shift_is_in_load_model test_flag_sits_inside_the_base_cmd_list - tests/test_llama_cpp_wait_for_vram_settle.py test_load_model_calls_helper_outside_lock_and_uses_last_kill_timestamp Update both helpers to concatenate the source of load_model AND _load_model_impl_locked so the assertions still cover the launch path without weakening their scope to the full module.
for more information, see https://pre-commit.ci
4c75b61 to
aa24f21
Compare
Round 16 reviewer aggregate (logs/review_round16_aggregate.md): P1 fixes: - routes/models.py /delete-cached llama guard pairs loading_id with loading_hf_variant so deleting a different cached quant (Q8_0) while another variant (Q4_K_M) is loading is no longer blocked. - core/inference/diffusion.py load_model now calls _release_other_gpu_owners_for_diffusion BEFORE _release_chat_backend_for_diffusion. The other-owners helper RAISES on active training/export, so a route -> worker race or direct backend caller no longer drops the user's chat model before the diffusion load is refused. - routes/models.py /delete-cached diffusion guard fails CLOSED (503) on HF cache scan failure instead of silently falling through to repo-id-only matching, which could miss a loaded local snapshot path. - routes/inference.py _release_llama_for and _release_safetensors_chat_for now raise 503 on actual unload failure (exception or False return), so new GPU workloads do not start while the old chat process still owns VRAM. - core/inference/diffusion.py status() now takes include_internal=False by default and only exposes the guard-facing active_*/pending_* paths when callers opt in. The public /api/inference/images/status route gets the redacted payload; routes/models.py delete guards pass include_internal=True so they still see the raw paths. - core/inference/diffusion.py generate_image_with_metadata routes the response model through _display_repo_id so /images/generate cannot echo back an absolute local path. P2 fixes: - routes/inference.py /images/load now maps backend "Could not verify training/export status" to 503 instead of 409, matching the route-level pre-check. - core/inference/diffusion.py _release_other_gpu_owners_for_diffusion raises "Could not verify export status" when the is_export_active() probe itself raises, instead of silently treating it as active export. - core/inference/diffusion.py detect_family compares compact family spellings (Flux2Klein) against per-token compact strings so unsloth/Flux2Klein-GGUF matches the flux.2-klein family without matching the embedded substring inside flux.20. - main.py installs a RequestValidationError handler that scrubs hf_xxxxx tokens out of the 422 response body so a rejected ``repo_id`` containing a URL-embedded HF token does not echo it back to the browser. Tests: - 3 new regression cases (Flux2Klein compact alias, public status redaction, generate_image_with_metadata redaction). - All 75 diffusion backend + route tests pass.
for more information, see https://pre-commit.ci
aa24f21 to
3c6a47d
Compare
Two diffusion tests broke on the Windows runner after round 16: - test_display_repo_id_collapses_absolute_path used hardcoded POSIX absolute paths; Windows reads /home/... as drive- relative so Path.is_absolute() returns False. Use pytest's tmp_path so the path is platform-correct. - test_load_publishes_pending_target_during_loading regressed because round 16 moved _release_other_gpu_owners_for_diffusion ahead of the chat unload. That helper imports core.training and core.export; on Windows CI the import resolved to a real but partially configured backend, which raised inside the new status-verification path and aborted the load before from_pretrained ran. Stub both modules with idle backends in _install_fake_diffusers. Also updated test_public_status_does_not_leak_local_path_via _active_fields and test_generate_image_with_metadata_redacts_ local_path to use tmp_path for the same Windows reason.
3c6a47d to
b841562
Compare
P1: route-layer chat/diffusion/export releases were still asymmetric. Training start and export load called ``diff_backend.unload_model`` inside a best-effort try/except so a wedged diffusion backend let the next workload allocate over the top of the resident pipeline and OOM. Both now use the strict ``_release_diffusion_for`` helper from routes.inference, which raises HTTPException 503 on status/unload failure or post-check mismatch. P2 #9: diffusion load exceptions can include the absolute local repo / base / gguf path verbatim (FileNotFoundError, OSError from diffusers / safetensors). The path flows into ``_last_error``, which ``status()`` returns to every authenticated session. Collapse the known repo_id / effective_base / gguf_filename paths to their leaf name before storing the error, mirroring the ``_display_repo_id`` convention used for the public repo label. P2 #10: when ``repo_id`` is an absolute local path, ``detect_family`` matched _FAMILY_EXCLUDE deny lists against the full path, so models stored under a parent directory containing ``qwen-image-edit`` or ``3.5`` were misclassified as None. Reduce the family-detection needle to the leaf directory when the input looks like a filesystem path; Hub-style ``owner/repo`` ids continue to use the original needle so existing detection rules keep working. P2 #12: ``gguf_filename`` was missing from the ``_reject_embedded_hf_token`` validator. A URL-form quant path like ``https://hf_xxxxx@huggingface.co/.../flux.gguf`` would be stored on ``DiffusionBackend._gguf_filename`` and surface in status() / log lines. Extend the validator to gguf_filename so the token is dropped before it can leak. All 85 diffusion-relevant backend tests pass locally.
for more information, see https://pre-commit.ci
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model`` did not return False AND that ``is_loaded`` / ``is_active`` / ``loading_model_identifier`` are all cleared after the call. The previous version only treated raised exceptions as failure, so a subprocess refusing to terminate or an in-flight GGUF download let the next workload allocate on top. P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` now raises RuntimeError when ``exp._shutdown_subprocess`` fails on a settled checkpoint. Direct backend callers used to log at debug level and proceed toward diffusion allocation while the export checkpoint still owned VRAM. P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export before the cheap backend validation runs. ``DiffusionBackend.load_model`` already calls the strict ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` helpers AFTER family inference and GGUF filename checks pass, so the GPU is still freed before allocation and a malformed payload no longer silently unloads the user's chat / chat-export pair. P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a post-unload state where ``loading_model_identifier`` is still set, matching the route-level ``_release_llama_for`` strictness. A GGUF download mid-flight before the diffusion handoff used to slip through and end up double-owning VRAM after diffusion allocated. P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload ``status()`` failure as ``after = {}``. Training / chat / export handoffs need proof that the diffusion pipeline released VRAM; the helper now raises HTTP 503 when the verification status call itself raises, so the caller retries. P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion`` raises RuntimeError when ``get_export_backend()`` itself raises. Direct backend callers used to silently ``return`` here and proceed to GPU allocation without being able to verify export ownership. P1 #8: ``/training/start`` releases settled export BEFORE chat, matching the chat-load helpers. If idle export shutdown fails the user's chat model is preserved instead of being dropped for a training run that never starts. P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``, the resolved HF cache path passed to ``transformer_cls.from_single_file()``. Without this an exception like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf`` would leak the operator's filesystem layout through ``last_error`` and ``/images/status``. All 85 diffusion-relevant backend tests pass locally.
P1 #1: ``_release_safetensors_chat_for`` now re-reads ``active_model_name`` and ``loading_models`` after each unload AND runs a final sweep against the initial owned-name set. The previous helper trusted ``unload_model() -> True`` even though the orchestrator can respond ``unloaded`` while still holding weights or a concurrent ``load`` can repopulate the tracker between calls. Per-name and global post-state mismatches now raise HTTP 503 so the caller retries. P1 #2: same post-state guarantee inside ``_release_chat_backend_for_diffusion`` for direct backend callers. ``DiffusionBackend.load_model`` now raises RuntimeError when the safetensors tracker still owns a previously-resident name after the unload, matching the route-level helper. The route layer's existing classifier maps the new wording to HTTP 503. P1 #3: ``DiffusionBackend.load_model`` now preflights the full diffusers repo (or explicit GGUF ``base_repo``) via ``hf_hub_download(filename="model_index.json")`` BEFORE the chat / export unload runs. The GGUF path was already covered by the existing ``hf_hub_download(gguf_filename)`` round-trip; the full-repo path used to skip validation and let a typo / private / gated repo only surface inside ``from_pretrained`` AFTER the user's chat model was already dropped. Local paths are checked structurally (must be a directory containing ``model_index.json``) so we do not network-round-trip for an on-disk miss. Error messages route through ``_display_repo_id`` so an absolute filesystem path does not leak the operator's layout. P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint) now treats ``unload_model() -> False`` AND a leftover state (``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for GGUF, ``active_model_name`` / ``loading_models`` for safetensors) as 503 instead of unconditionally responding ``status="unloaded"``. The UI used to show the model as gone while the backend still owned VRAM. P2 #7: extended the /images/load RuntimeError -> HTTPException marker list with ``still active or loading after unload`` and ``still loading after unload``. Round 18 introduced these exact phrasings on the backend side; without the extension a retryable unload failure was returning HTTP 400 to the user instead of 503. P2 #8: removed the unused ``unsloth_backend = get_inference_backend()`` eager construction in the GGUF chat-load branch. Eager construction made the GGUF-only path needlessly fail or pay startup cost when the safetensors backend was unavailable / lazy; ``_release_safetensors_chat_for`` already handles that case as a no-op. All 85 diffusion-relevant + 98 related backend tests pass locally.
P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)`` now runs for every load mode, including the GGUF-with-auto-base path. Round 19 only preflighted the full repo or an explicit ``base_repo``, so an auto-picked companion that turned out to be gated / private / missing still unloaded the user's chat model before ``from_pretrained`` failed. ``effective_base`` is the same value that feeds every downstream allocation, so preflighting it unconditionally catches all three modes. P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the ``gguf`` package at construction time) is now built up front, inside the same try block that surfaces "Re-run Studio setup". Previously the missing-dependency exception fired AFTER ``_release_other_gpu_owners_for_diffusion`` and ``_release_chat_backend_for_diffusion`` had already taken the chat / export models down. The downstream from_single_file call reuses the same ``quant_config`` reference. P1 #4: ``studio/backend/requirements/studio.txt`` now lists ``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in the extras files, so fresh standard Studio installs failed on /images/load with the round 20 P1 #2 dependency error message. P1 #5: ``LoadRequest``, ``UnloadRequest``, and ``ValidateModelRequest`` now apply the same control-character + embedded-HF-token validators that ``DiffusionLoadRequest`` already had. /api/inference/load, /api/inference/validate, and /api/inference/unload used to accept newline / tab / control characters in ``model_path`` (log-line smuggling) and URL-form ``https://hf_xxxxx@huggingface.co/...`` (credential leak through structured log sinks). P2 #6: ``_collapse_local`` in the diffusion load-error scrubber now resolves relative candidates and adds the absolute form to the substring set. A relative ``exports/my-flux`` used to leak ``/mnt/disks/.../exports/my-flux/...`` via downstream library errors because the scrubber only matched the original literal. Replacement is longest-first so a leaf-only context survives. All 85 diffusion-relevant + 35 related model-validation tests pass locally. (P1 #3 cross-workload GPU handoff lock is deferred: deserves a focused design pass across /images/load, /chat/load (both branches), /training/start, and /export/load to pick a lock boundary that does not deadlock against the backend load locks or stall the SSE log stream.)
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and ``ValidateModelRequest._no_embedded_hf_tokens`` now cover ``gguf_variant`` in addition to ``model_path``. A caller could pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into structured log sinks via the GGUF resolver path; the matching ``DiffusionLoadRequest`` validator already covered every string field, so this restores parity. P1 #3: ``/api/inference/unload`` now also matches the llama ``loading_model_identifier`` when picking the GGUF branch. A pending GGUF download (``is_active`` still False, ``loading_model_identifier`` populated) used to fall through to the safetensors branch and respond ``status="unloaded"`` while llama-server kept downloading. P1 #4 + #5: the final safetensors-handoff sweeps (route-level ``_release_safetensors_chat_for`` and backend ``_release_chat_backend_for_diffusion``) now check ``active_model_name`` and ``loading_models`` WITHOUT the initial ``owned_names`` filter. A concurrent ``/load`` that landed AFTER the snapshot was previously ignored, so a chat model that began loading during the unload window let training / export / GGUF chat / diffusion start anyway and race the new chat for VRAM. P2 #6: added ``_preflight_diffusers_subfolder_config`` and invoked it for GGUF loads with a transformer class (``effective_base``, ``"transformer"``). A custom base companion that had ``model_index.json`` but lacked ``transformer/config.json`` previously passed the round 19 preflight, unloaded chat, then failed inside ``from_single_file``. P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string dict KEYS. Pydantic ``string_type`` errors surface ``input`` verbatim, and a malformed payload like ``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak the token through the 422 response body. All 85 diffusion-relevant + 35 model-validation tests pass locally. Existing fakes for ``hf_hub_download`` updated to accept the new ``subfolder=`` kwarg the round 21 preflight uses. (P1 #3 cross-workload GPU handoff lock from round 20 is still deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee, which closes the most common race without the deadlock risk of holding a process-wide lock across the entire load.)
P1 #1: ``TrainingStartRequest.model_name`` now runs the same control-character and embedded-HF-token validators that the chat and diffusion request models gained in rounds 5 / 15 / 20 / 21. ``/api/training/start`` previously accepted newline / tab / control characters and URL-form ``hf_xxxxx`` tokens that flowed into structured-log sinks via "Loading model %s" lines. P1 #2: ``_run_with_helper`` in ``utils/datasets/llm_assist.py`` now skips the helper GGUF when the diffusion image backend reports loaded / loading. The public chat / training / export routes already do this through ``_release_diffusion_for``, but this dataset-side helper loaded llama-server directly with no diffusion guard, so an Images-page allocation would race the helper for VRAM. New ``_diffusion_image_model_busy`` helper fails closed (treats status() failure as busy) so the resident image model is preserved instead of being overwritten. P1 #3: same ``_diffusion_image_model_busy`` guard added to ``_run_multi_pass_advisor`` (the dataset conversion advisor), which has the same direct llama.cpp load shape. P2 #4: the early "Could not infer a diffusion family" RuntimeError now routes ``repo_id`` through ``_display_repo_id`` before formatting. A local absolute path that did not match any known family used to leak the operator's filesystem layout via the 400 response body, last_error, and log line. All 97 diffusion + training-validation + related tests pass locally.
P1 #1 + #2 + #6: extended the chat / diffusion / training identifier hardening to every export-side request model. ExportCommonOptions (parent of ExportMergedModelRequest / ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies _no_control_chars and _reject_embedded_hf_token to repo_id and base_model_id; ExportGGUFRequest gets the same on its repo_id plus a control-char check on quantization_method; and LoadCheckpointRequest validates checkpoint_path. Previously "/api/export/*" accepted newline-smuggled identifiers and URL-form ``hf_xxxxx`` tokens that flowed into log lines. P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor`` now use a shared ``_gpu_workload_busy_for_helper`` that gates on diffusion (round 22 already), training, AND export. The round 22 guard only checked diffusion, so the dataset helper / advisor could still load llama-server on top of an active training run or a resident export checkpoint. Each step fails closed (unverifiable status counts as busy) so the user's primary workload is preserved. P1 #5: PublishDatasetRequest in models/data_recipe.py also applies the identifier hardening to repo_id; the publish path previously accepted control characters and URL-form tokens. P1 #7-10: added _validate_logged_identifier helper to routes/models.py and applied it to the path / query parameter endpoints that flow into logger.info(...) calls -- ``/config/{model_name}``, ``/check-vision/{model_name}``, ``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped the validator's ValueError to HTTP 422 so the client sees the same shape as a Pydantic validation failure. P2 #11 + #12: ``Loading diffusion model %s`` and ``Diffusion load failed for %s`` log lines route ``repo_id`` / ``effective_base`` through ``_display_repo_id`` (collapses absolute local paths to the leaf, still scrubs HF tokens) instead of plain ``_redact_hf_tokens``. The error path was already collapsed in the user-facing 400 / RuntimeError, but the structured-log lines kept the full path. All 97 diffusion + training-validation + related tests pass locally.
P1 #1: ``_gpu_workload_busy_for_helper`` in ``utils/datasets/llm_assist.py`` now also gates on the GGUF chat backend (llama-server) AND the safetensors chat backend. Round 23 extended it to training + export but missed Chat, so a helper / advisor GGUF could still race a loaded chat model for VRAM. Both checks fail closed when status is unverifiable. P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff unloads so the diffusion release runs BEFORE the chat releases. A wedged diffusion unload used to fire AFTER chat was already gone, so the user lost both on a single failure. Drop chat last so an earlier failure preserves it. Applied to ``/training/start`` (training.py), ``/export/load`` (export.py), ``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch (routes/inference.py). P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens ``model_path`` and ``gguf_variant`` via the shared ``_validate_logged_identifier`` helper, so control characters and URL-form HF tokens can no longer log-line-smuggle. P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and ``variant`` the same way. P1 #9: ``/download-progress`` ``repo_id`` query parameter is also hardened; the value flows into log lines deep inside ``_get_repo_size_cached`` on lookup failure. P1 #11: ``CheckFormatRequest.dataset_name`` and ``AiAssistMappingRequest.{dataset_name, model_name}`` in ``models/datasets.py`` now apply the same control-char + embedded-HF-token validators, matching every other public request-body model. All 115 diffusion + training-validation + cached_gguf + export + inference model-validation tests pass locally. (P1 #6 native-path-lease enforcement for diffusion local paths and P1 #12 React Compiler frontend lint deferred -- both need focused design / frontend touchups separate from this batch.)
| return | ||
| try: | ||
| del obj | ||
| except Exception: |
| # supplied filename (e.g. ``BF16/model.gguf``) is kept | ||
| # separately as ``active_gguf_filename`` for delete | ||
| # guards. | ||
| gguf_basename = Path(self._gguf_path).name if self._gguf_path else None |
| self._cpu_offload_enabled = False | ||
| self._loaded_at = None | ||
| _release(old) | ||
| old = None # noqa: F841 |
| backend = get_inference_backend() | ||
| active_model_name = getattr(backend, "active_model_name", None) | ||
| loading_models = set(getattr(backend, "loading_models", set()) or set()) | ||
| owned_names = {name for name in ({active_model_name} | loading_models) if name} |
|
|
||
| active_model_name = getattr(inf, "active_model_name", None) | ||
| loading_models = set(getattr(inf, "loading_models", set()) or set()) | ||
| owned_names = {name for name in ({active_model_name} | loading_models) if name} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Staging-only PR to validate the Studio diffusion image generation pipeline on Ubuntu, macOS, and Windows GitHub Actions runners before opening the upstream PR.
Backend
Frontend
CI scope (staging only)
Three targeted workflows pinned to ubuntu-latest, macos-14, and windows-latest. Each scoped via paths to studio/backend/** so unrelated changes do not re-trigger them. The inherited workflow bloat from the staging fork was cleared so we stay under the 5-Windows-runner cap.
Not for merge upstream from this fork. Upstream PR will follow once the runners go green.