Skip to content

Studio: local diffusion image generation (CI validation)#88

Open
danielhanchen wants to merge 59 commits into
mainfrom
studio-diffusion-images-staging
Open

Studio: local diffusion image generation (CI validation)#88
danielhanchen wants to merge 59 commits into
mainfrom
studio-diffusion-images-staging

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging-only PR to validate the Studio diffusion image generation pipeline on Ubuntu, macOS, and Windows GitHub Actions runners before opening the upstream PR.

Backend

  • core/inference/diffusion.py - DiffusionBackend that loads diffusion GGUFs from Hugging Face via diffusers.GGUFQuantizationConfig and runs them on the active CUDA / MPS / CPU device. Supports FLUX.2, FLUX.2 klein, FLUX.1, Qwen-Image, Stable Diffusion 3, and SDXL.
  • routes/inference.py - POST /api/inference/images/load, POST /api/inference/images/generate, POST /api/inference/images/unload, GET /api/inference/images/status mirroring the existing llama-server lifecycle.
  • models/inference.py - DiffusionLoadRequest / DiffusionGenerateRequest / DiffusionGenerateResponse schemas with prompt and size validation up front.
  • requirements/no-torch-runtime.txt - pin gguf alongside the existing diffusers entry so GGUFQuantizationConfig works out of the box.
  • tests/test_diffusion_backend.py + tests/test_diffusion_routes.py - 27 CPU-only unit tests covering family detection, lifecycle, validation, and the full FastAPI round trip with diffusers stubbed.

Frontend

  • features/images/ - standalone Images page with curated FLUX.2 picker, HF token, prompt + negative prompt, resolution presets, steps + guidance sliders, seed input, and a result gallery rendering base64 PNGs inline.
  • app/routes/images.tsx - lazy /images route wired into router.tsx.
  • components/app-sidebar.tsx - PaintBrush02Icon nav item between Recipes and Export.

CI scope (staging only)

Three targeted workflows pinned to ubuntu-latest, macos-14, and windows-latest. Each scoped via paths to studio/backend/** so unrelated changes do not re-trigger them. The inherited workflow bloat from the staging fork was cleared so we stay under the 5-Windows-runner cap.

Not for merge upstream from this fork. Upstream PR will follow once the runners go green.

danielhanchen and others added 3 commits May 24, 2026 14:26
Backend
- core/inference/diffusion.py: DiffusionBackend singleton that loads
  diffusion GGUFs from Hugging Face via diffusers.GGUFQuantizationConfig
  and runs them on the active CUDA / MPS / CPU device. Supports FLUX.2,
  FLUX.2 klein, FLUX.1, Qwen-Image, Stable Diffusion 3, and SDXL.
- routes/inference.py: POST /api/inference/images/load,
  POST /api/inference/images/generate, POST /api/inference/images/unload,
  GET /api/inference/images/status mirroring the llama-server lifecycle.
- models/inference.py: DiffusionLoadRequest, DiffusionGenerateRequest,
  DiffusionGenerateResponse pydantic schemas with prompt / step / size
  validation up front so callers get clear 422s rather than VAE crashes.
- requirements/no-torch-runtime.txt: pin gguf alongside the existing
  diffusers entry so GGUFQuantizationConfig works out of the box.
- tests/test_diffusion_backend.py + tests/test_diffusion_routes.py:
  27 unit tests covering family detection, validation, lifecycle, and
  the full FastAPI round trip with the backend stubbed. No torch /
  diffusers / GPU required to run.

Frontend
- features/images/: standalone images-page.tsx with curated model picker
  (FLUX.2 klein 4B / 9B, FLUX.2 dev, FLUX.1 dev), HF token field,
  prompt + negative prompt, resolution presets, steps + guidance
  sliders, seed input, and a result gallery that renders base64 PNGs
  inline.
- app/routes/images.tsx: lazy /images route wired into router.tsx.
- components/app-sidebar.tsx: PaintBrush02Icon nav item between
  Recipes and Export, hidden in chat-only mode.
SectionCard requires an icon prop. Pass GpuIcon, PaintBrush02Icon,
and SparklesIcon for the three sections so tsc -b stops failing on
TS2741 'Property icon is missing'.
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/llama_cpp.py Fixed
Comment thread studio/backend/core/training/trainer.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
danielhanchen and others added 4 commits May 24, 2026 23:40
Backend
- Fix FLUX.2 klein family default base_repo: black-forest-labs/FLUX.2-klein
  does not exist on the Hub. Point at the Apache 2.0 4B Base instead so
  the from_pretrained call works out of the box for ungated users.
- Serialise concurrent load_model calls with a dedicated _load_lock so
  two /images/load requests cannot both reach pipeline_cls.from_pretrained
  at the same time (would double-spend VRAM and corrupt _pipe).
- When the caller passes a full diffusers repo (no gguf_filename),
  use repo_id directly instead of silently substituting the family
  default. Closes the load-the-wrong-model regression flagged by review.
- Drop negative_prompt from the pipeline call when the loaded pipeline
  does not accept it (FLUX.2 / FLUX.2 klein). Inspect __call__ via
  inspect.signature so we do not maintain a manual class list.
- Best-effort unload the chat backend (llama-server) before a diffusion
  load so a 24 GB consumer GPU can swap between chat and diffusion
  without manual unload steps.

Frontend
- Replace the four curated entries with the actual filenames published
  on the Hub (lowercase flux-2-klein-Nb-Q4_K_S.gguf and flux2-dev*).
- Add an explicit base_repo per curated entry so the backend never
  falls back to the family default for the curated picker.
- Add the Apache 2.0 FLUX.2 klein base 4B entry so first-time users
  have an ungated, no-token-required default.
- Hide the negative prompt field for FLUX.2 / FLUX.2 klein and show a
  small explanatory note instead.

Tests
- Add 6 new backend tests: base_repo override, full-repo (no GGUF)
  no-substitution, concurrent serialise race, signature-based kwarg
  filter, negative_prompt strip on FLUX.2, negative_prompt preserved
  on supporting pipelines. 33 tests passing.
- unload_model now takes _load_lock so it cannot race with an in-flight
  load_model and have the load thread overwrite cleared state after
  unload returned is_loaded=false.
- Move stable-diffusion-xl out of _FAMILIES into _FULL_REPO_FAMILIES.
  SDXL uses a UNet (no transformer GGUF path is wired); listing it in
  the GGUF families panel was misleading. SDXL full-repo loads still
  work via family_override='stable-diffusion-xl'.
- Result gallery now uses h-auto + object-contain so portrait /
  landscape outputs render at their true aspect ratio instead of
  being cropped into a square thumbnail.
_release_chat_backend_for_diffusion now unloads both the GGUF
chat backend (llama-server) and the safetensors / HF chat backend
(get_inference_backend) before a diffusion load. Mirror the
behaviour on the chat-load side: both the Unsloth/transformers
load path and the GGUF load path now unload the diffusion pipeline
before claiming GPU memory. Closes the OOM-on-swap path flagged
by reviewers in both directions.
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
danielhanchen and others added 5 commits May 25, 2026 00:04
- _smart_base_repo: pick 9B base for unsloth/FLUX.2-klein-9B-GGUF
  and -base- variants per the repo id, instead of always falling
  back to the 4B family default.
- pipe_kwargs use_safetensors=True so diffusers refuses pickle .bin
  weights at load time (defends against compromised base_repo).
- Release the previous pipeline BEFORE allocating the new one so
  peak VRAM stays at one model's worth instead of two on swap.
- Reject empty gguf_filename when repo_id ends with -GGUF; the prior
  behavior tried from_pretrained on a GGUF-only repo and 500'd deep
  in diffusers with a confusing model-index error.
- Status returns gguf_filename (basename) instead of gguf_path so
  the local cache path / username does not leak to authenticated
  Studio sessions.
- requirements/no-torch-runtime.txt: pin diffusers>=0.37.0 so older
  installs cannot resolve a version without Flux2KleinPipeline.
- Frontend curated distilled klein entries now point at the
  matching non-base diffusers repos (FLUX.2-klein-4B / -9B) per
  the published model cards. Update api.ts to mirror the renamed
  status field.
- routes/export.py load_checkpoint now unloads the diffusion
  pipeline alongside the existing inference + training unloads, so
  an export load after Images does not OOM the export subprocess.
- Remove the 'sd3.5' alias from the stable-diffusion-3 family.
  SD3.5 needs its own family + base_repo (and its own smoke test);
  pairing it with the SD3 Medium base produced a misleading load.
_release_chat_backend_for_diffusion was importing
get_inference_backend from core.inference.inference (the in-subprocess
class) and calling unload_model() without the required model_name
argument. The TypeError was swallowed and the active chat model
stayed resident, defeating the chat-to-diffusion lifecycle handoff.

Switch to the orchestrator's accessor at core.inference and pass
active_model_name through, mirroring the GGUF chat-load path. Add a
regression test that stubs both backends and verifies unload_model
is called with the active model name.
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
danielhanchen and others added 7 commits May 25, 2026 00:31
- detect_family adds _FAMILY_EXCLUDE so 'stable-diffusion-3.5' no
  longer matches the SD3 Medium family and 'qwen-image-edit' no
  longer matches Qwen-Image. Both were misleading silent loads.
- from_single_file now forwards config=<effective_base>,
  subfolder='transformer', and the HF token. Diffusers-format GGUFs
  (FLUX.2 klein, Qwen-Image, SD3) need the matching base config or
  the transformer load picks the wrong shapes; gated GGUFs need the
  token both for download and config read.
- Move _release_chat_backend_for_diffusion + new
  _release_other_gpu_owners_for_diffusion to AFTER the GGUF download
  and pipeline class lookup so a typo or transient Hub error does
  not kill the user's currently-loaded chat model. Peak VRAM still
  stays at one model's worth because the releases run right before
  from_pretrained.
- _release_other_gpu_owners_for_diffusion: shut down the export
  subprocess and any active training subprocess before a diffusion
  load. Symmetric with the export load path.
- routes/training.py: unload diffusion before starting training so
  the new subprocess does not race FLUX/Qwen for VRAM.
- routes/export.py: also unload the GGUF llama-server before export
  load (the existing inference-backend unload only covered the
  safetensors path).
…othai#5754

When a swap load fails after the previous pipeline is released,
status() previously reported is_loaded=false on top of the OLD
repo/family/base_repo metadata, which the frontend then rendered
as a misleading 'still loaded: X' label. Clear all metadata
atomically with the pipe drop so a failed swap reports a clean
empty status plus last_error. Add regression test.
… precision for PR unslothai#5754

- DiffusionBackend.status() now takes _lock so frontend polling
  cannot observe a torn snapshot mid-swap.
- Scrub hf_token / pipe_kwargs / single_file_kwargs from frame
  locals before logger.exception() so rich tracebacks and structlog
  formatters that render locals do not leak hf_... tokens into logs.
- routes/models.py delete_cached_repo: refuse to delete the cache
  underlying a currently-loaded diffusion pipeline (both the GGUF
  repo and the matching diffusers base_repo). Symmetric with the
  existing chat-load + GGUF guard.
- Frontend seed validation: reject non-integer and out-of-safe-
  integer-range inputs instead of silently rounding via Number(),
  which would otherwise send a different seed than what the user
  typed.
…e prompt (PR unslothai#5754)

QwenImagePipeline and FluxPipeline treat guidance_scale as the
distilled CFG factor and expose true_cfg_scale as the real
classifier-free guidance knob. Negative prompts only steer the
output when true_cfg_scale > 1, so forwarding only guidance_scale
left Qwen-Image on the default true_cfg_scale=4.0 and the user's
slider value silently ineffective for negative prompts.

When the loaded pipeline accepts both negative_prompt and
true_cfg_scale and the caller supplies a non-empty negative
prompt, forward guidance_scale through both kwargs so the
negative prompt actually steers generation. When no negative
prompt is supplied, true_cfg_scale is left at the model default
to avoid switching distilled CFG models into real-CFG mode (which
would double inference cost and degrade quality).

Adds two regression tests covering the forward-when-negative and
skip-when-no-negative paths.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from 2329753 to fb0a31b Compare May 25, 2026 00:48
…slothai#5754

The Studio backend's no-torch-runtime.txt is installed via
pip --no-deps so the diffusion stack's transitive imports must
be pinned explicitly. huggingface_hub's blob downloader (used by
diffusers.GGUFQuantizationConfig and by every from_single_file
call) imports requests + urllib3 + charset_normalizer at module
load time; a fresh --no-deps install would 500 on the first
/api/inference/images/load with PackageNotFoundError: 'requests'.

Adds requests, urllib3, and charset_normalizer to the
transitive-deps block next to the existing httpx chain.
if not raw or raw == "[DONE]":
continue
try:
parsed = json.loads(raw)
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/tests/test_diffusion_backend.py Fixed
danielhanchen and others added 2 commits May 25, 2026 01:05
…nslothai#5754

Round 5 reviewer findings, mostly symmetric-lifecycle and input
validation gaps the earlier rounds left open.

Backend lifecycle (P1)
  * routes/training.py: training start now also unloads the GGUF
    llama-server subprocess; was previously only unloading the
    safetensors backend, so starting training while a GGUF chat
    model was loaded kept the subprocess pinned to VRAM.
  * routes/inference.py: new _raise_if_training_active helper. Both
    GGUF and standard chat loads, plus /api/inference/images/load,
    now refuse with HTTP 409 when training is active instead of
    silently stopping training to free VRAM.
  * core/inference/diffusion.py: _release_other_gpu_owners_for_
    diffusion no longer stops active training. The route layer
    refuses the request first, so reaching the helper with training
    live would only happen from programmatic backend calls; better
    to surface OOM than terminate a long training run.
  * core/inference/diffusion.py: BF16 dtype is now gated on
    torch.cuda.is_bf16_supported. Pascal/Turing GPUs report
    is_available()=True but lack BF16 ALUs; FLUX kernels then fail
    inside from_pretrained. Falls back to FP16 instead of refusing.
  * core/inference/diffusion.py: GGUF transformer allocation and
    pipeline allocation now run AFTER releasing chat/export GPU
    owners; previously from_single_file ran first and could OOM
    before the intended VRAM handoff happened.
  * routes/models.py: /delete-cached now also blocks delete when
    diffusion is_loading=True (not just is_loaded); concurrent
    delete during hf_hub_download / from_single_file would have
    raced the rmtree.
  * routes/models.py: /delete-finetuned now also checks the
    diffusion backend before unlinking a Studio outputs/exports
    path. A user who exported a FLUX LoRA locally and loaded it via
    /images/load could previously rmtree the directory the
    diffusion backend was reading from.

Backend correctness / safety (P2)
  * core/inference/diffusion.py: _FAMILY_EXCLUDE for qwen-image now
    also covers qwen_image_edit / qwenimageedit underscore spellings
    so '...qwen_image_edit-GGUF' no longer misdetects as Qwen-Image.
  * core/inference/diffusion.py: detect_family now scans
    _FULL_REPO_FAMILIES in addition to _FAMILIES, so SDXL repos
    (stabilityai/stable-diffusion-xl-base-1.0) are auto-detected
    instead of failing with 'Could not infer a diffusion family'.
  * core/inference/diffusion.py: generate_image now uses a separate
    _generate_lock for the pipeline forward instead of holding
    _lock for the whole call. status() polls and concurrent unload
    requests no longer block for the full minutes-long generation.
  * routes/models.py: diffusion delete guard now uses exact repo-id
    match instead of prefix match; previously loading 'org/model-v2'
    would block deleting unrelated cached 'org/model'.
  * models/inference.py: DiffusionLoadRequest now rejects ASCII
    control characters in repo_id / gguf_filename / base_repo /
    family via field_validator (closes log-injection surface from
    authenticated callers). Also caps lengths at 256 chars.
  * models/inference.py: DiffusionGenerateRequest seed is now
    bounded to the int64/uint64 range; previously a huge seed
    (e.g. 2**100) passed Pydantic then crashed inside
    torch.Generator.manual_seed with 'Overflow when unpacking long
    long'.

Frontend (P2)
  * features/images/images-page.tsx: Custom HF repo panel now
    exposes a Pipeline family override dropdown; previously the
    backend supported it via DiffusionLoadRequest.family but the UI
    had no way to send it, so custom repos whose names did not
    contain a hard-coded substring failed to load.
  * features/images/images-page.tsx: handleLoad now re-fetches
    status on error. The backend clears its old pipeline before
    allocating the replacement; a failed swap previously left the
    UI showing 'Loaded:' with Generate enabled until manual
    refresh.

Tests (10 new)
  * underscore qwen-image-edit exclusion + SDXL full-repo detection
  * BF16 fallback when is_bf16_supported() returns False
  * status() does not block while generate_image holds _generate_lock
  * route layer rejects control chars in repo_id
  * route layer rejects 2**100 seeds (uint64-max boundary accepted)
  * route layer happy-path with negative-prompt true_cfg_scale
    forwarding (Qwen/Flux) and skip-when-no-neg (distilled CFG)
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/tests/test_diffusion_backend.py Fixed
… PR unslothai#5754

Round 6 reviewers identified several races between load / unload /
generate and several fail-open delete guards. This commit closes
them by widening the lock scope, publishing the pending load
target through status(), and switching delete guards to
fail-closed.

Lifecycle (P1)
  * core/inference/diffusion.py: load_model now also takes
    _generate_lock. Previous behavior released and reallocated the
    pipeline while a generation forward was still iterating
    denoising steps, corrupting scheduler state and stacking VRAM.
    The forward only briefly touches _lock, so taking it on the
    load path does not introduce a deadlock.
  * core/inference/diffusion.py: unload_model now also takes
    _generate_lock. Without it, /images/unload returned
    is_loaded=False while a slow forward was still running, which
    let chat / training / export handoffs allocate VRAM on top of
    the still-resident pipeline.
  * core/inference/diffusion.py: previous pipeline release now
    happens BEFORE from_single_file / from_pretrained. Switching
    FLUX.2 klein 4B -> 9B on a 16-24 GB GPU was failing because
    the new transformer allocation overlapped the old pipe's
    residency.
  * core/inference/diffusion.py: failed pipeline from_pretrained
    now explicitly releases the just-loaded transformer; previously
    its weights stayed pinned to GPU until GC and made the next
    load more likely to OOM.

Pending-target / delete guards (P1)
  * core/inference/diffusion.py: load_model now publishes
    _pending_repo_id / _pending_base_repo / _pending_gguf_filename
    under _lock at the start of the call (and refreshes
    _pending_base_repo when the smart-base / repo defaults resolve).
    status() exposes those as 'repo_id' / 'base_repo' /
    'gguf_filename' during is_loading=True so delete guards can see
    the target before _repo_id is set on success.
  * routes/models.py /delete-cached + /delete-finetuned: diffusion
    status check now fails CLOSED (HTTP 503) when status() raises.
    Both guards previously logged and continued, which could let a
    delete proceed against a repo whose status was unverifiable.
  * routes/models.py: is_loading is also blocked on both guards
    so a mid-download / mid-from_pretrained rmtree is refused.

Symmetric handoffs (P1)
  * routes/export.py: /load-checkpoint now refuses with HTTP 409
    when training is active instead of calling stop_training().
    Chat and /images/load did the same after round 5; export was
    the remaining asymmetry that would silently kill a long
    training run.
  * routes/training.py, routes/inference.py (GGUF and standard
    chat), routes/export.py: diffusion handoff now treats
    is_loading as is_loaded. The diffusion backend's unload waits
    on _load_lock + _generate_lock so an in-flight load completes
    first.

Requirements (P1)
  * requirements/studio.txt: pin python-multipart explicitly. The
    Studio routes package's eager router imports include
    routes/datasets.py whose FastAPI UploadFile/File validation
    crashes with RuntimeError without it in fresh test envs.

Frontend (P2)
  * features/images/api.ts + images-page.tsx: seed handling now
    accepts the full [-2^63, 2^64 - 1] range via BigInt. The
    previous safe-integer cap rejected valid uint64 seeds the
    backend accepts. A small stringify helper emits BigInts as JSON
    integers without touching the rest of the payload.

Tests
  * test_diffusion_routes.py: load routes/inference.py via
    importlib.spec_from_file_location to avoid triggering
    routes/__init__.py (which would pull in training / datasets /
    data_recipe imports unrelated to diffusion tests).
  * test_diffusion_backend.py: status() during is_loading shows
    pending repo + base; unload waits for in-flight generation.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from 0c78013 to 04de106 Compare May 25, 2026 01:28
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from acad7c6 to b7207e3 Compare May 25, 2026 06:34
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Round 15 reviewer aggregate (logs/review_round15_aggregate.md):

P1 fixes:
- core/inference/llama_cpp.py publishes loading_model_identifier +
  loading_hf_variant AFTER acquiring _serial_load_lock; previously
  a queued second load could overwrite or clear the identifier
  currently in flight, breaking delete-safety and GPU handoff guards.
- routes/models.py /delete-finetuned compares the pending llama
  load against loading_hf_variant (new), not the stale hf_variant
  from the previous loaded model. Without this, a Q4-loaded
  directory loading Q8 would still accept a Q8 delete.
- core/inference/diffusion.py _release_other_gpu_owners_for_diffusion
  now also raises when training is active so direct backend callers
  cannot bypass the route layer's 409 guard. Mirrors the
  export-active check the same helper already enforces.
- routes/models.py /delete-cached diffusion guard compares owned
  diffusion paths against the HF cache root for the target repo
  via _all_hf_cache_scans + _is_path_under. Without this, loading
  from a local models--owner--model/snapshots/<sha> path let the
  cache delete proceed while the snapshot was still mmap'd.
- models/inference.py DiffusionLoadRequest refuses URL-embedded
  hf_xxxxx tokens in repo_id / base_repo at the API boundary, so
  the value never reaches self._repo_id and status() can never
  echo it back to other authenticated sessions.

P2 fixes:
- core/inference/diffusion.py status() routes UI-facing repo_id /
  base_repo through _display_repo_id, which collapses absolute
  local paths to the leaf name (delete guards still see the full
  path via active_*/pending_*).
- routes/inference.py /images/load maps backend RuntimeError that
  reports an export/training conflict to HTTP 409 instead of 400.
- core/inference/diffusion.py detect_family now uses token-boundary
  matching so owner/flux.20-model does not collide with flux.2.

P3 fixes:
- tests/test_diffusion_routes.py drops the partial routes.inference
  module from sys.modules if exec_module() raises, so the real
  ImportError surfaces instead of a misleading AttributeError on
  follow-up tests.

Tests:
- 5 new regression cases (display_repo_id, token-boundary family
  detection, training-active raise from backend helper, embedded HF
  token rejection).
- All 72 diffusion backend + route tests pass.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from b7207e3 to 4c75b61 Compare May 25, 2026 07:00
Comment thread studio/backend/core/inference/diffusion.py Fixed
for holding ``_serial_load_lock`` and for publishing /
clearing ``_loading_model_identifier`` + ``_loading_hf_variant``
in the surrounding try/finally."""
if True:
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
danielhanchen and others added 2 commits May 25, 2026 07:15
…lothai#5754

Round 15 split LlamaCppBackend.load_model into a thin wrapper that
publishes _loading_model_identifier + _loading_hf_variant under
_serial_load_lock and an inner _load_model_impl_locked body that
actually launches llama-server. The pre-existing source-inspection
regression tests inspected only load_model and broke because the
flag literals and _wait_for_vram_settle call now live in the inner
method:

- tests/test_llama_cpp_no_context_shift.py
  test_no_context_shift_is_in_load_model
  test_flag_sits_inside_the_base_cmd_list
- tests/test_llama_cpp_wait_for_vram_settle.py
  test_load_model_calls_helper_outside_lock_and_uses_last_kill_timestamp

Update both helpers to concatenate the source of load_model AND
_load_model_impl_locked so the assertions still cover the launch
path without weakening their scope to the full module.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from 4c75b61 to aa24f21 Compare May 25, 2026 07:15
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
danielhanchen and others added 2 commits May 25, 2026 07:31
Round 16 reviewer aggregate (logs/review_round16_aggregate.md):

P1 fixes:
- routes/models.py /delete-cached llama guard pairs loading_id with
  loading_hf_variant so deleting a different cached quant (Q8_0)
  while another variant (Q4_K_M) is loading is no longer blocked.
- core/inference/diffusion.py load_model now calls
  _release_other_gpu_owners_for_diffusion BEFORE
  _release_chat_backend_for_diffusion. The other-owners helper
  RAISES on active training/export, so a route -> worker race or
  direct backend caller no longer drops the user's chat model
  before the diffusion load is refused.
- routes/models.py /delete-cached diffusion guard fails CLOSED
  (503) on HF cache scan failure instead of silently falling
  through to repo-id-only matching, which could miss a loaded
  local snapshot path.
- routes/inference.py _release_llama_for and
  _release_safetensors_chat_for now raise 503 on actual unload
  failure (exception or False return), so new GPU workloads do
  not start while the old chat process still owns VRAM.
- core/inference/diffusion.py status() now takes
  include_internal=False by default and only exposes the
  guard-facing active_*/pending_* paths when callers opt in. The
  public /api/inference/images/status route gets the redacted
  payload; routes/models.py delete guards pass
  include_internal=True so they still see the raw paths.
- core/inference/diffusion.py generate_image_with_metadata routes
  the response model through _display_repo_id so /images/generate
  cannot echo back an absolute local path.

P2 fixes:
- routes/inference.py /images/load now maps backend "Could not
  verify training/export status" to 503 instead of 409, matching
  the route-level pre-check.
- core/inference/diffusion.py _release_other_gpu_owners_for_diffusion
  raises "Could not verify export status" when the
  is_export_active() probe itself raises, instead of silently
  treating it as active export.
- core/inference/diffusion.py detect_family compares compact family
  spellings (Flux2Klein) against per-token compact strings so
  unsloth/Flux2Klein-GGUF matches the flux.2-klein family without
  matching the embedded substring inside flux.20.
- main.py installs a RequestValidationError handler that scrubs
  hf_xxxxx tokens out of the 422 response body so a rejected
  ``repo_id`` containing a URL-embedded HF token does not echo it
  back to the browser.

Tests:
- 3 new regression cases (Flux2Klein compact alias, public status
  redaction, generate_image_with_metadata redaction).
- All 75 diffusion backend + route tests pass.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from aa24f21 to 3c6a47d Compare May 25, 2026 07:32
Two diffusion tests broke on the Windows runner after round 16:

- test_display_repo_id_collapses_absolute_path used hardcoded
  POSIX absolute paths; Windows reads /home/... as drive-
  relative so Path.is_absolute() returns False. Use pytest's
  tmp_path so the path is platform-correct.
- test_load_publishes_pending_target_during_loading regressed
  because round 16 moved _release_other_gpu_owners_for_diffusion
  ahead of the chat unload. That helper imports core.training and
  core.export; on Windows CI the import resolved to a real but
  partially configured backend, which raised inside the new
  status-verification path and aborted the load before
  from_pretrained ran. Stub both modules with idle backends in
  _install_fake_diffusers.

Also updated test_public_status_does_not_leak_local_path_via
_active_fields and test_generate_image_with_metadata_redacts_
local_path to use tmp_path for the same Windows reason.
@danielhanchen danielhanchen force-pushed the studio-diffusion-images-staging branch from 3c6a47d to b841562 Compare May 25, 2026 07:36
danielhanchen and others added 3 commits May 25, 2026 08:15
P1: route-layer chat/diffusion/export releases were still
asymmetric. Training start and export load called
``diff_backend.unload_model`` inside a best-effort try/except so a
wedged diffusion backend let the next workload allocate over the
top of the resident pipeline and OOM. Both now use the strict
``_release_diffusion_for`` helper from routes.inference, which
raises HTTPException 503 on status/unload failure or post-check
mismatch.

P2 #9: diffusion load exceptions can include the absolute local
repo / base / gguf path verbatim (FileNotFoundError, OSError from
diffusers / safetensors). The path flows into ``_last_error``,
which ``status()`` returns to every authenticated session. Collapse
the known repo_id / effective_base / gguf_filename paths to their
leaf name before storing the error, mirroring the
``_display_repo_id`` convention used for the public repo label.

P2 #10: when ``repo_id`` is an absolute local path,
``detect_family`` matched _FAMILY_EXCLUDE deny lists against the
full path, so models stored under a parent directory containing
``qwen-image-edit`` or ``3.5`` were misclassified as None. Reduce
the family-detection needle to the leaf directory when the input
looks like a filesystem path; Hub-style ``owner/repo`` ids
continue to use the original needle so existing detection rules
keep working.

P2 #12: ``gguf_filename`` was missing from the
``_reject_embedded_hf_token`` validator. A URL-form quant path
like ``https://hf_xxxxx@huggingface.co/.../flux.gguf`` would be
stored on ``DiffusionBackend._gguf_filename`` and surface in
status() / log lines. Extend the validator to gguf_filename so the
token is dropped before it can leak.

All 85 diffusion-relevant backend tests pass locally.
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/tests/test_diffusion_backend.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
P1 #1: ``_release_safetensors_chat_for`` now re-reads
``active_model_name`` and ``loading_models`` after each unload AND
runs a final sweep against the initial owned-name set. The previous
helper trusted ``unload_model() -> True`` even though the
orchestrator can respond ``unloaded`` while still holding weights
or a concurrent ``load`` can repopulate the tracker between calls.
Per-name and global post-state mismatches now raise HTTP 503 so
the caller retries.

P1 #2: same post-state guarantee inside
``_release_chat_backend_for_diffusion`` for direct backend
callers. ``DiffusionBackend.load_model`` now raises RuntimeError
when the safetensors tracker still owns a previously-resident
name after the unload, matching the route-level helper. The route
layer's existing classifier maps the new wording to HTTP 503.

P1 #3: ``DiffusionBackend.load_model`` now preflights the full
diffusers repo (or explicit GGUF ``base_repo``) via
``hf_hub_download(filename="model_index.json")`` BEFORE the
chat / export unload runs. The GGUF path was already covered by
the existing ``hf_hub_download(gguf_filename)`` round-trip; the
full-repo path used to skip validation and let a typo / private /
gated repo only surface inside ``from_pretrained`` AFTER the
user's chat model was already dropped. Local paths are checked
structurally (must be a directory containing ``model_index.json``)
so we do not network-round-trip for an on-disk miss. Error
messages route through ``_display_repo_id`` so an absolute
filesystem path does not leak the operator's layout.

P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint)
now treats ``unload_model() -> False`` AND a leftover state
(``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for
GGUF, ``active_model_name`` / ``loading_models`` for safetensors)
as 503 instead of unconditionally responding
``status="unloaded"``. The UI used to show the model as gone while
the backend still owned VRAM.

P2 #7: extended the /images/load RuntimeError -> HTTPException
marker list with ``still active or loading after unload`` and
``still loading after unload``. Round 18 introduced these exact
phrasings on the backend side; without the extension a retryable
unload failure was returning HTTP 400 to the user instead of 503.

P2 #8: removed the unused ``unsloth_backend = get_inference_backend()``
eager construction in the GGUF chat-load branch. Eager
construction made the GGUF-only path needlessly fail or pay
startup cost when the safetensors backend was unavailable / lazy;
``_release_safetensors_chat_for`` already handles that case as a
no-op.

All 85 diffusion-relevant + 98 related backend tests pass locally.
P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)``
now runs for every load mode, including the GGUF-with-auto-base
path. Round 19 only preflighted the full repo or an explicit
``base_repo``, so an auto-picked companion that turned out to be
gated / private / missing still unloaded the user's chat model
before ``from_pretrained`` failed. ``effective_base`` is the same
value that feeds every downstream allocation, so preflighting it
unconditionally catches all three modes.

P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the
``gguf`` package at construction time) is now built up front,
inside the same try block that surfaces "Re-run Studio setup".
Previously the missing-dependency exception fired AFTER
``_release_other_gpu_owners_for_diffusion`` and
``_release_chat_backend_for_diffusion`` had already taken the
chat / export models down. The downstream from_single_file call
reuses the same ``quant_config`` reference.

P1 #4: ``studio/backend/requirements/studio.txt`` now lists
``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in
the extras files, so fresh standard Studio installs failed on
/images/load with the round 20 P1 #2 dependency error message.

P1 #5: ``LoadRequest``, ``UnloadRequest``, and
``ValidateModelRequest`` now apply the same control-character +
embedded-HF-token validators that ``DiffusionLoadRequest``
already had. /api/inference/load, /api/inference/validate, and
/api/inference/unload used to accept newline / tab / control
characters in ``model_path`` (log-line smuggling) and URL-form
``https://hf_xxxxx@huggingface.co/...`` (credential leak through
structured log sinks).

P2 #6: ``_collapse_local`` in the diffusion load-error scrubber
now resolves relative candidates and adds the absolute form to
the substring set. A relative ``exports/my-flux`` used to leak
``/mnt/disks/.../exports/my-flux/...`` via downstream library
errors because the scrubber only matched the original literal.
Replacement is longest-first so a leaf-only context survives.

All 85 diffusion-relevant + 35 related model-validation tests
pass locally.

(P1 #3 cross-workload GPU handoff lock is deferred: deserves a
focused design pass across /images/load, /chat/load (both
branches), /training/start, and /export/load to pick a lock
boundary that does not deadlock against the backend load locks
or stall the SSE log stream.)
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
Comment thread studio/backend/core/inference/diffusion.py Fixed
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and
``ValidateModelRequest._no_embedded_hf_tokens`` now cover
``gguf_variant`` in addition to ``model_path``. A caller could
pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into
structured log sinks via the GGUF resolver path; the matching
``DiffusionLoadRequest`` validator already covered every string
field, so this restores parity.

P1 #3: ``/api/inference/unload`` now also matches the llama
``loading_model_identifier`` when picking the GGUF branch. A
pending GGUF download (``is_active`` still False,
``loading_model_identifier`` populated) used to fall through to
the safetensors branch and respond ``status="unloaded"`` while
llama-server kept downloading.

P1 #4 + #5: the final safetensors-handoff sweeps (route-level
``_release_safetensors_chat_for`` and backend
``_release_chat_backend_for_diffusion``) now check ``active_model_name``
and ``loading_models`` WITHOUT the initial ``owned_names`` filter.
A concurrent ``/load`` that landed AFTER the snapshot was
previously ignored, so a chat model that began loading during the
unload window let training / export / GGUF chat / diffusion start
anyway and race the new chat for VRAM.

P2 #6: added ``_preflight_diffusers_subfolder_config`` and
invoked it for GGUF loads with a transformer class
(``effective_base``, ``"transformer"``). A custom base companion
that had ``model_index.json`` but lacked
``transformer/config.json`` previously passed the round 19
preflight, unloaded chat, then failed inside
``from_single_file``.

P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string
dict KEYS. Pydantic ``string_type`` errors surface ``input``
verbatim, and a malformed payload like
``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak
the token through the 422 response body.

All 85 diffusion-relevant + 35 model-validation tests pass
locally. Existing fakes for ``hf_hub_download`` updated to
accept the new ``subfolder=`` kwarg the round 21 preflight uses.

(P1 #3 cross-workload GPU handoff lock from round 20 is still
deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee,
which closes the most common race without the deadlock risk of
holding a process-wide lock across the entire load.)
P1 #1: ``TrainingStartRequest.model_name`` now runs the same
control-character and embedded-HF-token validators that the chat
and diffusion request models gained in rounds 5 / 15 / 20 / 21.
``/api/training/start`` previously accepted newline / tab /
control characters and URL-form ``hf_xxxxx`` tokens that flowed
into structured-log sinks via "Loading model %s" lines.

P1 #2: ``_run_with_helper`` in ``utils/datasets/llm_assist.py``
now skips the helper GGUF when the diffusion image backend
reports loaded / loading. The public chat / training / export
routes already do this through ``_release_diffusion_for``, but
this dataset-side helper loaded llama-server directly with no
diffusion guard, so an Images-page allocation would race the
helper for VRAM. New ``_diffusion_image_model_busy`` helper
fails closed (treats status() failure as busy) so the resident
image model is preserved instead of being overwritten.

P1 #3: same ``_diffusion_image_model_busy`` guard added to
``_run_multi_pass_advisor`` (the dataset conversion advisor),
which has the same direct llama.cpp load shape.

P2 #4: the early "Could not infer a diffusion family" RuntimeError
now routes ``repo_id`` through ``_display_repo_id`` before
formatting. A local absolute path that did not match any known
family used to leak the operator's filesystem layout via the 400
response body, last_error, and log line.

All 97 diffusion + training-validation + related tests pass
locally.
P1 #1 + #2 + #6: extended the chat / diffusion / training
identifier hardening to every export-side request model.
ExportCommonOptions (parent of ExportMergedModelRequest /
ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies
_no_control_chars and _reject_embedded_hf_token to repo_id and
base_model_id; ExportGGUFRequest gets the same on its repo_id
plus a control-char check on quantization_method; and
LoadCheckpointRequest validates checkpoint_path. Previously
"/api/export/*" accepted newline-smuggled identifiers and
URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor``
now use a shared ``_gpu_workload_busy_for_helper`` that gates on
diffusion (round 22 already), training, AND export. The round 22
guard only checked diffusion, so the dataset helper / advisor
could still load llama-server on top of an active training run
or a resident export checkpoint. Each step fails closed
(unverifiable status counts as busy) so the user's primary
workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also
applies the identifier hardening to repo_id; the publish path
previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to
routes/models.py and applied it to the path / query parameter
endpoints that flow into logger.info(...) calls --
``/config/{model_name}``, ``/check-vision/{model_name}``,
``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped
the validator's ValueError to HTTP 422 so the client sees the
same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and
``Diffusion load failed for %s`` log lines route ``repo_id`` /
``effective_base`` through ``_display_repo_id`` (collapses
absolute local paths to the leaf, still scrubs HF tokens)
instead of plain ``_redact_hf_tokens``. The error path was
already collapsed in the user-facing 400 / RuntimeError, but
the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass
locally.
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
return
try:
del obj
except Exception:
# supplied filename (e.g. ``BF16/model.gguf``) is kept
# separately as ``active_gguf_filename`` for delete
# guards.
gguf_basename = Path(self._gguf_path).name if self._gguf_path else None
self._cpu_offload_enabled = False
self._loaded_at = None
_release(old)
old = None # noqa: F841
backend = get_inference_backend()
active_model_name = getattr(backend, "active_model_name", None)
loading_models = set(getattr(backend, "loading_models", set()) or set())
owned_names = {name for name in ({active_model_name} | loading_models) if name}

active_model_name = getattr(inf, "active_model_name", None)
loading_models = set(getattr(inf, "loading_models", set()) or set())
owned_names = {name for name in ({active_model_name} | loading_models) if name}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant