Skip to content

fix(proxy): bound HF tokenizer load and offload token counting off event loop#1738

Open
Parideboy wants to merge 1 commit into
headroomlabs-ai:mainfrom
Parideboy:fix/1701-hf-tokenizer-event-loop
Open

fix(proxy): bound HF tokenizer load and offload token counting off event loop#1738
Parideboy wants to merge 1 commit into
headroomlabs-ai:mainfrom
Parideboy:fix/1701-hf-tokenizer-event-loop

Conversation

@Parideboy

@Parideboy Parideboy commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #1701.

On Windows, headroom proxy --anthropic-api-url https://api.deepseek.com/anthropic froze: the first /v1/messages request took ~610s (optimization_latency_ms=609972) with only router/lifecycle markers, and afterwards the whole server was a zombie — /livez, /readyz and /health hung until the process was killed. HEADROOM_DETECT_BACKEND=python was already set, so this was not the #575/#845 native-detect deadlock.

Root cause: DeepSeek model names route to the HuggingFace tokenizer backend (MODEL_PATTERNS in headroom/tokenizers/registry.py). HuggingFaceTokenizer loads lazily, so the registry's construction-time fallback never fires; the first count_messages calls AutoTokenizer.from_pretrained(..., trust_remote_code=True) — unbounded network downloads/retries — and this ran synchronously inside the async Anthropic messages handler (get_tokenizer(model) + tokenizer.count_messages(messages)), outside the 30s _run_compression_in_executor bound. huggingface_hub retry chains on a restricted network easily reach ~10 minutes, blocking the entire asyncio event loop; subsequent on-loop counting kept it pinned. tiktoken got a bounded eager load for the same bug class long ago (#956); the HF backend never did.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)

Changes Made

  • headroom/tokenizers/huggingface.py: _load_tokenizer now tries the local HF cache first (local_files_only=True, no network), then bounds the network load with HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS (default 10s; 0 disables network loads) on a daemon thread. Timeouts/failures return None (cached by lru_cache, so the hub is probed at most once per process per tokenizer) and count_messages fails open to char-based estimation via the existing _use_fallback() path.
  • headroom/proxy/handlers/anthropic.py: new AnthropicHandlerMixin._count_tokens_offloaded(model, messages) runs get_tokenizer + count_messages on the compression executor bounded by COMPRESSION_TIMEOUT_SECONDS, failing open to EstimatingTokenCounter (downgrade logged once per model). Used in handle_anthropic_messages (the issue's hot path, both count sites) and handle_anthropic_batch_create; the batch path's inline anthropic_pipeline.apply() is now offloaded via _run_compression_in_executor (mirrors the perf(proxy): offload image compression off event loop #1612 image-compression offload).
  • headroom/proxy/handlers/batch.py: the two remaining inline openai_pipeline.apply() calls (handle_google_batch_create, _compress_batch_jsonl) are offloaded the same way; existing except blocks keep the pass-through fail-open semantics.
  • Tests: tests/test_huggingface_tokenizer_timeout.py (cache-first, bounded timeout, failure caching, timeout=0, fail-open estimation), tests/test_tokenizer_count_offload.py (wiring guards, runs on headroom-compress worker, event loop stays responsive during slow tokenizer work, fail-open), plus _run_compression_in_executor stub on the batch test double.

Testing

  • All existing tests pass
  • Added new tests for the changes
  • Manual testing performed
$ python -m pytest tests/test_huggingface_tokenizer_timeout.py tests/test_tokenizer_count_offload.py tests/test_image_compression_offload.py tests/test_gemini_compression_offload.py tests/test_tokenizers tests/test_proxy_handlers_batch.py -q
50 passed

$ ruff check .          # No issues found
$ ruff format --check . # 1043 files already formatted
$ mypy headroom --ignore-missing-imports  # 0 errors

Real Behavior Proof

  • Environment: Windows 11 Pro (10.0.26200), Python 3.13, local checkout of this branch with the Rust core built.
  • Exact command / steps: python -m pytest tests/test_tokenizer_count_offload.py -q — includes test_count_tokens_offloaded_keeps_loop_responsive, which reproduces the issue's mechanism: a tokenizer whose count_messages blocks (stand-in for the unbounded AutoTokenizer.from_pretrained network load) while an asyncio ticker measures event-loop liveness. Also python -m pytest tests/test_huggingface_tokenizer_timeout.py -q with a from_pretrained stub that sleeps 60s and HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS=0.2.
  • Observed result: with the fix, the slow count runs on a headroom-compress worker thread and the loop keeps ticking (ticks >= 5; inline it yields ~0 — the zombie). The 60s-hung HF load unblocks at the 0.2s timeout, falls back to estimation, and the second call returns instantly (failure cached, no re-probe). All 10 new tests pass.
  • Not tested: live reproduction against api.deepseek.com from a network where HF hub downloads stall (the reporter's exact environment); actual HF vocab download timing on a healthy network.

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

…ent loop

DeepSeek (and other HF-backed) model names route to HuggingFaceTokenizer,
whose first count_messages lazily calls AutoTokenizer.from_pretrained with
unbounded network downloads/retries — executed synchronously inside the
async Anthropic messages handler. On a restricted network this blocked the
event loop for ~10 minutes (optimization_latency_ms ~610000) and left the
whole proxy unresponsive (/livez, /readyz, /health hung until kill).

- huggingface: try local_files_only first; bound the network load with
  HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS (default 10s, 0 disables) on a
  daemon thread; cache failures so the hub is probed once per process.
- proxy: new _count_tokens_offloaded runs get_tokenizer + count_messages
  on the compression executor bounded by COMPRESSION_TIMEOUT_SECONDS,
  failing open to estimation; used in the Anthropic messages and batch
  paths instead of inline calls.
- batch handlers: offload the remaining inline pipeline.apply() calls
  via _run_compression_in_executor (same fail-open semantics).

Fixes headroomlabs-ai#1701

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

PR governance

This PR follows the template and is marked ready for human review.

@github-actions github-actions Bot added status: needs author action Pull request body or readiness checklist still needs author updates status: ready for review Pull request body is complete and the author marked it ready for human review and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: ready for review Pull request body is complete and the author marked it ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]

1 participant