fix(proxy): bound HF tokenizer load and offload token counting off event loop by Parideboy · Pull Request #1738 · headroomlabs-ai/headroom

Parideboy · 2026-07-03T08:54:27Z

Description

On Windows, headroom proxy --anthropic-api-url https://api.deepseek.com/anthropic froze. The first /v1/messages request took ~610s (optimization_latency_ms=609972) and logged only router/lifecycle markers; afterwards the whole server was a zombie: /livez, /readyz and /health hung until the process was killed. HEADROOM_DETECT_BACKEND=python was already set, so this was not the #575/#845 native-detect deadlock.

Root cause: DeepSeek model names route to the HuggingFace tokenizer backend (MODEL_PATTERNS in headroom/tokenizers/registry.py). HuggingFaceTokenizer loads lazily, so the registry's construction-time fallback never fires. Instead, the first count_messages call invokes AutoTokenizer.from_pretrained(..., trust_remote_code=True), which performs unbounded network downloads and retries. That call ran synchronously inside the async Anthropic messages handler (get_tokenizer(model) + tokenizer.count_messages(messages)), outside the 30s _run_compression_in_executor bound. On a restricted network, huggingface_hub retry chains easily reach ~10 minutes, blocking the entire asyncio event loop for that long; subsequent on-loop counting kept it pinned. tiktoken got a bounded eager load for the same bug class long ago (#956); the HF backend never did.
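
The starvation mechanism described above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: blocking_count, ticker, and handler_inline are hypothetical names, and time.sleep stands in for the slow synchronous tokenizer load. It shows that a synchronous call made directly inside a coroutine gives a concurrent task no chance to run.

```python
import asyncio
import time


def blocking_count(messages):
    # Hypothetical stand-in for a tokenizer whose first call does a slow,
    # synchronous load before it can count anything.
    time.sleep(0.3)
    return sum(len(m) for m in messages)


async def ticker(ticks):
    # Records a timestamp every 10 ms, but only while the event loop is free.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def handler_inline(messages):
    # Anti-pattern: the blocking call runs on the event loop thread.
    return blocking_count(messages)


async def measure():
    ticks = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # let the ticker record its first tick
    before = len(ticks)
    result = await handler_inline(["hello", "world"])
    during = len(ticks) - before  # ticks recorded while the handler ran
    task.cancel()
    return result, during


result, ticks_during_call = asyncio.run(measure())
print(result, ticks_during_call)
```

While handler_inline runs, the ticker records nothing, which is the same effect that made the health endpoints hang in the report.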

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)

Changes Made

headroom/tokenizers/huggingface.py: _load_tokenizer now tries the local HF cache first (local_files_only=True, no network), then bounds the network load with HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS (default 10s; 0 disables network loads) on a daemon thread. Timeouts/failures return None (cached by lru_cache, so the hub is probed at most once per process per tokenizer) and count_messages fails open to char-based estimation via the existing _use_fallback() path.
headroom/proxy/handlers/anthropic.py: new AnthropicHandlerMixin._count_tokens_offloaded(model, messages) runs get_tokenizer + count_messages on the compression executor bounded by COMPRESSION_TIMEOUT_SECONDS, failing open to EstimatingTokenCounter (downgrade logged once per model). Used in handle_anthropic_messages (the issue's hot path, both count sites) and handle_anthropic_batch_create; the batch path's inline anthropic_pipeline.apply() is now offloaded via _run_compression_in_executor (mirrors the image-compression offload in #1612, "perf(proxy): offload image compression off event loop").
headroom/proxy/handlers/batch.py: the two remaining inline openai_pipeline.apply() calls (handle_google_batch_create, _compress_batch_jsonl) are offloaded the same way; existing except blocks keep the pass-through fail-open semantics.
Tests: tests/test_huggingface_tokenizer_timeout.py (cache-first, bounded timeout, failure caching, timeout=0, fail-open estimation), tests/test_tokenizer_count_offload.py (wiring guards, runs on headroom-compress worker, event loop stays responsive during slow tokenizer work, fail-open), plus _run_compression_in_executor stub on the batch test double.
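
The bounded load with failure caching listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's _load_tokenizer: only the environment variable name and its 10s default come from the description. slow_network_load, load_with_timeout, load_tokenizer, count_tokens, and the divide-by-4 estimate are hypothetical, and the local-cache-first step is omitted.

```python
import functools
import os
import threading
import time

TIMEOUT_ENV = "HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS"  # name taken from the PR


def slow_network_load(name):
    # Hypothetical stand-in for a network-backed load that stalls.
    time.sleep(3)
    return f"tokenizer:{name}"


def load_with_timeout(loader, name, timeout):
    """Run loader(name) on a daemon thread; return None on timeout or error."""
    if timeout <= 0:
        return None  # a timeout of 0 disables network loads
    box = {}

    def worker():
        try:
            box["value"] = loader(name)
        except Exception:
            box["value"] = None

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # If the thread is still running it is abandoned, not killed.
    return box.get("value")


@functools.lru_cache(maxsize=None)
def load_tokenizer(name):
    # lru_cache also caches a None result, so a failed load is not retried.
    timeout = float(os.environ.get(TIMEOUT_ENV, "10"))
    return load_with_timeout(slow_network_load, name, timeout)


def count_tokens(name, text):
    tokenizer = load_tokenizer(name)
    if tokenizer is None:
        # Fail open to a char-based estimate (divisor is an assumption).
        return max(1, len(text) // 4)
    return len(text.split())


os.environ[TIMEOUT_ENV] = "0.2"
start = time.monotonic()
first = count_tokens("demo-model", "a" * 40)
elapsed_first = time.monotonic() - start
start = time.monotonic()
second = count_tokens("demo-model", "a" * 40)
elapsed_second = time.monotonic() - start
print(first, second)
```

The first call waits for the configured timeout and then falls back to the estimate; the second call hits the cached failure and returns without starting another thread.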

Testing

All existing tests pass
Added new tests for the changes
Manual testing performed

$ python -m pytest tests/test_huggingface_tokenizer_timeout.py tests/test_tokenizer_count_offload.py tests/test_image_compression_offload.py tests/test_gemini_compression_offload.py tests/test_tokenizers tests/test_proxy_handlers_batch.py -q
50 passed

$ ruff check .          # No issues found
$ ruff format --check . # 1043 files already formatted
$ mypy headroom --ignore-missing-imports  # 0 errors

Real Behavior Proof

Environment: Windows 11 Pro (10.0.26200), Python 3.13, local checkout of this branch with the Rust core built.
Exact command / steps: python -m pytest tests/test_tokenizer_count_offload.py -q — includes test_count_tokens_offloaded_keeps_loop_responsive, which reproduces the issue's mechanism: a tokenizer whose count_messages blocks (stand-in for the unbounded AutoTokenizer.from_pretrained network load) while an asyncio ticker measures event-loop liveness. Also python -m pytest tests/test_huggingface_tokenizer_timeout.py -q with a from_pretrained stub that sleeps 60s and HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS=0.2.
Observed result: with the fix, the slow count runs on a headroom-compress worker thread and the loop keeps ticking (ticks >= 5; inline it yields ~0 — the zombie). The 60s-hung HF load unblocks at the 0.2s timeout, falls back to estimation, and the second call returns instantly (failure cached, no re-probe). All 10 new tests pass.
Not tested: live reproduction against api.deepseek.com from a network where HF hub downloads stall (the reporter's exact environment); actual HF vocab download timing on a healthy network.
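
The offload-and-ticker mechanism exercised by these tests can be approximated as below. This is a hedged sketch rather than the PR's _count_tokens_offloaded or its test: EXECUTOR, COUNT_TIMEOUT_SECONDS, blocking_count, estimate_tokens, and run_with_ticker are hypothetical, and only the headroom-compress thread name prefix is borrowed from the description.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool and timeout, chosen only for this illustration.
EXECUTOR = ThreadPoolExecutor(max_workers=2, thread_name_prefix="headroom-compress")
COUNT_TIMEOUT_SECONDS = 0.5


def estimate_tokens(messages):
    # Char-based fallback; the divisor of 4 is an assumption for this sketch.
    return sum(len(m) for m in messages) // 4


def blocking_count(messages, delay):
    time.sleep(delay)  # stand-in for a blocking tokenizer load or count
    return sum(len(m.split()) for m in messages)


async def count_tokens_offloaded(messages, delay):
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(EXECUTOR, blocking_count, messages, delay)
    try:
        return await asyncio.wait_for(future, timeout=COUNT_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        # Fail open: estimate instead of stalling the request.
        return estimate_tokens(messages)


async def run_with_ticker(messages, delay):
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(ticker())
    count = await count_tokens_offloaded(messages, delay)
    task.cancel()
    return count, ticks


messages = ["tokenizer fallback", "estimation"]
fast_result, fast_ticks = asyncio.run(run_with_ticker(messages, 0.2))
timed_out_result, slow_ticks = asyncio.run(run_with_ticker(messages, 1.0))
EXECUTOR.shutdown(wait=True)
print(fast_result, timed_out_result)
```

In both runs the ticker keeps advancing because the blocking work happens on a worker thread. In the second run the count exceeds the timeout, so the estimate is returned while the abandoned worker finishes in the background.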

Review Readiness

I have performed a self-review
This PR is ready for human review

…ent loop

DeepSeek (and other HF-backed) model names route to HuggingFaceTokenizer, whose first count_messages lazily calls AutoTokenizer.from_pretrained with unbounded network downloads/retries, executed synchronously inside the async Anthropic messages handler. On a restricted network this blocked the event loop for ~10 minutes (optimization_latency_ms ~610000) and left the whole proxy unresponsive (/livez, /readyz, /health hung until kill).

- huggingface: try local_files_only first; bound the network load with HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS (default 10s, 0 disables) on a daemon thread; cache failures so the hub is probed once per process.
- proxy: new _count_tokens_offloaded runs get_tokenizer + count_messages on the compression executor bounded by COMPRESSION_TIMEOUT_SECONDS, failing open to estimation; used in the Anthropic messages and batch paths instead of inline calls.
- batch handlers: offload the remaining inline pipeline.apply() calls via _run_compression_in_executor (same fail-open semantics).

Fixes headroomlabs-ai#1701

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-07-03T08:54:41Z

PR governance

This PR follows the template and is marked ready for human review.

Parideboy requested review from DevanshiVyas, JerrettDavis and chopratejas as code owners July 3, 2026 08:54

Parideboy wants to merge 1 commit into headroomlabs-ai:main from Parideboy:fix/1701-hf-tokenizer-event-loop
