fix(proxy): bound HF tokenizer load and offload token counting off event loop#1738
Open
Parideboy wants to merge 1 commit into
Open
fix(proxy): bound HF tokenizer load and offload token counting off event loop#1738Parideboy wants to merge 1 commit into
Parideboy wants to merge 1 commit into
Conversation
…ent loop DeepSeek (and other HF-backed) model names route to HuggingFaceTokenizer, whose first count_messages lazily calls AutoTokenizer.from_pretrained with unbounded network downloads/retries — executed synchronously inside the async Anthropic messages handler. On a restricted network this blocked the event loop for ~10 minutes (optimization_latency_ms ~610000) and left the whole proxy unresponsive (/livez, /readyz, /health hung until kill). - huggingface: try local_files_only first; bound the network load with HEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS (default 10s, 0 disables) on a daemon thread; cache failures so the hub is probed once per process. - proxy: new _count_tokens_offloaded runs get_tokenizer + count_messages on the compression executor bounded by COMPRESSION_TIMEOUT_SECONDS, failing open to estimation; used in the Anthropic messages and batch paths instead of inline calls. - batch handlers: offload the remaining inline pipeline.apply() calls via _run_compression_in_executor (same fail-open semantics). Fixes headroomlabs-ai#1701 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
PR governanceThis PR follows the template and is marked ready for human review. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes #1701.
On Windows,
headroom proxy --anthropic-api-url https://api.deepseek.com/anthropicfroze: the first/v1/messagesrequest took ~610s (optimization_latency_ms=609972) with only router/lifecycle markers, and afterwards the whole server was a zombie —/livez,/readyzand/healthhung until the process was killed.HEADROOM_DETECT_BACKEND=pythonwas already set, so this was not the #575/#845 native-detect deadlock.Root cause: DeepSeek model names route to the HuggingFace tokenizer backend (
MODEL_PATTERNSinheadroom/tokenizers/registry.py).HuggingFaceTokenizerloads lazily, so the registry's construction-time fallback never fires; the firstcount_messagescallsAutoTokenizer.from_pretrained(..., trust_remote_code=True)— unbounded network downloads/retries — and this ran synchronously inside the async Anthropic messages handler (get_tokenizer(model)+tokenizer.count_messages(messages)), outside the 30s_run_compression_in_executorbound. huggingface_hub retry chains on a restricted network easily reach ~10 minutes, blocking the entire asyncio event loop; subsequent on-loop counting kept it pinned. tiktoken got a bounded eager load for the same bug class long ago (#956); the HF backend never did.Type of Change
Changes Made
headroom/tokenizers/huggingface.py:_load_tokenizernow tries the local HF cache first (local_files_only=True, no network), then bounds the network load withHEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS(default 10s;0disables network loads) on a daemon thread. Timeouts/failures returnNone(cached bylru_cache, so the hub is probed at most once per process per tokenizer) andcount_messagesfails open to char-based estimation via the existing_use_fallback()path.headroom/proxy/handlers/anthropic.py: newAnthropicHandlerMixin._count_tokens_offloaded(model, messages)runsget_tokenizer+count_messageson the compression executor bounded byCOMPRESSION_TIMEOUT_SECONDS, failing open toEstimatingTokenCounter(downgrade logged once per model). Used inhandle_anthropic_messages(the issue's hot path, both count sites) andhandle_anthropic_batch_create; the batch path's inlineanthropic_pipeline.apply()is now offloaded via_run_compression_in_executor(mirrors the perf(proxy): offload image compression off event loop #1612 image-compression offload).headroom/proxy/handlers/batch.py: the two remaining inlineopenai_pipeline.apply()calls (handle_google_batch_create,_compress_batch_jsonl) are offloaded the same way; existingexceptblocks keep the pass-through fail-open semantics.tests/test_huggingface_tokenizer_timeout.py(cache-first, bounded timeout, failure caching, timeout=0, fail-open estimation),tests/test_tokenizer_count_offload.py(wiring guards, runs onheadroom-compressworker, event loop stays responsive during slow tokenizer work, fail-open), plus_run_compression_in_executorstub on the batch test double.Testing
Real Behavior Proof
python -m pytest tests/test_tokenizer_count_offload.py -q— includestest_count_tokens_offloaded_keeps_loop_responsive, which reproduces the issue's mechanism: a tokenizer whosecount_messagesblocks (stand-in for the unboundedAutoTokenizer.from_pretrainednetwork load) while an asyncio ticker measures event-loop liveness. Alsopython -m pytest tests/test_huggingface_tokenizer_timeout.py -qwith afrom_pretrainedstub that sleeps 60s andHEADROOM_HF_TOKENIZER_LOAD_TIMEOUT_SECS=0.2.headroom-compressworker thread and the loop keeps ticking (ticks >= 5; inline it yields ~0 — the zombie). The 60s-hung HF load unblocks at the 0.2s timeout, falls back to estimation, and the second call returns instantly (failure cached, no re-probe). All 10 new tests pass.api.deepseek.comfrom a network where HF hub downloads stall (the reporter's exact environment); actual HF vocab download timing on a healthy network.Review Readiness