fix(content-detector): detect and compress space-separated JSON objects #1742
Draft
rohanprichard wants to merge 1 commit into
Web search tools (SerpAPI, Tavily, custom backends) commonly return
back-to-back JSON objects separated by whitespace ({...} {...} {...})
rather than a JSON array. detect_content_type only treated input
starting with [ as JSON_ARRAY, so this shape fell through to PLAIN_TEXT
and SmartCrusher passed it through at 0% compression.
The detector now recognizes a run of >=2 whitespace-separated JSON
objects as JSON_ARRAY, and SmartCrusher normalizes that shape to a real
array before crushing. Measured ~34% byte reduction on a 100-result
web_search payload that previously compressed 0%.
Closes headroomlabs-ai#1741
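The rule described in the commit message (a run of >=2 whitespace-separated objects, normalized to a real array) can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `split_object_run` and `to_array_string` are hypothetical names, and the sketch assumes that only JSON whitespace (space, tab, CR, LF) may separate objects.

```python
import json
from typing import Any, Optional

JSON_WHITESPACE = " \t\r\n"


def split_object_run(text: str) -> Optional[list[dict[str, Any]]]:
    """Return the objects in a run of >=2 whitespace-separated JSON objects, else None.

    Hypothetical stand-in for the detection rule; not Headroom's implementation.
    """
    decoder = json.JSONDecoder()
    objects: list[dict[str, Any]] = []
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos] in JSON_WHITESPACE:
            pos += 1
        if pos == end:
            break
        # Every item in the run must be a JSON object.
        if text[pos] != "{":
            return None
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return None
        # Anything other than whitespace right after an object disqualifies the run.
        if pos < end and text[pos] not in JSON_WHITESPACE:
            return None
        objects.append(obj)
    return objects if len(objects) >= 2 else None


def to_array_string(text: str) -> str:
    """Rewrite a run of objects as a JSON array string; return other input unchanged."""
    objects = split_object_run(text)
    if objects is None:
        return text
    return json.dumps(objects, ensure_ascii=False)
```

In this sketch a single object, a comma-separated run without brackets, and a run with stray tokens between objects all return None, so they would be left to whatever handling they had before.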
PR governance
This draft PR follows the template so far. Keep it in draft until it is ready for human review.
Description
Headroom's detect_content_type() only recognizes content starting with [ as a JSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects instead of a real array:
{"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."} {"title": "Result 3", "url": "..."}
That shape is detected as PLAIN_TEXT (confidence 0.5), so SmartCrusher never processes it and web-search results compress 0%. This is exactly the high-volume, repetitive output that benefits most from compression.
Closes #1741
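A short check makes the problem concrete: the payload above is neither a single valid JSON document nor something that begins with [. The sketch below uses only the standard library; `parses_as_single_json` is a hypothetical helper, and the "starts with [" test merely paraphrases the behavior described above rather than reproducing Headroom's detector.

```python
import json

# The example payload from the description: three objects separated by spaces.
payload = (
    '{"title": "Result 1", "url": "..."} '
    '{"title": "Result 2", "url": "..."} '
    '{"title": "Result 3", "url": "..."}'
)


def parses_as_single_json(text: str) -> bool:
    """True if the whole string is exactly one JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        # json.loads rejects anything left over after the first value.
        return False
    return True


# Paraphrase of the pre-patch rule: only input starting with "[" counts as an array.
starts_like_array = payload.lstrip().startswith("[")

print(parses_as_single_json(payload), starts_like_array)
```

Both checks come out false for this payload, which is consistent with it falling through to plain text before the patch.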
Type of Change
Changes Made
- content_detector.py: _try_detect_json now recognizes a run of ≥2 whitespace-separated (space- or newline-separated) JSON objects and returns JSON_ARRAY with metadata["concatenated"] = True. The router already falls back to the Python regex detector when the native detector returns PLAIN_TEXT (content_router.py), so this fixes routing on the default backend too.
- content_detector.py: added normalize_concatenated_json() (and a _decode_concatenated_json() helper) that rewrites the space-separated shape into a canonical [{…}, {…}] array string.
- smart_crusher.py: SmartCrusher.crush() normalizes concatenated JSON to a real array before handing it to the Rust crusher, so it actually compresses.
- A single JSON object is not claimed (_try_detect_json('{"id": 1}') → None), and any non-JSON token between objects disqualifies the run. Existing [-array detection is unchanged.
Testing
- pytest: affected suites (full suite has network-dependent ML tests that can't run offline; see note)
- ruff check .
- mypy headroom: not run (mypy not installed in my env; change is fully type-annotated)
Test Output
Real Behavior Proof
- headroom-ai==0.29.0 installed from PyPI (ships the prebuilt Rust _core), patched with this change. Default detection backend (native Rust → Python-regex fallback on PLAIN_TEXT).
- Ran a 100-result web_search payload through detect_content_type() and ContentRouter().compress(), before and after the patch (repro below).
- Detection changes from PLAIN_TEXT (conf 0.5) → JSON_ARRAY (conf 1.0), and SmartCrusher compression goes from 0.0% to 34.2% (10369 → 6819 bytes) on the identical payload.
- Not covered: single objects (still PLAIN_TEXT); separators other than whitespace (comma-separated-without-brackets is intentionally not claimed).
Before:
After:
Review Readiness
Checklist
Additional Notes
mypy headroom was not run (mypy isn't in my local env); the new code is fully type-annotated and ruff check . is clean repo-wide.