fix(content-detector): detect and compress space-separated JSON objects #1742
Open
rohanprichard wants to merge 1 commit into
PR governance: This PR follows the template and is marked ready for human review.
Web search tools (SerpAPI, Tavily, custom backends) commonly return
back-to-back JSON objects separated by whitespace ({...} {...} {...})
rather than a JSON array. detect_content_type only treated input
starting with [ as JSON_ARRAY, so this shape fell through to PLAIN_TEXT
and SmartCrusher passed it through at 0% compression.
The detector now recognizes a run of >=2 whitespace-separated JSON
objects as JSON_ARRAY, and SmartCrusher normalizes that shape to a real
array before crushing. Measured ~34% byte reduction on a 100-result
web_search payload that previously compressed 0%.
Closes headroomlabs-ai#1741
a216a05 to 35b973a
Description
Headroom's `detect_content_type()` only recognizes content starting with `[` as a JSON array. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects instead of a real array, as follows:

    {"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."} {"title": "Result 3", "url": "..."}

That shape is detected as `PLAIN_TEXT` (confidence 0.5), so SmartCrusher never processes it and web-search results compress 0%.

Closes #1741
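For readers unfamiliar with the shape, the following is a minimal sketch of how a whitespace-separated run of JSON objects could be recognized and rewritten as a real array using only the standard library's `json.JSONDecoder.raw_decode`. It is not the code in this PR: the function names `decode_concatenated_json_sketch` and `normalize_sketch` are hypothetical, and the rules it applies (at least two objects, whitespace between them, nothing else allowed) are an assumption based on the behaviour described in this PR.

```python
import json
from typing import Any, Optional

# Whitespace characters permitted between JSON tokens (RFC 8259).
JSON_WS = " \t\r\n"


def decode_concatenated_json_sketch(text: str) -> Optional[list[dict[str, Any]]]:
    """Return the objects of a whitespace-separated run, or None if text is not one.

    Hypothetical helper for illustration only.
    """
    decoder = json.JSONDecoder()
    objects: list[dict[str, Any]] = []
    pos, end = 0, len(text)
    while True:
        start = pos
        # Skip whitespace before the next object (or at the end of the input).
        while pos < end and text[pos] in JSON_WS:
            pos += 1
        if pos == end:
            break
        # Each token must open an object, and every object after the first
        # must be preceded by at least one whitespace character.
        if text[pos] != "{" or (objects and pos == start):
            return None
        try:
            # raw_decode parses one JSON value and returns it with its end index.
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return None
        objects.append(value)
    # A single object is ordinary JSON, not a concatenated run.
    return objects if len(objects) >= 2 else None


def normalize_sketch(text: str) -> str:
    """Rewrite a concatenated run as a JSON array string; leave other input untouched."""
    objects = decode_concatenated_json_sketch(text)
    return text if objects is None else json.dumps(objects)
```

With this sketch, `normalize_sketch('{"title": "Result 1"} {"title": "Result 2"}')` yields a string that parses as a two-element array, while a lone object, a comma-separated run without brackets, or a run interrupted by a non-JSON token is returned unchanged.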
Type of Change
Changes Made
- `content_detector.py`: `_try_detect_json` now recognizes a run of ≥2 whitespace-separated (space- or newline-separated) JSON objects and returns `JSON_ARRAY` with `metadata["concatenated"] = True`. The router already falls back to the Python regex detector when the native detector returns `PLAIN_TEXT` (`content_router.py`), so this fixes routing on the default backend too.
- `content_detector.py`: added `normalize_concatenated_json()` (and a `_decode_concatenated_json()` helper) that rewrites the space-separated shape into a canonical `[{…}, {…}]` array string.
- `smart_crusher.py`: `SmartCrusher.crush()` normalizes concatenated JSON to a real array before handing it to the Rust crusher, so it actually compresses.
- A single object is not claimed (`_try_detect_json('{"id": 1}')` → `None`), and any non-JSON token between objects disqualifies the run. Existing `[`-array detection is unchanged.

Testing
- `pytest`: affected suites (full suite has network-dependent ML tests that can't run offline; see note)
- `ruff check .`
- `mypy headroom`

Test Output
Real Behavior Proof
- … (`uv pip install -e .`) with the Rust `_core` compiled locally; default detection backend (native Rust → Python-regex fallback on `PLAIN_TEXT`).
- … `web_search` payload through `detect_content_type()` and `ContentRouter().compress()`, before and after the patch (repro below).
- … `PLAIN_TEXT` (conf 0.5) → `JSON_ARRAY` (conf 1.0) and SmartCrusher compression goes from 0.0% to 34.2% (10369 → 6819 bytes) on the identical payload.
- … `PLAIN_TEXT`); separators other than whitespace (comma-separated-without-brackets is intentionally not claimed).

Before:
After:
Review Readiness
Checklist
Additional Notes