fix(content-detector): detect and compress space-separated JSON objects by rohanprichard · Pull Request #1742 · headroomlabs-ai/headroom

rohanprichard · 2026-07-03T11:04:43Z

Description

Headroom's detect_content_type() only recognizes content starting with [ as a JSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects instead of a real array:

{"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."} {"title": "Result 3", "url": "..."}

That shape is detected as PLAIN_TEXT (confidence 0.5), so SmartCrusher never processes it and web-search results compress 0% — exactly the high-volume, repetitive output that most benefits from compression.

Closes #1741

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

content_detector.py: _try_detect_json now recognizes a run of ≥2 whitespace-separated (space- or newline-separated) JSON objects and returns JSON_ARRAY with metadata["concatenated"] = True. The router already falls back to the Python regex detector when the native detector returns PLAIN_TEXT (content_router.py), so this fixes routing on the default backend too.
content_detector.py: added normalize_concatenated_json() (and a _decode_concatenated_json() helper) that rewrites the space-separated shape into a canonical [{…}, {…}] array string.
smart_crusher.py: SmartCrusher.crush() normalizes concatenated JSON to a real array before handing it to the Rust crusher, so it actually compresses.
The change is deliberately conservative: a single object stays unclaimed (_try_detect_json('{"id": 1}') → None), and any non-JSON token between objects disqualifies the run. Existing [-array detection is unchanged.
Added tests and a CHANGELOG entry.

Testing

Unit tests pass (pytest) — affected suites (full suite has network-dependent ML tests that can't run offline; see note)
Linting passes (ruff check .)
Type checking passes (mypy headroom) — not run (mypy not installed in my env; change is fully type-annotated)
New tests added for new functionality
Manual testing performed

Test Output

$ ruff check .
All checks passed!

$ pytest tests/test_transforms_content_detection.py -q
............                                                             [100%]
12 passed

$ pytest tests/test_transforms_content_router.py \
         tests/test_smart_crusher_toin_attachment.py \
         tests/test_transforms_tabular.py -q
96 passed, 2 skipped
# + SmartCrusher passthrough tests in test_text_compressors.py: 2 passed

Real Behavior Proof

Environment: macOS, Python 3.11, headroom-ai==0.29.0 installed from PyPI (ships the prebuilt Rust _core), patched with this change. Default detection backend (native Rust → Python-regex fallback on PLAIN_TEXT).
Exact command / steps: ran a 100-object space-separated web_search payload through detect_content_type() and ContentRouter().compress(), before and after the patch (repro below).
Observed result: detection flips PLAIN_TEXT (conf 0.5) → JSON_ARRAY (conf 1.0) and SmartCrusher compression goes from 0.0% to 34.2% (10369 → 6819 bytes) on the identical payload.
Not tested: the native Rust detector path in isolation (the fix relies on the existing documented Python-regex fallback for PLAIN_TEXT); separators other than whitespace (comma-separated-without-brackets is intentionally not claimed).

Before:

detected  : ContentType.PLAIN_TEXT  conf 0.5
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 10369
reduction : 0.0%

After:

detected  : ContentType.JSON_ARRAY  conf 1.0
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 6819
reduction : 34.2%

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (CHANGELOG)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Additional Notes

Kept as a draft pending a maintainer's read and my own re-run of the repro on a clean install before flipping "ready for human review".
mypy headroom was not run (mypy isn't in my local env); the new code is fully type-annotated and ruff check . is clean repo-wide.

Web search tools (SerpAPI, Tavily, custom backends) commonly return back-to-back JSON objects separated by whitespace ({...} {...} {...}) rather than a JSON array. detect_content_type only treated input starting with [ as JSON_ARRAY, so this shape fell through to PLAIN_TEXT and SmartCrusher passed it through at 0% compression. The detector now recognizes a run of >=2 whitespace-separated JSON objects as JSON_ARRAY, and SmartCrusher normalizes that shape to a real array before crushing. Measured ~34% byte reduction on a 100-result web_search payload that previously compressed 0%. Closes headroomlabs-ai#1741

github-actions · 2026-07-03T11:04:55Z

PR governance

This draft PR follows the template so far. Keep it in draft until it is ready for human review.

github-actions Bot added status: needs author action Pull request body or readiness checklist still needs author updates and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jul 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(content-detector): detect and compress space-separated JSON objects#1742

fix(content-detector): detect and compress space-separated JSON objects#1742
rohanprichard wants to merge 1 commit into
headroomlabs-ai:mainfrom
rohanprichard:fix/content-detector-space-separated-json

rohanprichard commented Jul 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rohanprichard commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Changes Made

Testing

Test Output

Real Behavior Proof

Review Readiness

Checklist

Additional Notes

Uh oh!

github-actions Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR governance

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rohanprichard commented Jul 3, 2026 •

edited

Loading

github-actions Bot commented Jul 3, 2026 •

edited

Loading