
fix(content-detector): detect and compress space-separated JSON objects#1742

Draft
rohanprichard wants to merge 1 commit into headroomlabs-ai:main from rohanprichard:fix/content-detector-space-separated-json

rohanprichard commented Jul 3, 2026

Description

Headroom's detect_content_type() only recognizes content starting with [ as a JSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects instead of a real array:

{"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."} {"title": "Result 3", "url": "..."}

That shape is detected as PLAIN_TEXT (confidence 0.5), so SmartCrusher passes it through uncompressed and web-search results compress 0%, even though this is exactly the high-volume, repetitive output that benefits most from compression.
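
To illustrate why this shape is not a valid JSON document on its own, here is a minimal standard-library sketch (not code from this PR). A strict parse rejects the payload, while `json.JSONDecoder.raw_decode` can parse one object and report where it stopped. The two-object `payload` is a shortened version of the example above.

```python
import json

payload = '{"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."}'

# A strict parse rejects the payload: after the first object the parser
# finds trailing content and raises JSONDecodeError ("Extra data").
try:
    json.loads(payload)
    strict_parse_ok = True
except json.JSONDecodeError:
    strict_parse_ok = False

# raw_decode parses a single value and returns the index where it stopped,
# which is what makes walking a run of back-to-back objects possible.
first, end = json.JSONDecoder().raw_decode(payload)

print(strict_parse_ok)   # False
print(first["title"])    # Result 1
print(repr(payload[end]))  # ' ' (the separator before the second object)
```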

Closes #1741

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • content_detector.py: _try_detect_json now recognizes a run of ≥2 whitespace-separated (space- or newline-separated) JSON objects and returns JSON_ARRAY with metadata["concatenated"] = True. The router already falls back to the Python regex detector when the native detector returns PLAIN_TEXT (content_router.py), so this fixes routing on the default backend too.
  • content_detector.py: added normalize_concatenated_json() (and a _decode_concatenated_json() helper) that rewrites the space-separated shape into a canonical [{…}, {…}] array string.
  • smart_crusher.py: SmartCrusher.crush() normalizes concatenated JSON to a real array before handing it to the Rust crusher, so it actually compresses.
  • The change is deliberately conservative: a single object stays unclaimed (_try_detect_json('{"id": 1}') → None), and any non-JSON token between objects disqualifies the run. Existing [-array detection is unchanged.
  • Added tests and a CHANGELOG entry.
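
The rules listed above can be sketched with the standard library alone. This is an illustrative approximation, not the code in this PR: `decode_concatenated_sketch` and `normalize_concatenated_sketch` are hypothetical stand-ins for `_decode_concatenated_json()` and `normalize_concatenated_json()`, and the real helpers may differ in signature and edge-case handling.

```python
import json
from typing import Any, Optional

JSON_WHITESPACE = " \t\r\n"


def decode_concatenated_sketch(text: str) -> Optional[list[dict[str, Any]]]:
    """Return the objects in a run of >=2 whitespace-separated JSON objects, else None."""
    decoder = json.JSONDecoder()
    objects: list[dict[str, Any]] = []
    pos, length = 0, len(text)
    while pos < length:
        # Skip the whitespace between objects (raw_decode does not skip it).
        while pos < length and text[pos] in JSON_WHITESPACE:
            pos += 1
        if pos == length:
            break
        # Every item must be an object; any other token disqualifies the run.
        if text[pos] != "{":
            return None
        try:
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return None
        # Only whitespace may follow an object (no commas, no adjacent braces).
        if pos < length and text[pos] not in JSON_WHITESPACE:
            return None
        objects.append(value)
    # A single object is deliberately left unclaimed.
    return objects if len(objects) >= 2 else None


def normalize_concatenated_sketch(text: str) -> str:
    """Rewrite a concatenated run as a JSON array string; otherwise return text unchanged."""
    objects = decode_concatenated_sketch(text)
    return text if objects is None else json.dumps(objects)


print(normalize_concatenated_sketch('{"a": 1} {"b": 2}'))  # [{"a": 1}, {"b": 2}]
print(decode_concatenated_sketch('{"id": 1}'))             # None
print(decode_concatenated_sketch('{"a": 1} oops {"b": 2}'))  # None
```

In this sketch the comma-separated-without-brackets shape is rejected because the character after the first object is not whitespace, which matches the "intentionally not claimed" behavior described in this PR.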

Testing

  • Unit tests pass (pytest) — affected suites (full suite has network-dependent ML tests that can't run offline; see note)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom) — not run (mypy not installed in my env; change is fully type-annotated)
  • New tests added for new functionality
  • Manual testing performed

Test Output

$ ruff check .
All checks passed!

$ pytest tests/test_transforms_content_detection.py -q
............                                                             [100%]
12 passed

$ pytest tests/test_transforms_content_router.py \
         tests/test_smart_crusher_toin_attachment.py \
         tests/test_transforms_tabular.py -q
96 passed, 2 skipped
# + SmartCrusher passthrough tests in test_text_compressors.py: 2 passed

Real Behavior Proof

  • Environment: macOS, Python 3.11, headroom-ai==0.29.0 installed from PyPI (ships the prebuilt Rust _core), patched with this change. Default detection backend (native Rust → Python-regex fallback on PLAIN_TEXT).
  • Exact command / steps: ran a 100-object space-separated web_search payload through detect_content_type() and ContentRouter().compress(), before and after the patch (repro below).
  • Observed result: detection flips PLAIN_TEXT (conf 0.5) → JSON_ARRAY (conf 1.0) and SmartCrusher compression goes from 0.0% to 34.2% (10369 → 6819 bytes) on the identical payload.
  • Not tested: the native Rust detector path in isolation (the fix relies on the existing documented Python-regex fallback for PLAIN_TEXT); separators other than whitespace (comma-separated-without-brackets is intentionally not claimed).
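
For readers who want a comparable input, the following sketch builds a 100-object space-separated payload. The field names, URLs, and snippet text are assumptions made for illustration, so the byte count will not match the 10369 bytes reported here. The headroom calls named above (`detect_content_type()` and `ContentRouter().compress()`) are left out because their import paths are not shown in this PR.

```python
import json


def build_space_separated_payload(n: int = 100) -> str:
    """Build n web-search-style results as JSON objects joined by single spaces."""
    results = [
        {
            "title": f"Result {i}",
            "url": f"https://example.com/page/{i}",  # hypothetical URL
            "snippet": "Repeated boilerplate snippet text for a search hit.",
        }
        for i in range(1, n + 1)
    ]
    # No brackets and no commas between objects: the shape this PR targets.
    return " ".join(json.dumps(result) for result in results)


payload = build_space_separated_payload()
print(payload[:60])
print(len(payload.encode("utf-8")), "bytes")
```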

Before:

detected  : ContentType.PLAIN_TEXT  conf 0.5
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 10369
reduction : 0.0%

After:

detected  : ContentType.JSON_ARRAY  conf 1.0
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 6819
reduction : 34.2%
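
As a quick check, the reported reduction follows directly from the two byte counts above:

```python
orig_bytes = 10369
comp_bytes = 6819

saved = orig_bytes - comp_bytes        # bytes removed by compression
reduction = saved / orig_bytes * 100   # percentage of the original size

print(f"saved {saved} bytes, reduction {reduction:.1f}%")
# saved 3550 bytes, reduction 34.2%
```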

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (CHANGELOG)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

  • Kept as a draft pending a maintainer's read and my own re-run of the repro on a clean install before flipping "ready for human review".
  • mypy headroom was not run (mypy isn't in my local env); the new code is fully type-annotated and ruff check . is clean repo-wide.

Web search tools (SerpAPI, Tavily, custom backends) commonly return
back-to-back JSON objects separated by whitespace ({...} {...} {...})
rather than a JSON array. detect_content_type only treated input
starting with [ as JSON_ARRAY, so this shape fell through to PLAIN_TEXT
and SmartCrusher passed it through at 0% compression.

The detector now recognizes a run of >=2 whitespace-separated JSON
objects as JSON_ARRAY, and SmartCrusher normalizes that shape to a real
array before crushing. Measured ~34% byte reduction on a 100-result
web_search payload that previously compressed 0%.

Closes headroomlabs-ai#1741

github-actions Bot commented Jul 3, 2026

PR governance

This draft PR follows the template so far. Keep it in draft until it is ready for human review.

github-actions Bot added and removed the label "status: needs author action" (Pull request body or readiness checklist still needs author updates) on Jul 3, 2026.

Development

Successfully merging this pull request may close these issues.

fix(content-detector): support space-separated JSON objects for web_search tool output
