fix(content-detector): detect and compress space-separated JSON objects #1742

Open
rohanprichard wants to merge 1 commit into headroomlabs-ai:main from rohanprichard:fix/content-detector-space-separated-json

Conversation

@rohanprichard rohanprichard commented Jul 3, 2026

Description

Headroom's detect_content_type() only recognizes content starting with [ as a JSON array. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects instead of a real array, like this:

{"title": "Result 1", "url": "..."} {"title": "Result 2", "url": "..."} {"title": "Result 3", "url": "..."}

That shape is detected as PLAIN_TEXT (confidence 0.5), so SmartCrusher never processes it and web-search results compress 0%.

Closes #1741

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • content_detector.py: _try_detect_json now recognizes a run of ≥2 whitespace-separated (space- or newline-separated) JSON objects and returns JSON_ARRAY with metadata["concatenated"] = True. The router already falls back to the Python regex detector when the native detector returns PLAIN_TEXT (content_router.py), so this fixes routing on the default backend too.
  • content_detector.py: added normalize_concatenated_json() (and a _decode_concatenated_json() helper) that rewrites the space-separated shape into a canonical [{…}, {…}] array string.
  • smart_crusher.py: SmartCrusher.crush() normalizes concatenated JSON to a real array before handing it to the Rust crusher, so it actually compresses.
  • The change is deliberately conservative: a single object stays unclaimed (_try_detect_json('{"id": 1}') → None), and any non-JSON token between objects disqualifies the run. Existing detection of arrays starting with [ is unchanged.
  • Added tests and a CHANGELOG entry.
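
The detection and normalization rules described above can be sketched in isolation as follows. This is a minimal illustration built on the standard library's `json.JSONDecoder.raw_decode`, not the code in this PR: the function names `decode_concatenated_objects` and `normalize_to_array` are hypothetical, and rejecting objects that touch without any whitespace between them (`{}{}`) is an assumption of the sketch.

```python
import json
from typing import Any, Optional

_DECODER = json.JSONDecoder()
_JSON_WHITESPACE = " \t\n\r"


def decode_concatenated_objects(text: str) -> Optional[list[dict[str, Any]]]:
    """Return the parsed objects if `text` is a run of >=2 whitespace-separated
    JSON objects, otherwise None. (Hypothetical helper, for illustration only.)"""
    objects: list[dict[str, Any]] = []
    pos, end = 0, len(text)
    while True:
        # Skip the whitespace separating (or surrounding) the objects.
        while pos < end and text[pos] in _JSON_WHITESPACE:
            pos += 1
        if pos == end:
            break
        # Only objects are claimed; any other token disqualifies the run.
        if text[pos] != "{":
            return None
        try:
            obj, pos = _DECODER.raw_decode(text, pos)
        except json.JSONDecodeError:
            return None
        # Assumption of this sketch: objects must be separated by whitespace.
        if pos < end and text[pos] not in _JSON_WHITESPACE:
            return None
        objects.append(obj)
    # A single object is deliberately left unclaimed.
    return objects if len(objects) >= 2 else None


def normalize_to_array(text: str) -> str:
    """Rewrite a concatenated run as a canonical JSON array string;
    return the input unchanged when it is not such a run."""
    objects = decode_concatenated_objects(text)
    return text if objects is None else json.dumps(objects)


print(normalize_to_array('{"title": "Result 1"} {"title": "Result 2"}'))
```

Parsing with `raw_decode` instead of splitting on whitespace keeps spaces inside string values from being mistaken for separators.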

Testing

  • Unit tests pass (pytest) for the affected suites; the full suite has network-dependent ML tests that can't run offline (see note)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • New tests added for new functionality
  • Manual testing performed

Test Output

$ ruff check .
All checks passed!

$ pytest tests/test_transforms_content_detection.py -q
............                                                             [100%]
12 passed

$ pytest tests/test_transforms_content_router.py \
         tests/test_smart_crusher_toin_attachment.py \
         tests/test_transforms_tabular.py -q
96 passed, 2 skipped
# + SmartCrusher passthrough tests in test_text_compressors.py: 2 passed

Real Behavior Proof

  • Environment: macOS 26.5, Python 3.12.11, editable source build (uv pip install -e .) with the Rust _core compiled locally; default detection backend (native Rust → Python-regex fallback on PLAIN_TEXT).
  • Exact command / steps: ran a 100-object space-separated web_search payload through detect_content_type() and ContentRouter().compress(), before and after the patch (repro below).
  • Observed result: detection flips PLAIN_TEXT (conf 0.5) → JSON_ARRAY (conf 1.0) and SmartCrusher compression goes from 0.0% to 34.2% (10369 → 6819 bytes) on the identical payload.
  • Not tested: the native Rust detector path in isolation (the fix relies on the existing documented Python-regex fallback for PLAIN_TEXT); separators other than whitespace (comma-separated-without-brackets is intentionally not claimed).

Before:

detected  : ContentType.PLAIN_TEXT  conf 0.5
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 10369
reduction : 0.0%

After:

detected  : ContentType.JSON_ARRAY  conf 1.0
strategy  : CompressionStrategy.SMART_CRUSHER
orig bytes: 10369
comp bytes: 6819
reduction : 34.2%
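
The reduction figure follows from the two byte counts reported above. A quick check, where the helper name `reduction_percent` is illustrative only:

```python
def reduction_percent(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage of bytes saved relative to the original size."""
    return (original_bytes - compressed_bytes) / original_bytes * 100


# Byte counts taken from the Before/After output above.
print(f"{reduction_percent(10369, 10369):.1f}%")  # before the patch: 0.0%
print(f"{reduction_percent(10369, 6819):.1f}%")   # after the patch: 34.2%
```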

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (CHANGELOG)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

github-actions Bot commented Jul 3, 2026

Contributor

PR governance

This PR follows the template and is marked ready for human review.

@github-actions github-actions Bot added status: needs author action Pull request body or readiness checklist still needs author updates and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jul 3, 2026
@rohanprichard rohanprichard marked this pull request as ready for review July 5, 2026 13:31
@github-actions github-actions Bot added status: needs author action Pull request body or readiness checklist still needs author updates status: ready for review Pull request body is complete and the author marked it ready for human review and removed status: needs author action Pull request body or readiness checklist still needs author updates labels Jul 5, 2026
Web search tools (SerpAPI, Tavily, custom backends) commonly return
back-to-back JSON objects separated by whitespace ({...} {...} {...})
rather than a JSON array. detect_content_type only treated input
starting with [ as JSON_ARRAY, so this shape fell through to PLAIN_TEXT
and SmartCrusher passed it through at 0% compression.

The detector now recognizes a run of >=2 whitespace-separated JSON
objects as JSON_ARRAY, and SmartCrusher normalizes that shape to a real
array before crushing. Measured ~34% byte reduction on a 100-result
web_search payload that previously compressed 0%.

Closes headroomlabs-ai#1741
@rohanprichard rohanprichard force-pushed the fix/content-detector-space-separated-json branch from a216a05 to 35b973a Compare July 5, 2026 13:39
Labels

status: ready for review Pull request body is complete and the author marked it ready for human review

Development

Successfully merging this pull request may close these issues.

fix(content-detector): support space-separated JSON objects for web_search tool output

1 participant