Bug Description
Headroom's ContentType detector (detect_content_type() in headroom/transforms/content_detector.py) only recognizes valid JSON arrays (starting with [) as JSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects:
{"title": "Result 1", "url": "http://example.com/1", "snippet": "..."} {"title": "Result 2", "url": "http://example.com/2", "snippet": "..."} {"title": "Result 3", "url": "http://example.com/3", "snippet": "..."}
This format is not detected as JSON_ARRAY (confidence 0.5 → PLAIN_TEXT), so SmartCrusher never processes it — resulting in 0% compression for web search results.
Steps to Reproduce
- Install headroom-ai v0.28.0+
- Create space-separated JSON content (typical web_search output):
content = ' '.join([json.dumps({"title": f"Result {i}", "url": f"http://example.com/{i}", "snippet": f"snippet {i}"}) for i in range(100)])
- Run detection:
from headroom.transforms import detect_content_type
result = detect_content_type(content)
print(result.content_type) # ContentType.PLAIN_TEXT (confidence 0.5)
- Compress via SmartCrusher — returns passthrough (no compression)
Expected Behavior
Space-separated JSON objects should be:
- Detected as
JSON_ARRAY (or new JSON_OBJECTS type)
- Converted to valid JSON array internally
- Compressed by SmartCrusher with ~40-90% token reduction
Actual Behavior
- Detection returns
PLAIN_TEXT (confidence 0.5)
- SmartCrusher sees non-JSON, returns
passthrough
- 0% compression on web_search tool outputs
Environment
- headroom-ai: 0.28.0
- Python: 3.10+
- OS: Linux/Windows/macOS
Proposed Solution
Option 1: Enhance _try_detect_json() in content_detector.py
Add support for space-separated JSON objects before the fallback:
def _try_detect_json(content: str) -> DetectionResult | None:
content = content.strip()
# 1. Try valid JSON array (existing)
if content.startswith('['):
try:
parsed = json.loads(content)
if isinstance(parsed, list):
return DetectionResult(...)
except json.JSONDecodeError:
pass
# 2. NEW: Try space-separated JSON objects
# Pattern: {...} {...} {...} -> valid JSON array
import re
# Normalize whitespace
normalized = re.sub(r'\s+', ' ', content)
if '} {' in normalized:
try:
fixed = '[' + normalized.replace('} {', '},{') + ']'
parsed = json.loads(fixed)
if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
return DetectionResult(
ContentType.JSON_ARRAY,
0.9, # High confidence
{"item_count": len(parsed), "is_dict_array": True, "was_space_separated": True}
)
except json.JSONDecodeError:
pass
# 3. NEW: Try JSON Lines format
lines = [line.strip() for line in content.split('\n') if line.strip()]
if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
try:
fixed = '[' + ','.join(lines) + ']'
parsed = json.loads(fixed)
if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
return DetectionResult(
ContentType.JSON_ARRAY,
0.9,
{"item_count": len(parsed), "is_dict_array": True, "was_json_lines": True}
)
except json.JSONDecodeError:
pass
return None
Option 2: Add pre-processing in ContentRouter / SmartCrusher
Handle conversion at compression time rather than detection time (less invasive, more flexible):
# In ContentRouter.compress() or SmartCrusher.crush()
def _ensure_json_array(content: str) -> str:
"""Convert space-separated JSON / JSON Lines to valid JSON array."""
content = content.strip()
if not content:
return content
# Already valid array?
if content.startswith('[') and content.endswith(']'):
try:
json.loads(content)
return content
except json.JSONDecodeError:
pass
# Normalize whitespace
import re
content = re.sub(r'\s+', ' ', content)
# Space-separated objects
if '} {' in content:
fixed = '[' + content.replace('} {', '},{') + ']'
try:
json.loads(fixed)
return fixed
except json.JSONDecodeError:
pass
# JSON Lines
lines = [line.strip() for line in content.split('\n') if line.strip()]
if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
fixed = '[' + ','.join(lines) + ']'
try:
json.loads(fixed)
return fixed
except json.JSONDecodeError:
pass
return content
Option 3: Add tool_profiles hint for known tools
Allow users to declare tool output format in config:
# headroom config
tool_profiles:
web_search:
input_format: "space_separated_json" # or "json_lines", "json_array"
custom_search:
input_format: "json_lines"
Alternatives Considered
| Approach |
Pros |
Cons |
Enhance detect_content_type() |
Central fix, all tools benefit |
May affect other content types |
Pre-process in SmartCrusher |
Targeted, minimal risk |
Only fixes SmartCrusher path |
Config tool_profiles |
Explicit, user-controlled |
Requires config, not automatic |
Recommendation: Option 2 (pre-processing in ContentRouter/SmartCrusher) — minimal risk, handles the actual compression path, doesn't change detection semantics for other uses.
Impact
- High: web_search is often the largest token consumer in agent workflows
- Current 0% → potential 40-90% compression
- Fixes a common real-world pattern (SerpAPI, Tavily, custom backends)
Related Issues
Implementation Notes
The fix has been tested and validated in a downstream integration (Hermes Agent):
- Space-separated JSON → valid array conversion works
- SmartCrusher achieves ~44% compression on realistic web_search data (100 items)
- Handles edge cases: extra whitespace, tabs, newlines, nested objects, JSON Lines
Test results:
web_search (space-separated, 100 items): 44.4% compression (router:mixed:0.36)
web_search (valid JSON array, 100 items): 33.3% compression (router:smart_crusher:0.60)
Willing to Contribute
Yes — can submit PR with:
- Core fix in
content_detector.py or content_router.py
- Tests in
tests/test_transforms/test_content_detector.py
- Documentation update in
docs/transforms.md
Reference Implementation (from Hermes Agent headroom_compressor.py):
def _ensure_json_array(content: str) -> str:
"""Convert space-separated JSON objects to a valid JSON array."""
content = content.strip()
if not content:
return content
import re
content = re.sub(r'\s+', ' ', content) # Normalize whitespace
if content.startswith('[') and content.endswith(']'):
try:
json.loads(content)
return content
except json.JSONDecodeError:
pass
if '} {' in content:
fixed = '[' + content.replace('} {', '},{') + ']'
try:
json.loads(fixed)
return fixed
except json.JSONDecodeError:
pass
lines = [line.strip() for line in content.split('\n') if line.strip()]
if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
fixed = '[' + ','.join(lines) + ']'
try:
json.loads(fixed)
return fixed
except json.JSONDecodeError:
pass
return content
Bug Description
Headroom's
ContentTypedetector (detect_content_type()inheadroom/transforms/content_detector.py) only recognizes valid JSON arrays (starting with[) asJSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects:{"title": "Result 1", "url": "http://example.com/1", "snippet": "..."} {"title": "Result 2", "url": "http://example.com/2", "snippet": "..."} {"title": "Result 3", "url": "http://example.com/3", "snippet": "..."}This format is not detected as
JSON_ARRAY(confidence 0.5 →PLAIN_TEXT), soSmartCrushernever processes it — resulting in 0% compression for web search results.Steps to Reproduce
Expected Behavior
Space-separated JSON objects should be:
JSON_ARRAY(or newJSON_OBJECTStype)Actual Behavior
PLAIN_TEXT(confidence 0.5)passthroughEnvironment
Proposed Solution
Option 1: Enhance
_try_detect_json()incontent_detector.pyAdd support for space-separated JSON objects before the fallback:
Option 2: Add pre-processing in ContentRouter / SmartCrusher
Handle conversion at compression time rather than detection time (less invasive, more flexible):
Option 3: Add
tool_profileshint for known toolsAllow users to declare tool output format in config:
Alternatives Considered
detect_content_type()SmartCrushertool_profilesRecommendation: Option 2 (pre-processing in ContentRouter/SmartCrusher) — minimal risk, handles the actual compression path, doesn't change detection semantics for other uses.
Impact
Related Issues
Implementation Notes
The fix has been tested and validated in a downstream integration (Hermes Agent):
Test results:
Willing to Contribute
Yes — can submit PR with:
content_detector.pyorcontent_router.pytests/test_transforms/test_content_detector.pydocs/transforms.mdReference Implementation (from Hermes Agent
headroom_compressor.py):