Skip to content

fix(content-detector): support space-separated JSON objects for web_search tool output #1741

Description

@Songraf

Bug Description

Headroom's ContentType detector (detect_content_type() in headroom/transforms/content_detector.py) only recognizes valid JSON arrays (starting with [) as JSON_ARRAY. Many web search tools (SerpAPI, Tavily, custom backends) return space-separated JSON objects:

{"title": "Result 1", "url": "http://example.com/1", "snippet": "..."} {"title": "Result 2", "url": "http://example.com/2", "snippet": "..."} {"title": "Result 3", "url": "http://example.com/3", "snippet": "..."}

This format is not detected as JSON_ARRAY (confidence 0.5 → PLAIN_TEXT), so SmartCrusher never processes it — resulting in 0% compression for web search results.

Steps to Reproduce

  1. Install headroom-ai v0.28.0+
  2. Create space-separated JSON content (typical web_search output):
    content = ' '.join([json.dumps({"title": f"Result {i}", "url": f"http://example.com/{i}", "snippet": f"snippet {i}"}) for i in range(100)])
  3. Run detection:
    from headroom.transforms import detect_content_type
    result = detect_content_type(content)
    print(result.content_type)  # ContentType.PLAIN_TEXT (confidence 0.5)
  4. Compress via SmartCrusher — returns passthrough (no compression)

Expected Behavior

Space-separated JSON objects should be:

  1. Detected as JSON_ARRAY (or new JSON_OBJECTS type)
  2. Converted to valid JSON array internally
  3. Compressed by SmartCrusher with ~40-90% token reduction

Actual Behavior

  • Detection returns PLAIN_TEXT (confidence 0.5)
  • SmartCrusher sees non-JSON, returns passthrough
  • 0% compression on web_search tool outputs

Environment

  • headroom-ai: 0.28.0
  • Python: 3.10+
  • OS: Linux/Windows/macOS

Proposed Solution

Option 1: Enhance _try_detect_json() in content_detector.py

Add support for space-separated JSON objects before the fallback:

def _try_detect_json(content: str) -> DetectionResult | None:
    content = content.strip()
    
    # 1. Try valid JSON array (existing)
    if content.startswith('['):
        try:
            parsed = json.loads(content)
            if isinstance(parsed, list):
                return DetectionResult(...)
        except json.JSONDecodeError:
            pass
    
    # 2. NEW: Try space-separated JSON objects
    # Pattern: {...} {...} {...} -> valid JSON array
    import re
    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', content)
    if '} {' in normalized:
        try:
            fixed = '[' + normalized.replace('} {', '},{') + ']'
            parsed = json.loads(fixed)
            if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
                return DetectionResult(
                    ContentType.JSON_ARRAY,
                    0.9,  # High confidence
                    {"item_count": len(parsed), "is_dict_array": True, "was_space_separated": True}
                )
        except json.JSONDecodeError:
            pass
    
    # 3. NEW: Try JSON Lines format
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        try:
            fixed = '[' + ','.join(lines) + ']'
            parsed = json.loads(fixed)
            if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
                return DetectionResult(
                    ContentType.JSON_ARRAY,
                    0.9,
                    {"item_count": len(parsed), "is_dict_array": True, "was_json_lines": True}
                )
        except json.JSONDecodeError:
            pass
    
    return None

Option 2: Add pre-processing in ContentRouter / SmartCrusher

Handle conversion at compression time rather than detection time (less invasive, more flexible):

# In ContentRouter.compress() or SmartCrusher.crush()
def _ensure_json_array(content: str) -> str:
    """Convert space-separated JSON / JSON Lines to valid JSON array."""
    content = content.strip()
    if not content:
        return content
    
    # Already valid array?
    if content.startswith('[') and content.endswith(']'):
        try:
            json.loads(content)
            return content
        except json.JSONDecodeError:
            pass
    
    # Normalize whitespace
    import re
    content = re.sub(r'\s+', ' ', content)
    
    # Space-separated objects
    if '} {' in content:
        fixed = '[' + content.replace('} {', '},{') + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    # JSON Lines
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        fixed = '[' + ','.join(lines) + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    return content

Option 3: Add tool_profiles hint for known tools

Allow users to declare tool output format in config:

# headroom config
tool_profiles:
  web_search:
    input_format: "space_separated_json"  # or "json_lines", "json_array"
  custom_search:
    input_format: "json_lines"

Alternatives Considered

Approach Pros Cons
Enhance detect_content_type() Central fix, all tools benefit May affect other content types
Pre-process in SmartCrusher Targeted, minimal risk Only fixes SmartCrusher path
Config tool_profiles Explicit, user-controlled Requires config, not automatic

Recommendation: Option 2 (pre-processing in ContentRouter/SmartCrusher) — minimal risk, handles the actual compression path, doesn't change detection semantics for other uses.

Impact

  • High: web_search is often the largest token consumer in agent workflows
  • Current 0% → potential 40-90% compression
  • Fixes a common real-world pattern (SerpAPI, Tavily, custom backends)

Related Issues

Implementation Notes

The fix has been tested and validated in a downstream integration (Hermes Agent):

  • Space-separated JSON → valid array conversion works
  • SmartCrusher achieves ~44% compression on realistic web_search data (100 items)
  • Handles edge cases: extra whitespace, tabs, newlines, nested objects, JSON Lines

Test results:

web_search (space-separated, 100 items): 44.4% compression (router:mixed:0.36)
web_search (valid JSON array, 100 items): 33.3% compression (router:smart_crusher:0.60)

Willing to Contribute

Yes — can submit PR with:

  1. Core fix in content_detector.py or content_router.py
  2. Tests in tests/test_transforms/test_content_detector.py
  3. Documentation update in docs/transforms.md

Reference Implementation (from Hermes Agent headroom_compressor.py):

def _ensure_json_array(content: str) -> str:
    """Convert space-separated JSON objects to a valid JSON array."""
    content = content.strip()
    if not content:
        return content
    
    import re
    content = re.sub(r'\s+', ' ', content)  # Normalize whitespace
    
    if content.startswith('[') and content.endswith(']'):
        try:
            json.loads(content)
            return content
        except json.JSONDecodeError:
            pass
    
    if '} {' in content:
        fixed = '[' + content.replace('} {', '},{') + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        fixed = '[' + ','.join(lines) + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    return content

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions