fix(content-detector): support space-separated JSON objects for web_search tool output

## Bug Description

Headroom's `ContentType` detector (`detect_content_type()` in `headroom/transforms/content_detector.py`) only recognizes **valid JSON arrays** (starting with `[`) as `JSON_ARRAY`. Many web search tools (SerpAPI, Tavily, custom backends) return **space-separated JSON objects**:

```json
{"title": "Result 1", "url": "http://example.com/1", "snippet": "..."} {"title": "Result 2", "url": "http://example.com/2", "snippet": "..."} {"title": "Result 3", "url": "http://example.com/3", "snippet": "..."}
```

This format is **not detected** as `JSON_ARRAY`: detection falls back to `PLAIN_TEXT` with confidence 0.5, so `SmartCrusher` never processes it, resulting in **0% compression** for web search results.
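The underlying parse failure can be reproduced with the standard library alone. This is a minimal sketch that does not touch Headroom; the helper name `parses_as_json` and the sample string are illustrative assumptions, not part of the library:

```python
import json

# Two objects separated by a single space, as in the example above.
content = '{"title": "Result 1"} {"title": "Result 2"}'


def parses_as_json(text: str) -> bool:
    """Return True if the whole string is one valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# json.loads() parses the first object, then rejects the rest as extra data.
raw_ok = parses_as_json(content)

# Wrapping in brackets and inserting commas yields a valid array.
wrapped_ok = parses_as_json('[' + content.replace('} {', '},{') + ']')

print(raw_ok, wrapped_ok)
```

Any detector that relies on a single `json.loads()` call over the whole payload will therefore treat this input as non-JSON.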

## Steps to Reproduce

1. Install headroom-ai v0.28.0+
2. Create space-separated JSON content (typical web_search output):
   ```python
   import json
   content = ' '.join([json.dumps({"title": f"Result {i}", "url": f"http://example.com/{i}", "snippet": f"snippet {i}"}) for i in range(100)])
   ```
3. Run detection:
   ```python
   from headroom.transforms import detect_content_type
   result = detect_content_type(content)
   print(result.content_type)  # ContentType.PLAIN_TEXT (confidence 0.5)
   ```
4. Compress via SmartCrusher — returns passthrough (no compression)

## Expected Behavior

Space-separated JSON objects should be:
1. **Detected** as `JSON_ARRAY` (or a new `JSON_OBJECTS` type)
2. **Converted** to a valid JSON array internally
3. **Compressed** by SmartCrusher with ~40-90% token reduction

## Actual Behavior

- Detection returns `PLAIN_TEXT` (confidence 0.5)
- SmartCrusher sees non-JSON, returns `passthrough`
- 0% compression on web_search tool outputs

## Environment

- headroom-ai: 0.28.0
- Python: 3.10+
- OS: Linux/Windows/macOS

## Proposed Solution

### Option 1: Enhance `_try_detect_json()` in `content_detector.py`

Add support for space-separated JSON objects before the fallback:

```python
import json
import re


def _try_detect_json(content: str) -> DetectionResult | None:
    content = content.strip()

    # 1. Try valid JSON array (existing)
    if content.startswith('['):
        try:
            parsed = json.loads(content)
            if isinstance(parsed, list):
                return DetectionResult(...)
        except json.JSONDecodeError:
            pass

    # 2. NEW: Try space-separated JSON objects
    # Pattern: {...} {...} {...} -> valid JSON array
    # Collapse whitespace runs so tab- or newline-separated objects also match '} {'.
    # Caveat: this also rewrites whitespace (and any '} {') inside string values.
    normalized = re.sub(r'\s+', ' ', content)
    if '} {' in normalized:
        try:
            fixed = '[' + normalized.replace('} {', '},{') + ']'
            parsed = json.loads(fixed)
            if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
                return DetectionResult(
                    ContentType.JSON_ARRAY,
                    0.9,  # High confidence
                    {"item_count": len(parsed), "is_dict_array": True, "was_space_separated": True}
                )
        except json.JSONDecodeError:
            pass
    
    # 3. NEW: Try JSON Lines format
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        try:
            fixed = '[' + ','.join(lines) + ']'
            parsed = json.loads(fixed)
            if isinstance(parsed, list) and parsed and all(isinstance(item, dict) for item in parsed):
                return DetectionResult(
                    ContentType.JSON_ARRAY,
                    0.9,
                    {"item_count": len(parsed), "is_dict_array": True, "was_json_lines": True}
                )
        except json.JSONDecodeError:
            pass
    
    return None
```

### Option 2: Add pre-processing in ContentRouter / SmartCrusher

Handle conversion at compression time rather than detection time (less invasive, more flexible):

```python
# In ContentRouter.compress() or SmartCrusher.crush()
import json
import re


def _ensure_json_array(content: str) -> str:
    """Convert space-separated JSON / JSON Lines to a valid JSON array.

    Returns the stripped input unchanged when no conversion applies.
    """
    content = content.strip()
    if not content:
        return content

    # Already valid array?
    if content.startswith('[') and content.endswith(']'):
        try:
            json.loads(content)
            return content
        except json.JSONDecodeError:
            pass

    # Normalize whitespace into a separate variable so the fallback
    # below still returns the original text, not a collapsed copy.
    normalized = re.sub(r'\s+', ' ', content)

    # Space-separated objects (also matches newline-separated objects
    # once whitespace has been collapsed)
    if '} {' in normalized:
        fixed = '[' + normalized.replace('} {', '},{') + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass

    # JSON Lines: split the original text, which still has its newlines
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        fixed = '[' + ','.join(lines) + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass

    return content
```

### Option 3: Add `tool_profiles` hint for known tools

Allow users to declare tool output format in config:

```yaml
# headroom config
tool_profiles:
  web_search:
    input_format: "space_separated_json"  # or "json_lines", "json_array"
  custom_search:
    input_format: "json_lines"
```

## Alternatives Considered

| Approach | Pros | Cons |
|----------|------|------|
| Enhance `detect_content_type()` | Central fix, all tools benefit | May affect other content types |
| Pre-process in `SmartCrusher` | Targeted, minimal risk | Only fixes SmartCrusher path |
| Config `tool_profiles` | Explicit, user-controlled | Requires config, not automatic |

**Recommendation:** Option 2 (pre-processing in ContentRouter/SmartCrusher) — minimal risk, handles the actual compression path, doesn't change detection semantics for other uses.

## Impact

- **High**: web_search is often the largest token consumer in agent workflows
- Current 0% → potential 40-90% compression
- Fixes a common real-world pattern (SerpAPI, Tavily, custom backends)

## Related Issues

- #553 — JSON bracket detection in mixed content (fixed)
- #887 — JSON array item counting fix (fixed)
- No existing issue for space-separated JSON objects

## Implementation Notes

The fix has been **tested and validated** in a downstream integration (Hermes Agent):
- Space-separated JSON → valid array conversion works
- SmartCrusher achieves **~44% compression** on realistic web_search data (100 items)
- Handles edge cases: extra whitespace, tabs, newlines, nested objects, JSON Lines

**Test results:**
```
web_search (space-separated, 100 items): 44.4% compression (router:mixed:0.36)
web_search (valid JSON array, 100 items): 33.3% compression (router:smart_crusher:0.60)
```

## Willing to Contribute

Yes — can submit PR with:
1. Core fix in `content_detector.py` or `content_router.py`
2. Tests in `tests/test_transforms/test_content_detector.py`
3. Documentation update in `docs/transforms.md`

---

**Reference Implementation** (from Hermes Agent `headroom_compressor.py`):
```python
def _ensure_json_array(content: str) -> str:
    """Convert space-separated JSON objects to a valid JSON array."""
    content = content.strip()
    if not content:
        return content
    
    import re
    content = re.sub(r'\s+', ' ', content)  # Normalize whitespace
    
    if content.startswith('[') and content.endswith(']'):
        try:
            json.loads(content)
            return content
        except json.JSONDecodeError:
            pass
    
    if '} {' in content:
        fixed = '[' + content.replace('} {', '},{') + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    if lines and all(line.startswith('{') and line.endswith('}') for line in lines):
        fixed = '[' + ','.join(lines) + ']'
        try:
            json.loads(fixed)
            return fixed
        except json.JSONDecodeError:
            pass
    
    return content
```
