fix(search-compressor): CJK-aware relevance + harden Rust/Python parity by lifeodyssey · Pull Request #1749 · headroomlabs-ai/headroom

lifeodyssey · 2026-07-03T13:51:37Z

Description

The search compressor's relevance scorer (score_matches, present in both the Rust runtime path and the Python legacy mirror) split the query on whitespace. A spaceless CJK query therefore matched a result line only when the WHOLE query was a literal substring of that line — partial overlaps never boosted relevant lines, so correct matches got dropped when the result set was over budget.

This adds CJK character bigrams to the query match set, so a longer CJK query boosts lines that share a substring. It also fixes two latent Rust/Python parity divergences the ASCII-only fixtures had masked:

Length filter: Rust counted word length in BYTES (w.len()), Python in codepoints (len(w)), so a CJK word crossed the > 2 threshold differently. Rust now uses chars().count().
Dedup: Rust collected words into a Vec (no dedup), Python into a set, so a repeated query word double-counted in Rust. Rust now uses a BTreeSet.
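The two divergences above can be reproduced in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the "old Rust" behaviour is emulated by measuring UTF-8 byte length and keeping a list, which is what `w.len()` and a `Vec` amount to. The `> 2` threshold is the one named in the description.

```python
def words_old_rust_style(query: str) -> list[str]:
    # Emulates the pre-fix Rust path: length in UTF-8 bytes, no dedup.
    return [w for w in query.split() if len(w.encode("utf-8")) > 2]


def words_python_style(query: str) -> set[str]:
    # Emulates the Python path: length in code points, deduplicated.
    return {w for w in query.split() if len(w) > 2}


query = "令牌 token token"
print(words_old_rust_style(query))        # ['令牌', 'token', 'token']
print(sorted(words_python_style(query)))  # ['token']
```

The two-character word 令牌 is 6 bytes in UTF-8, so it passes a byte-based `> 2` check but fails a codepoint-based one, and the repeated `token` is counted twice only in the list version.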

The two scorers now produce byte-identical output for word-overlap and CJK-bigram scoring; non-CJK output is unchanged (the 53 existing tests and the parity fixtures stay green).

Type of Change

Bug fix (non-breaking change that fixes an issue)

Changes Made

crates/headroom-core/src/transforms/search_compressor.rs + headroom/transforms/search_compressor.py: add is_cjk_char/_is_cjk_char and cjk_bigrams/_cjk_bigrams (identical ranges + logic), union CJK bigrams into the query match set, and align the Rust word set to Python (chars().count() length, BTreeSet dedup).
tests/test_search_compressor_cjk.py + a Rust unit test: CJK bigram extraction (same input/expected in both languages) and a CJK query boosting a partially-overlapping line.
Corrected a stale _score_matches docstring that referenced a non-existent parity assertion; it now states honestly how the two sides are pinned (test-equal for word-overlap + CJK bigrams; a few error-boost keywords still diverge, fixed only Rust-side).

Testing

Unit tests pass (cargo test + pytest)
Linting passes (cargo clippy / cargo fmt / ruff / mypy)
New tests added for new functionality
Manual testing performed (see Real Behavior Proof)

Test Output

$ cargo test -p headroom-core --lib search_compressor
test result: ok. 16 passed; 0 failed

$ .venv/bin/python -m pytest tests/test_search_compressor_cjk.py \
    tests/test_transforms_search_compressor.py tests/test_search_compressor.py
55 passed   # 2 new CJK tests + 53 existing (no regression)

Real Behavior Proof

Environment: macOS (Darwin), Rust via cargo, Python in a uv venv (_core rebuilt on this branch), branch feat/search-compressor-cjk off main.
Exact command / steps: scored a CJK content line against a longer CJK query whose whole form is not a substring of the line.
Observed result: for content src/a.py:10:认证令牌已过期需要重新登录 and query 认证令牌缓存淘汰策略 (the whole query is NOT a substring of the line, but its bigrams are), the line now scores > 0 (bigrams 认证 / 证令 / 令牌 match); before, it scored 0. An ASCII-only line still scores 0. All 53 existing search-compressor tests are unchanged. cjk_bigrams("认证令牌") returns {认证, 证令, 令牌} in both Rust and Python.
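The observed result can be approximated with a toy scorer. This is a sketch under stated assumptions, not the real score_matches: `toy_score` is a hypothetical function that awards one point per match-set term found in the line, and the CJK check covers only the CJK Unified Ideographs block. The content line and query are the ones quoted above.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: CJK Unified Ideographs only (U+4E00..U+9FFF).
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text: str) -> set[str]:
    # Adjacent character pairs where both characters are CJK.
    return {a + b for a, b in zip(text, text[1:]) if is_cjk(a) and is_cjk(b)}


def match_terms(query: str, use_bigrams: bool) -> set[str]:
    # Whitespace-split words longer than 2 code points, plus CJK bigrams.
    terms = {w for w in query.split() if len(w) > 2}
    if use_bigrams:
        terms |= cjk_bigrams(query)
    return terms


def toy_score(line: str, query: str, use_bigrams: bool = True) -> int:
    # Hypothetical weighting: one point per term that occurs in the line.
    return sum(1 for term in match_terms(query, use_bigrams) if term in line)


line = "src/a.py:10:认证令牌已过期需要重新登录"
query = "认证令牌缓存淘汰策略"

print(toy_score(line, query, use_bigrams=False))  # 0: whole query is not a substring
print(toy_score(line, query, use_bigrams=True))   # 3: 认证, 证令, 令牌
print(toy_score("src/b.py:3:token expired", query))  # 0: ASCII-only line
```

The real scorer presumably weights matches differently; the point here is only that the whole-query term misses while three of the nine query bigrams hit.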

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation — N/A (internal relevance scoring)
My changes generate no new warnings
I have added tests that prove my fix is effective
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md — N/A: internal relevance-scoring fix, no user-facing surface change

Additional Notes

The two parity divergences (byte-vs-codepoint length, Vec-vs-set dedup) were pre-existing and only reachable with non-ASCII or repeated-word queries — the all-ASCII fixtures never exercised them. This PR brings both sides back to byte-exact for the word-overlap + CJK-bigram scoring. The remaining error-boost keyword divergence is pre-existing (fixed only Rust-side in the 3e.1 port) and is now documented in the code rather than glossed over.

The relevance scorer split the query on whitespace, so a spaceless CJK query matched content only when the WHOLE query was a literal substring of a line. Add CJK character bigrams so a longer CJK query boosts lines sharing a substring. Also fixes two latent Rust/Python parity divergences the ASCII-only fixtures masked: Rust counted word length in BYTES (Python: codepoints) and used a Vec (Python: a deduped set). Rust now uses chars().count() and a BTreeSet, matching Python. Byte-exact in both; non-CJK output unchanged (parity fixtures green).

github-actions · 2026-07-03T13:54:18Z

PR governance

This PR does not yet satisfy the required template fields:

Fill in Real Behavior Proof → Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

lifeodyssey requested review from DevanshiVyas, JerrettDavis and chopratejas as code owners July 3, 2026 13:51

github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jul 3, 2026

lifeodyssey wants to merge 1 commit into headroomlabs-ai:main from lifeodyssey:feat/search-compressor-cjk
