fix(search-compressor): CJK-aware relevance + harden Rust/Python parity #1749

Open
lifeodyssey wants to merge 1 commit into headroomlabs-ai:main from lifeodyssey:feat/search-compressor-cjk

@lifeodyssey

Contributor

Description

The search compressor's relevance scorer (score_matches, present in both the Rust runtime path and the Python legacy mirror) split the query on whitespace. A spaceless CJK query therefore matched a result line only when the WHOLE query was a literal substring of that line — partial overlaps never boosted relevant lines, so correct matches got dropped when the result set was over budget.
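
To see why a spaceless query fails, here is a minimal Python sketch of the pre-fix matching described above. The helper names `whitespace_words` and `count_word_hits` are hypothetical and only approximate `score_matches`; the real scorer's weights and any normalization are not reproduced here.

```python
def whitespace_words(query: str) -> set[str]:
    """Split on whitespace and keep words longer than 2 codepoints."""
    return {w for w in query.split() if len(w) > 2}


def count_word_hits(line: str, query: str) -> int:
    """Count query words that occur as literal substrings of the line."""
    return sum(1 for w in whitespace_words(query) if w in line)


line = "src/a.py:10:认证令牌已过期需要重新登录"
query = "认证令牌缓存淘汰策略"

# The query has no whitespace, so it becomes a single token.
print(whitespace_words(query))
# That token is not a substring of the line, so nothing matches.
print(count_word_hits(line, query))  # 0
```

The line and the query share the prefix 认证令牌, yet the whole-token substring test cannot see that overlap.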

This adds CJK character bigrams to the query match set, so a longer CJK query boosts lines that share a substring. It also fixes two latent Rust/Python parity divergences the ASCII-only fixtures had masked:

  • Length filter: Rust counted word length in BYTES (w.len()), Python in codepoints (len(w)), so a CJK word crossed the > 2 threshold differently. Rust now uses chars().count().
  • Dedup: Rust collected words into a Vec (no dedup), Python into a set, so a repeated query word double-counted in Rust. Rust now uses a BTreeSet.
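
Both divergences can be reproduced in plain Python, as a rough illustration rather than the project's code. Here `len(word.encode("utf-8"))` stands in for Rust's `w.len()` and a list stands in for the `Vec`; the sample word, query, and line are made up for the example.

```python
word = "认证"  # two CJK characters, three UTF-8 bytes each

codepoints = len(word)                   # what Python's len(w) counts
utf8_bytes = len(word.encode("utf-8"))   # what Rust's w.len() counts

print(codepoints, codepoints > 2)  # 2 False: dropped by the length filter
print(utf8_bytes, utf8_bytes > 2)  # 6 True: kept by the length filter

# A repeated query word: a list keeps both copies, a set keeps one.
query_words = "token token expired".split()
as_list = [w for w in query_words if len(w) > 2]
as_set = {w for w in query_words if len(w) > 2}

line = "auth token expired"
hits_list = sum(1 for w in as_list if w in line)
hits_set = sum(1 for w in as_set if w in line)
print(hits_list, hits_set)  # 3 2
```

Neither case can occur with all-ASCII, non-repeating queries, which is consistent with the existing fixtures never catching them.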

The two scorers now agree byte-for-byte on word-overlap and CJK-bigram scoring; non-CJK output is unchanged (the 53 existing tests and the parity fixtures stay green).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Changes Made

  • crates/headroom-core/src/transforms/search_compressor.rs + headroom/transforms/search_compressor.py: add is_cjk_char/_is_cjk_char and cjk_bigrams/_cjk_bigrams (identical ranges + logic), union CJK bigrams into the query match set, and align the Rust word set to Python (chars().count() length, BTreeSet dedup).
  • tests/test_search_compressor_cjk.py + a Rust unit test: CJK bigram extraction (same input/expected in both languages) and a CJK query boosting a partially-overlapping line.
  • Corrected a stale _score_matches docstring that referenced a non-existent parity assertion; it now states honestly how the two sides are pinned (test-equal for word-overlap + CJK bigrams; a few error-boost keywords still diverge, fixed only Rust-side).
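
For readers without the diff open, the two helpers might look roughly like the Python sketch below. The code point ranges are an assumption (common CJK, kana, and Hangul blocks), since the actual table is not shown in this description, and the sketch only pairs adjacent characters that are both CJK, which may differ from the real boundary handling.

```python
# Assumed ranges for illustration only; the PR's real table may differ.
_CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk_char(ch: str) -> bool:
    """True if the single character falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)


def cjk_bigrams(text: str) -> set[str]:
    """Return every pair of adjacent characters that are both CJK."""
    return {
        a + b
        for a, b in zip(text, text[1:])
        if is_cjk_char(a) and is_cjk_char(b)
    }


print(sorted(cjk_bigrams("认证令牌")))
print(cjk_bigrams("plain ascii"))  # set()
```

Returning a set keeps the result deduplicated, in line with the Python side's set-based word collection.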

Testing

  • Unit tests pass (cargo test + pytest)
  • Linting passes (cargo clippy / cargo fmt / ruff / mypy)
  • New tests added for new functionality
  • Manual testing performed (see Real Behavior Proof)

Test Output

$ cargo test -p headroom-core --lib search_compressor
test result: ok. 16 passed; 0 failed

$ .venv/bin/python -m pytest tests/test_search_compressor_cjk.py \
    tests/test_transforms_search_compressor.py tests/test_search_compressor.py
55 passed   # 2 new CJK tests + 53 existing (no regression)

Real Behavior Proof

  • Environment: macOS (Darwin), Rust via cargo, Python in a uv venv (_core rebuilt on this branch), branch feat/search-compressor-cjk off main.
  • Exact command / steps: scored a CJK content line against a longer CJK query whose whole form is not a substring of the line.
  • Observed result: for content src/a.py:10:认证令牌已过期需要重新登录 and query 认证令牌缓存淘汰策略 (the whole query is NOT a substring of the line, but its bigrams are), the line now scores > 0 (bigrams 认证 / 证令 / 令牌 match); before, it scored 0. An ASCII-only line still scores 0. All 53 existing search-compressor tests are unchanged. cjk_bigrams("认证令牌") returns {认证, 证令, 令牌} in both Rust and Python.
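
The observed result can be approximated with the short Python sketch below. It is not the project's scorer: `match_terms` and `matched_terms` are hypothetical names, the CJK test is simplified to the basic CJK Unified Ideographs block, the ASCII-only line is invented, and it reports which terms matched rather than a numeric score.

```python
def is_basic_cjk(ch: str) -> bool:
    # Simplification: only the CJK Unified Ideographs block (U+4E00..U+9FFF).
    return "\u4e00" <= ch <= "\u9fff"


def match_terms(query: str) -> set[str]:
    """Whitespace words (longer than 2 codepoints) unioned with CJK bigrams."""
    words = {w for w in query.split() if len(w) > 2}
    bigrams = {
        a + b
        for a, b in zip(query, query[1:])
        if is_basic_cjk(a) and is_basic_cjk(b)
    }
    return words | bigrams


def matched_terms(line: str, query: str) -> set[str]:
    """Terms from the query's match set that occur in the line."""
    return {t for t in match_terms(query) if t in line}


query = "认证令牌缓存淘汰策略"
cjk_line = "src/a.py:10:认证令牌已过期需要重新登录"
ascii_line = "src/b.py:20:def login(user):"  # hypothetical ASCII-only line

print(sorted(matched_terms(cjk_line, query)))    # the three shared bigrams
print(sorted(matched_terms(ascii_line, query)))  # []
```

The whole query still fails the substring test against `cjk_line`; only the bigram terms produce a match.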

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation — N/A (internal relevance scoring)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md — N/A: internal relevance-scoring fix, no user-facing surface change

Additional Notes

  • The two parity divergences (byte-vs-codepoint length, Vec-vs-set dedup) were pre-existing and only reachable with non-ASCII or repeated-word queries — the all-ASCII fixtures never exercised them. This PR brings both sides back to byte-exact for the word-overlap + CJK-bigram scoring. The remaining error-boost keyword divergence is pre-existing (fixed only Rust-side in the 3e.1 port) and is now documented in the code rather than glossed over.

The relevance scorer split the query on whitespace, so a spaceless CJK query
matched content only when the WHOLE query was a literal substring of a line.
Add CJK character bigrams so a longer CJK query boosts lines sharing a substring.

Also fixes two latent Rust/Python parity divergences the ASCII-only fixtures
masked: Rust counted word length in BYTES (Python: codepoints) and used a Vec
(Python: a deduped set). Rust now uses chars().count() and a BTreeSet, matching
Python. Byte-exact in both; non-CJK output unchanged (parity fixtures green).
@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

PR governance

This PR does not yet satisfy the required template fields:

  • Fill in Real Behavior Proof: Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) on Jul 3, 2026