fix(search-compressor): CJK-aware relevance + harden Rust/Python parity #1749
Open
lifeodyssey wants to merge 1 commit into
The relevance scorer split the query on whitespace, so a spaceless CJK query matched content only when the WHOLE query was a literal substring of a line. Add CJK character bigrams so a longer CJK query boosts lines sharing a substring. Also fixes two latent Rust/Python parity divergences the ASCII-only fixtures masked: Rust counted word length in BYTES (Python: codepoints) and used a Vec (Python: a deduped set). Rust now uses chars().count() and a BTreeSet, matching Python. Byte-exact in both; non-CJK output unchanged (parity fixtures green).
PR governance
This PR does not yet satisfy the required template fields.
Please update the PR body, or move the PR back to draft while it is still in progress.
Description
The search compressor's relevance scorer (score_matches, present in both the Rust runtime path and the Python legacy mirror) split the query on whitespace. A spaceless CJK query therefore matched a result line only when the whole query was a literal substring of that line. Partial overlaps never boosted relevant lines, so correct matches were dropped when the result set was over budget.

This adds CJK character bigrams to the query match set, so a longer CJK query boosts lines that share a substring with it. It also fixes two latent Rust/Python parity divergences that the ASCII-only fixtures had masked:
- Word length: Rust counted in bytes (w.len()), Python in codepoints (len(w)), so a CJK word crossed the > 2 threshold differently. Rust now uses chars().count().
- Dedup: Rust collected query words into a Vec (no dedup), Python into a set, so a repeated query word double-counted in Rust. Rust now uses a BTreeSet.

Both scorers are byte-exact now; non-CJK output is unchanged (the 53 existing tests and the parity fixtures stay green).
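The bigram idea described above can be sketched in Python as follows. This is a minimal illustration, not the PR's code: the Unicode ranges in is_cjk_char are an assumption (the description does not list which blocks are covered), and overlap_score is a hypothetical toy scorer that only counts matching terms, whereas the real score_matches applies its own weights and boosts.

```python
def is_cjk_char(ch: str) -> bool:
    """Return True for a CJK character (ranges are assumed, not from the PR)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def cjk_bigrams(text: str) -> set[str]:
    """Collect every pair of adjacent CJK characters in text."""
    return {
        a + b
        for a, b in zip(text, text[1:])
        if is_cjk_char(a) and is_cjk_char(b)
    }


def overlap_score(line: str, query: str) -> int:
    """Hypothetical toy score: number of query terms found in the line."""
    # Whitespace-split words longer than 2 codepoints, deduplicated.
    words = {w for w in query.lower().split() if len(w) > 2}
    # Union in the CJK bigrams so partial overlaps can still match.
    terms = words | cjk_bigrams(query)
    lowered = line.lower()
    return sum(1 for term in terms if term in lowered)


if __name__ == "__main__":
    line = "src/a.py:10:认证令牌已过期需要重新登录"
    query = "认证令牌缓存淘汰策略"
    print(cjk_bigrams("认证令牌"))
    print(overlap_score(line, query))
```

With whitespace splitting alone, the query above is a single term that is not a substring of the line, so nothing matches; with bigrams in the term set, the shared pairs 认证, 证令 and 令牌 each count.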
Type of Change
Changes Made
- crates/headroom-core/src/transforms/search_compressor.rs + headroom/transforms/search_compressor.py: add is_cjk_char / _is_cjk_char and cjk_bigrams / _cjk_bigrams (identical ranges + logic), union CJK bigrams into the query match set, and align the Rust word set to Python (chars().count() length, BTreeSet dedup).
- tests/test_search_compressor_cjk.py + a Rust unit test: CJK bigram extraction (same input/expected in both languages) and a CJK query boosting a partially-overlapping line.
- Corrected the _score_matches docstring, which referenced a non-existent parity assertion; it now states honestly how the two sides are pinned (test-equal for word-overlap + CJK bigrams; a few error-boost keywords still diverge, fixed only Rust-side).
Testing
- Tests (cargo test + pytest)
- Lint, format, and type checks (cargo clippy / cargo fmt / ruff / mypy)
Test Output
Real Behavior Proof
- _core rebuilt on this branch; branch feat/search-compressor-cjk off main.
- With the line src/a.py:10:认证令牌已过期需要重新登录 and the query 认证令牌缓存淘汰策略 (the whole query is NOT a substring of the line, but three of its bigrams are), the line now scores > 0 (bigrams 认证 / 证令 / 令牌 match); before, it scored 0. An ASCII-only line still scores 0. All 53 existing search-compressor tests are unchanged.
- cjk_bigrams("认证令牌") returns {认证, 证令, 令牌} in both Rust and Python.
Review Readiness
Checklist
Additional Notes
The two parity divergences (byte-vs-codepoint word length, Vec-vs-set dedup) were pre-existing and only reachable with non-ASCII or repeated-word queries; the all-ASCII fixtures never exercised them. This PR brings both sides back to byte-exact for the word-overlap + CJK-bigram scoring. The remaining error-boost keyword divergence is pre-existing (fixed only Rust-side in the 3e.1 port) and is now documented in the code rather than glossed over.
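To make the two divergences concrete, here is a small Python sketch that emulates both behaviours side by side. The helper names query_words and count_hits and the sample query are hypothetical; the only details taken from the description are the > 2 length threshold, byte length versus codepoint length, and list versus set collection.

```python
def query_words(query: str, *, by_bytes: bool, dedup: bool) -> list[str]:
    """Keep whitespace-split query words whose length exceeds 2.

    by_bytes=True measures UTF-8 bytes, as Rust's str::len does;
    by_bytes=False measures codepoints, as Python's len does.
    dedup=False keeps repeats (Vec-like); dedup=True drops them (set-like).
    """
    def length(word: str) -> int:
        return len(word.encode("utf-8")) if by_bytes else len(word)

    kept = [w for w in query.lower().split() if length(w) > 2]
    return sorted(set(kept)) if dedup else kept


def count_hits(line: str, words: list[str]) -> int:
    """Count how many entries of words occur in the line (repeats count again)."""
    lowered = line.lower()
    return sum(1 for w in words if w in lowered)


if __name__ == "__main__":
    query = "令牌 token token"          # one 2-codepoint CJK word, one repeated word
    line = "token 令牌 refresh"

    old_rust_like = query_words(query, by_bytes=True, dedup=False)
    python_like = query_words(query, by_bytes=False, dedup=True)

    print(old_rust_like, count_hits(line, old_rust_like))
    print(python_like, count_hits(line, python_like))
```

The two-character word 令牌 is 6 bytes in UTF-8, so the byte-based check keeps it while the codepoint-based check drops it, and the repeated word is counted twice only in the list-based variant. An all-ASCII query with no repeats gives the same result either way, which is consistent with the fixtures never catching the difference.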