fix(adaptive-sizer): char bigrams for spaceless CJK items #1748
Open
lifeodyssey wants to merge 1 commit into
compute_unique_bigram_curve word-split on whitespace, so a spaceless CJK item became one (whole_string, "") pseudo-bigram -> the coverage curve grew ~1 per item, the kneedle knee detector found no knee, and CJK lists under-compressed. Spaceless CJK items now use character bigrams, giving a real coverage curve. Mirrored byte-exactly in Rust + Python (same reference-test curve values); non-CJK items (whitespace-bearing or spaceless-ASCII) are byte-identical, so the smart_crusher parity fixtures are unchanged.
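The before/after behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the PR's code: the code-point ranges in `_CJK_RANGES`, the rule for deciding that an item is "spaceless CJK" (no whitespace and at least one CJK character), the handling of single-character and empty items, and the helper names `item_bigrams` and `unique_bigram_curve` are all assumptions made for the example.

```python
from __future__ import annotations

# Illustrative ranges only (assumption); the ranges used by the PR's
# is_cjk_char / _is_cjk_char are not reproduced here.
_CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk_char(ch: str) -> bool:
    """Return True if the character falls in one of the sketch's CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _CJK_RANGES)


def item_bigrams(item: str, char_bigrams_for_cjk: bool = True) -> set[tuple[str, str]]:
    """Bigrams for one item; the flag switches the character-bigram branch on or off."""
    spaceless = not any(c.isspace() for c in item)
    if (
        char_bigrams_for_cjk
        and spaceless
        and len(item) >= 2
        and any(is_cjk_char(c) for c in item)
    ):
        # Character-bigram branch: adjacent character pairs.
        return set(zip(item, item[1:]))
    words = item.split()
    if len(words) == 1:
        # Word path: a single "word" yields one (word, "") pseudo-bigram.
        return {(words[0], "")}
    # Word path: adjacent word pairs (an empty item yields no bigrams here).
    return set(zip(words, words[1:]))


def unique_bigram_curve(items: list[str], char_bigrams_for_cjk: bool = True) -> list[int]:
    """Cumulative count of distinct bigrams seen after each item."""
    seen: set[tuple[str, str]] = set()
    curve = []
    for item in items:
        seen |= item_bigrams(item, char_bigrams_for_cjk)
        curve.append(len(seen))
    return curve


cjk = ["数据库连接失败", "数据库连接成功"]
print(unique_bigram_curve(cjk, char_bigrams_for_cjk=False))  # [1, 2]
print(unique_bigram_curve(cjk))                              # [6, 8]
print(unique_bigram_curve(["the cat", "the dog", "a fish"])) # [1, 2, 3]
```

With the character-bigram branch disabled, each CJK item contributes exactly one new pseudo-bigram, so the curve is a straight line. With it enabled, the second item shares four of its six character bigrams with the first, and the curve flattens accordingly.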
Contributor
PR governance
This PR does not yet satisfy the required template fields:
Please update the PR body, or move the PR back to draft while it is still in progress.
JerrettDavis
approved these changes
Jul 3, 2026
JerrettDavis
left a comment
Collaborator
Reviewed the adaptive-sizer CJK bigram change. The Python and Rust implementations stay aligned, ASCII and empty-item behavior are explicitly pinned, and the CJK cases now produce a useful coverage curve instead of one pseudo-bigram per item. Checks are green; this looks ready.
Description
compute_unique_bigram_curve (the adaptive sizer's coverage-curve builder, mirrored in Rust and Python) word-splits each item on whitespace to form word bigrams. A spaceless CJK item has no whitespace, so it collapsed into one (whole_string, "") pseudo-bigram: the coverage curve then grew ~1 per item, the kneedle knee detector found no knee, and CJK lists under-compressed.
Spaceless CJK items now use character bigrams, producing a real coverage curve. Mirrored byte-exactly in Rust and Python (identical reference-test curve values). Non-CJK items (anything whitespace-bearing or spaceless-ASCII) are byte-identical to before, so the smart_crusher parity fixtures are unchanged.
Type of Change
Changes Made
crates/headroom-core/src/transforms/adaptive_sizer.rs + headroom/transforms/adaptive_sizer.py: add is_cjk_char / _is_cjk_char (identical code-point ranges) and a spaceless-CJK character-bigram branch in compute_unique_bigram_curve.
tests/test_adaptive_sizer.py: CJK curve, single-char CJK, ASCII-unchanged, empty-item. The Rust and Python reference values are identical.
Testing
cargo test + pytest
cargo clippy / cargo fmt / ruff / mypy
Test Output
Real Behavior Proof
feat/adaptive-sizer-cjk off main.
compute_unique_bigram_curve on a CJK list and on ASCII lists, in both implementations.
compute_unique_bigram_curve(["数据库连接失败", "数据库连接成功"]) returns [6, 8] in both Rust and Python (before: ~[1, 2], i.e. one pseudo-bigram per item and no coverage signal). ASCII curves are unchanged: ["the cat", "the dog", "a fish"] → [1, 2, 3]. The smart_crusher parity suite (all-ASCII fixtures) stays green, confirming non-CJK output is byte-identical.
The Rust test (vec![6, 8]) and the Python test ([6, 8]) use the same inputs and the same expected values, so the two implementations are pinned to agree.
Review Readiness
Checklist
Additional Notes
smart_crusher parity fixtures need no re-recording.
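As background on the "no knee" failure mode mentioned in the Description: a curve that grows by exactly 1 per item is a straight line, and a straight line has no point of maximum curvature to detect. The sketch below illustrates this with a simplified difference-curve check in the spirit of Kneedle (normalize both axes to [0, 1], then look for the largest gap between the curve and the diagonal). It is not the project's knee detector; the function name `knee_index`, the `min_bulge` threshold, and the second sample curve are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def knee_index(curve: list[int], min_bulge: float = 0.05) -> int | None:
    """Index of the largest bulge above the diagonal, or None if the curve is ~linear.

    Simplified stand-in for a knee detector (assumption): it takes the global
    maximum of the normalized difference curve instead of Kneedle's
    local-maximum and threshold logic.
    """
    n = len(curve)
    if n < 3 or curve[-1] == curve[0]:
        return None
    span = curve[-1] - curve[0]
    best_i, best_d = None, min_bulge
    for i, y in enumerate(curve):
        x_n = i / (n - 1)            # normalized position in the list
        y_n = (y - curve[0]) / span  # normalized coverage
        d = y_n - x_n                # height above the diagonal
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# One pseudo-bigram per item: perfectly linear, so no knee is reported.
print(knee_index([1, 2, 3, 4, 5, 6]))  # None

# A made-up saturating curve: coverage stops growing after the third item.
print(knee_index([6, 8, 9, 9, 9, 9]))  # 2
```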