fix(adaptive-sizer): char bigrams for spaceless CJK items #1748

Open
lifeodyssey wants to merge 1 commit into headroomlabs-ai:main from lifeodyssey:feat/adaptive-sizer-cjk
Conversation

@lifeodyssey

Contributor

Description

compute_unique_bigram_curve — the adaptive sizer's coverage-curve builder, mirrored in Rust and Python — word-splits each item on whitespace to form word bigrams. A spaceless CJK item has no whitespace, so it collapsed into one (whole_string, "") pseudo-bigram: the coverage curve then grew ~1 per item, the kneedle knee detector found no knee, and CJK lists under-compressed.

Spaceless CJK items now use character bigrams, producing a real coverage curve. Mirrored byte-exactly in Rust and Python (identical reference-test curve values). Non-CJK items — anything whitespace-bearing or spaceless-ASCII — are byte-identical to before, so the smart_crusher parity fixtures are unchanged.
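
For readers without the diff at hand, the pre-fix failure mode described above can be reproduced with a minimal sketch. The function name `word_bigram_curve` and the exact pairing of adjacent words are assumptions made for illustration; this is not the project's code, only a model of the behavior the description reports.

```python
def word_bigram_curve(items: list[str]) -> list[int]:
    """Illustrative model of the pre-fix curve: cumulative unique word bigrams."""
    seen: set[tuple[str, str]] = set()
    curve: list[int] = []
    for item in items:
        words = item.split()  # whitespace split, as the description states
        if len(words) == 1:
            # A single "word" collapses to one (whole_string, "") pseudo-bigram.
            seen.add((words[0], ""))
        else:
            # Assumed pairing: each word with its right-hand neighbour.
            seen.update(zip(words, words[1:]))
        curve.append(len(seen))
    return curve


# A spaceless CJK item is one "word", so the curve grows by 1 per item.
print(word_bigram_curve(["数据库连接失败", "数据库连接成功"]))  # [1, 2]
print(word_bigram_curve(["the cat", "the dog", "a fish"]))        # [1, 2, 3]
```

With every item adding exactly one entry, the curve is a straight line, which is why a knee detector has nothing to find.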

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Changes Made

  • crates/headroom-core/src/transforms/adaptive_sizer.rs + headroom/transforms/adaptive_sizer.py: add is_cjk_char/_is_cjk_char (identical code-point ranges) and a spaceless-CJK character-bigram branch in compute_unique_bigram_curve.
  • Rust unit tests + tests/test_adaptive_sizer.py: CJK curve, single-char CJK, ASCII-unchanged, empty-item — the Rust and Python reference values are identical.

Testing

  • Unit tests pass (cargo test + pytest)
  • Linting passes (cargo clippy / cargo fmt / ruff / mypy)
  • New tests added for new functionality
  • Manual testing performed (see Real Behavior Proof)

Test Output

$ cargo test -p headroom-core --lib adaptive_sizer
test result: ok. 35 passed; 0 failed

$ .venv/bin/python -m pytest tests/test_adaptive_sizer.py
20 passed

$ .venv/bin/python -m pytest -k "smart_crusher and parity"
18 passed, 6 skipped   # non-CJK fixtures unchanged

Real Behavior Proof

  • Environment: macOS (Darwin), Rust via cargo, Python in a uv venv, branch feat/adaptive-sizer-cjk off main.
  • Exact command / steps: called compute_unique_bigram_curve on a CJK list and on ASCII lists, in both implementations.
  • Observed result: compute_unique_bigram_curve(["数据库连接失败", "数据库连接成功"]) returns [6, 8] in both Rust and Python (before: ~[1, 2], i.e. one pseudo-bigram per item and no coverage signal). ASCII curves are unchanged: ["the cat", "the dog", "a fish"] → [1, 2, 3]. The smart_crusher parity suite (all-ASCII fixtures) stays green, confirming non-CJK output is byte-identical.
  • Byte-exact parity: the Rust reference test (vec![6, 8]) and the Python test ([6, 8]) use the same inputs and the same expected values, so the two implementations are pinned to agree.
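
The [6, 8] reference values can also be checked by hand. The snippet below is only a worked count using a hypothetical `char_bigrams` helper, not the project's test: the first item has 7 characters and therefore 6 adjacent pairs, and the second item shares the first four pairs and adds two new ones.

```python
def char_bigrams(s: str) -> list[str]:
    # Adjacent character pairs; Python indexes str by code point.
    return [s[i:i + 2] for i in range(len(s) - 1)]


first = char_bigrams("数据库连接失败")
second = char_bigrams("数据库连接成功")

seen = set(first)
after_first = len(seen)                               # 6
new_in_second = [b for b in second if b not in seen]  # ['接成', '成功']
seen.update(second)
after_second = len(seen)                              # 8

print(first)
print(new_in_second)
print([after_first, after_second])
```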

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation — N/A (internal sizing heuristic)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md — N/A: internal sizing-heuristic fix, no user-facing surface change

Additional Notes

  • This is a parity-locked function (Rust and Python must agree byte-for-byte). The fix is CJK-gated, so non-CJK output is byte-identical and the smart_crusher parity fixtures need no re-recording.

compute_unique_bigram_curve word-split on whitespace, so a spaceless CJK item
became one (whole_string, "") pseudo-bigram -> the coverage curve grew ~1 per
item, the kneedle knee detector found no knee, and CJK lists under-compressed.
Spaceless CJK items now use character bigrams, giving a real coverage curve.
Mirrored byte-exactly in Rust + Python (same reference-test curve values);
non-CJK items (whitespace-bearing or spaceless-ASCII) are byte-identical, so the
smart_crusher parity fixtures are unchanged.
@github-actions

github-actions Bot commented Jul 3, 2026

Contributor

PR governance

This PR does not yet satisfy the required template fields:

  • Fill in Real Behavior Proof: Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

@github-actions github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jul 3, 2026

@JerrettDavis JerrettDavis left a comment

Collaborator

Reviewed the adaptive-sizer CJK bigram change. The Python and Rust implementations stay aligned, ASCII and empty-item behavior are explicitly pinned, and the CJK cases now produce a useful coverage curve instead of one pseudo-bigram per item. Checks are green; this looks ready.

Labels

status: needs author action (Pull request body or readiness checklist still needs author updates)
