feat: optimize row based shuffle with DenseRow by zhangxffff · Pull Request #637 · bytedance/bolt

zhangxffff · 2026-06-09T14:19:35Z

What problem does this PR solve?

Row-based shuffle serializes each row with CompactRow, which pads
values to fixed widths, carries a per-row null bitmap, and stores integers at
full width. For the common Spark case of small / narrow / nullable integer
columns this wastes a large fraction of shuffle bytes.

This PR adds DenseRow, a no-waste variable-length row format, and wires it
into the Spark row-based shuffle path behind a pluggable RowFormat option.
The existing CompactRow path is unchanged and remains the default.

Issue Number: close #540

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

DenseRow codec (bolt/row/dense/) — a column-batched serializer, sibling to
CompactRow / UnsafeRowFast, that processes all rows at once. The on-wire
layout is the "dense" form:

variable-length (varint/zigzag) integer values,
nulls fused into the structure bytes — no null bitmap,
no alignment padding,
level-hoisted nesting for ARRAY/MAP/ROW.
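
For readers unfamiliar with the varint/zigzag building blocks named above, here is a minimal sketch of the textbook (protobuf-style) form. It is not DenseRow's wire grammar, which also fuses nulls into the structure bytes and may differ in other ways; the function names are illustrative and are not taken from IntVarint.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ZigZag maps signed to unsigned so small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t zigzagEncode(int64_t v) {
  return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

inline int64_t zigzagDecode(uint64_t u) {
  return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
}

// Bytes needed by a LEB128 varint: 7 payload bits per byte.
inline size_t varintSize(uint64_t u) {
  size_t n = 1;
  while (u >= 0x80) {
    u >>= 7;
    ++n;
  }
  return n;
}

// Appends u as a LEB128 varint: low 7 bits first, high bit set while
// more bytes follow.
inline void appendVarint(std::vector<uint8_t>& out, uint64_t u) {
  while (u >= 0x80) {
    out.push_back(static_cast<uint8_t>((u & 0x7F) | 0x80));
    u >>= 7;
  }
  out.push_back(static_cast<uint8_t>(u));
}
```

With this textbook encoding, any value in [-64, 63] occupies a single byte, whereas a fixed-width int64 always takes eight.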

Implementation is split into:

IntVarint.h — varint/zigzag + nullable-int wire codec, with a SIMD
(xsimd/AVX2) batched size pass and a BMI2 (pdep) varint writer for long
values; scalar fast path otherwise.
DenseRowScalar*.cpp — flat fast path (identity / constant / dictionary
decoded inputs).
DenseRowGeneral*.cpp — general path for arbitrary nested typed data via a two-pass
(size, then write) slot-tree walk shared by both passes.
DenseRow.{h,cpp} — public API mirroring CompactRow (rowSizes() /
serialize() / deserialize()), grammar documented at the top of the .cpp.
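
The two-pass (size, then write) structure can be sketched as follows. This is a simplified illustration under stated assumptions, not the DenseRow code: the single-column `Row` type, the tag-byte layout, and the `sketch*` function names are all hypothetical, and the real codec walks a slot tree rather than a flat vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical one-column row: a nullable int64. Wire form assumed here:
// 1 tag byte (0 = null, 1 = present), then 8 little-endian value bytes
// when present.
using Row = std::optional<int64_t>;

// Pass 1: size of each row in bytes.
std::vector<size_t> sketchRowSizes(const std::vector<Row>& rows) {
  std::vector<size_t> sizes;
  sizes.reserve(rows.size());
  for (const auto& r : rows) {
    sizes.push_back(r.has_value() ? 9 : 1);
  }
  return sizes;
}

// Pass 2: write each row at its precomputed offset, then check that the
// bytes written agree with the size pass.
std::vector<uint8_t> sketchSerialize(const std::vector<Row>& rows) {
  const auto sizes = sketchRowSizes(rows);
  std::vector<size_t> offsets(rows.size() + 1, 0);
  for (size_t r = 0; r < rows.size(); ++r) {
    offsets[r + 1] = offsets[r] + sizes[r];
  }
  std::vector<uint8_t> out(offsets.back());
  for (size_t r = 0; r < rows.size(); ++r) {
    uint8_t* p = out.data() + offsets[r];
    const uint8_t* begin = p;
    if (!rows[r]) {
      *p++ = 0;
    } else {
      *p++ = 1;
      const auto u = static_cast<uint64_t>(*rows[r]);
      for (int i = 0; i < 8; ++i) {
        *p++ = static_cast<uint8_t>(u >> (8 * i));
      }
    }
    if (static_cast<size_t>(p - begin) != sizes[r]) {
      throw std::logic_error("size pass and write pass disagree");
    }
  }
  return out;
}
```

Sizing first lets the writer allocate one exact buffer and place rows at arbitrary offsets, at the cost of visiting the data twice.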

Shuffle integration (bolt/shuffle/sparksql/) — new row::RowFormat enum
({DENSE, COMPACT}) added to ShuffleWriterOptions / ShuffleReaderOptions
(default COMPACT). The columnar↔row converters and the reader/writer dispatch
on it; the wire framing (rowSize | rowData) is unchanged, so only the row body
encoding differs. Writer and reader must be configured with the same format.
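
A rough sketch of how the option and the framing fit together. Only the enum values and the COMPACT default come from this PR description; the struct names, the `rowFormat` field name, the 4-byte host-endian size prefix, and the `checkSameFormat` helper are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

enum class RowFormat { DENSE, COMPACT };

// Hypothetical stand-ins for ShuffleWriterOptions / ShuffleReaderOptions.
struct WriterOptionsSketch {
  RowFormat rowFormat = RowFormat::COMPACT;
};
struct ReaderOptionsSketch {
  RowFormat rowFormat = RowFormat::COMPACT;
};

// Framing: rowSize | rowData. The prefix width and byte order used here
// are assumptions; the body bytes are whatever the chosen codec produced.
void appendFrame(std::vector<uint8_t>& out, const std::vector<uint8_t>& body) {
  const auto size = static_cast<uint32_t>(body.size());
  uint8_t prefix[sizeof(size)];
  std::memcpy(prefix, &size, sizeof(size));
  out.insert(out.end(), prefix, prefix + sizeof(size));
  out.insert(out.end(), body.begin(), body.end());
}

// In this sketch the frame carries no format tag, so a mismatch cannot be
// detected from the bytes alone; compare the configuration up front.
void checkSameFormat(const WriterOptionsSketch& w, const ReaderOptionsSketch& r) {
  if (w.rowFormat != r.rowFormat) {
    throw std::invalid_argument("writer and reader RowFormat differ");
  }
}
```

Because the framing is identical for both formats, a reader configured for the wrong format would still find row boundaries correctly and fail only when decoding the row body.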

Tests — DenseRowTest.cpp (17 cases: scalars, wide/mixed rows, bigint &
hugeint edges, arrays / nested arrays / maps, empty containers, dictionary &
constant encoded inputs, non-contiguous serialize offsets, and golden-byte wire
assertions). The shuffle matrix test is extended to exercise the DENSE format.

Benchmark — DenseRowSerializeBenchmark.cpp compares DenseRow against
UnsafeRow/CompactRow for serialize, deserialize, and serialized size across 20
cases.

Benchmark Result

Environment: Intel i7-12700KF, -O3 with AVX2+BMI2, VectorFuzzer vectorSize=1000,
pinned core. All times are ns/row; size is bytes/row. Compared against the
two existing Spark row formats (UnsafeRow, CompactRow).

Highlights

Size: up to 9× smaller than CompactRow on small ints (bigint < 2⁸: 9→1 bytes/row),
about 5× on small-int arrays and nested arrays. On par for full-random int64 and wide strings.
Deserialize: 2.7–2.8× faster than CompactRow on multi-column flat rows
(10 scalars: 89→32 ns/row), and faster on every single-column flat scalar except
bigint random + null (5.4 vs 5.6 ns/row).
Trade-off: container serialize is slower than CompactRow, which bulk-copies
array/map data, so DenseRow trades a higher encode cost for a smaller payload.

Serialized size (bytes/row)

| case | UnsafeRow | CompactRow | DenseRow | vs Compact |
| --- | --- | --- | --- | --- |
| bigint, \|v\| < 2⁸ | 16 | 9 | 1 | 9.0× |
| bigint, \|v\| < 2³² | 16 | 9 | 4 | 2.3× |
| bigint, random | 16 | 9 | 9 | 1.0× |
| bigint, random + 40% null | 16 | 9 | 6 | 1.5× |
| double | 16 | 9 | 8 | 1.1× |
| string(8) | 24 | 13 | 9 | 1.4× |
| string(100) | 120 | 105 | 101 | 1.04× |
| 5 scalars | 48 | 23 | 23 | 1.0× |
| 5 scalars, small int | 48 | 23 | 16 | 1.4× |
| 10 scalars | 88 | 50 | 54 | 0.9× |
| 10 scalars, small int | 88 | 50 | 31 | 1.6× |
| array | 116 | 91 | 101 | 0.9× |
| array⟨bigint < 2⁸⟩ | 81 | 56 | 11 | 5.1× |
| array⟨bigint < 2³²⟩ | 78 | 53 | 30 | 1.8× |
| nested array | 265 | 135 | 63 | 2.1× |
| nested⟨bigint < 2⁸⟩ | 368 | 272 | 53 | 5.1× |
| nested⟨bigint < 2³²⟩ | 351 | 259 | 133 | 1.9× |
| map | 177 | 132 | 136 | 1.0× |
| map⟨bigint < 2⁸⟩ | 127 | 82 | 32 | 2.6× |
| map⟨bigint < 2³²⟩ | 123 | 78 | 50 | 1.6× |

Note

This benchmark uses the columnar CompactRow serde interface. Row-based shuffle still uses the row-based CompactRow serde interface, so DenseRow would likely show an even greater improvement over CompactRow in real row-based shuffle scenarios. CompactRow serializes and deserializes array/map data with a bulk copy, so it is faster than DenseRow for these types.

Serialize / deserialize time (ns/row; C = CompactRow, D = DenseRow)

| case | ser C | ser D | deser C | deser D | round-trip |
| --- | --- | --- | --- | --- | --- |
| bigint < 2⁸ | 1.8 | 3.6 | 5.5 | 2.4 | 1.2× faster |
| bigint < 2³² | 1.8 | 4.5 | 4.9 | 4.2 | 0.8× |
| bigint random | 1.8 | 4.1 | 5.0 | 4.1 | 0.8× |
| bigint random + null | 2.4 | 4.4 | 5.4 | 5.6 | 0.8× |
| double | 1.7 | 2.3 | 5.5 | 1.8 | 1.8× faster |
| string(8) | 7.7 | 6.8 | 8.3 | 6.4 | 1.2× faster |
| string(100) | 9.9 | 8.1 | 10.5 | 8.9 | 1.2× faster |
| 5 scalars | 7.1 | 8.3 | 27.4 | 10.2 | 1.9× faster |
| 5 scalars, small | 6.7 | 7.6 | 25.7 | 7.6 | 2.1× faster |
| 10 scalars | 17.8 | 29.0 | 89.2 | 31.5 | 1.8× faster |
| 10 scalars, small | 17.9 | 16.6 | 65.1 | 15.9 | 2.6× faster |
| array | 16.4 | 99.1 | 57.0 | 76.5 | 0.4× |
| array⟨bigint < 2⁸⟩ | 12.3 | 22.2 | 21.5 | 40.4 | 0.5× |
| array⟨bigint < 2³²⟩ | 11.9 | 29.2 | 22.4 | 43.6 | 0.5× |
| nested array | 142.4 | 142.7 | 84.0 | 111.2 | 0.9× |
| nested⟨bigint < 2⁸⟩ | 92.1 | 186.5 | 143.3 | 221.0 | 0.6× |
| nested⟨bigint < 2³²⟩ | 87.4 | 174.5 | 103.0 | 284.2 | 0.4× |
| map | 25.8 | 97.1 | 103.3 | 91.0 | 0.7× |
| map⟨bigint < 2⁸⟩ | 21.8 | 28.2 | 45.8 | 52.5 | 0.8× |
| map⟨bigint < 2³²⟩ | 21.7 | 32.7 | 47.4 | 57.1 | 0.8× |

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.


Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR


Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)


guhaiyan0221 · 2026-06-24T15:28:50Z

+    const uint8_t*& in,
+    const uint8_t* end,
+    bool& isNull,
+    int64_t& value) {


Check bounds (in) before accessing *in:
if (FOLLY_UNLIKELY(in >= end)) { return false; }

Fixed. Added early checks in readVarint and readNullableInt64 to ensure in < end before the first byte access. The existing post-parse checks still validate that the final cursor does not pass end.

For serialize, I added a row-size validation after writing: the number of bytes written must match rowSizes[r], with row/offset info included on failure.

I also checked the perf impact. The cost of the deserialize guards is small and within measurement noise. The serialize validation shows about a 4% regression in the smallest case (bigint_lt_2pow8) and less than 2% in larger cases.
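
As an illustration of the kind of guard discussed in this thread, here is a minimal bounds-checked varint reader. It is a sketch, not the code in this PR: it assumes a plain LEB128 layout, checks the cursor before every byte rather than using an early check plus a post-parse check, omits FOLLY_UNLIKELY so that it stays self-contained, and the name readVarintChecked is hypothetical.

```cpp
#include <cstdint>

// Reads a LEB128 varint from [in, end). Returns false on truncated or
// over-long input; `in` and `value` are updated only on success.
inline bool readVarintChecked(
    const uint8_t*& in,
    const uint8_t* end,
    uint64_t& value) {
  const uint8_t* p = in;
  uint64_t result = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (p >= end) {
      return false;  // guard before every byte access, including the first
    }
    const uint8_t byte = *p++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      in = p;
      value = result;
      return true;
    }
  }
  return false;  // continuation bit still set after 10 bytes: malformed
}
```

Leaving the cursor untouched on failure lets the caller report the offset of the bad row without extra bookkeeping.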

guhaiyan0221 · 2026-06-24T15:34:56Z

+    forEachLivePos(out, r, [&](uint32_t p) {
+      uint64_t v{0};
+      BOLT_USER_CHECK(
+          readVarint(c.cur, c.end, v),


Do we need to check c.cur < c.end here?

Added in readVarint

feat: optimize row based shuffle with DenseRow

7b8bde7

zhangxffff marked this pull request as draft June 9, 2026 14:20

zhangxffff added 2 commits June 21, 2026 11:47

simplify code

3f4553e

simpilify

68e773a

zhangxffff marked this pull request as ready for review June 23, 2026 12:24

zhangxffff requested review from guhaiyan0221 and kexianda June 24, 2026 01:49

guhaiyan0221 reviewed Jun 24, 2026


zhangxffff added 2 commits June 30, 2026 08:09

Add DenseRow bounds validation

3904102

Merge branch 'main' into feat/lazy_deserializer

82821d5

feat: optimize row based shuffle with DenseRow #637

zhangxffff wants to merge 5 commits into bytedance:main from zhangxffff:feat/lazy_deserializer