fix(parquet): lazily load pending repdefs during page seek by oarap · Pull Request #625 · bytedance/bolt

oarap · 2026-06-02T19:08:08Z

What problem does this PR solve?

Issue Number: close #624

Repeated Parquet columns keep two streams of state in sync:

the value pages that contain the actual leaf values, and
the repetition/definition levels that describe how those leaf values map back to top-level rows.

For chunk-level repetition/definition levels, PageReader materializes per-page leaf counts in numLeavesInPage_ and uses pageIndex_ to look up the count for the current non-top-level data page.

Sparse selective reads can make the value reader reach a later physical page while numLeavesInPage_ only contains metadata for pages decoded from earlier rep/def batches. This does not necessarily mean the file
or seek is invalid: preloadRepDefs() may already have staged more rep/def batches in preloadedRepDefs_, but those batches have not yet been decoded into numLeavesInPage_.

The old code checked pageIndex_ against numLeavesInPage_ immediately and could fail with:

Seeking past known repdefs for non top level column page N

This PR fixes that by lazily decoding pending rep/def batches before enforcing the existing bounds check.

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

This PR updates PageReader::setPageRowInfo() for non-top-level repeated columns.

Before this change, PageReader advanced pageIndex_ and immediately checked whether the current page was already covered by numLeavesInPage_.

That was too eager when preloadedRepDefs_ was non-empty. In that case, the metadata may already be available but still encoded in a pending rep/def batch.

The new logic decodes pending rep/def batches until either:

the current pageIndex_ is covered by numLeavesInPage_, or
there are no pending rep/def batches left.

The original bounds check remains in place, so truly inconsistent state still fails. The reader only recovers when the needed metadata was already preloaded.

The fix is:

  while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&
         !preloadedRepDefs_.empty()) {
    loadMoreRepDefs();
  }

This PR also adds regression coverage:

A focused PageReader regression test that constructs the exact problematic state:
• one known page in numLeavesInPage_,
• the next page requested by pageIndex_,
• one pending encoded rep/def batch in preloadedRepDefs_.
Without the lazy load, this test throws the same Seeking past known repdefs error. With the fix, the pending batch is materialized and the page row info is populated.
An end-to-end reader scenario with ARRAY<VARCHAR>, small Parquet pages, batched rep/def settings, and a sparse filter to exercise the same reader path through public APIs.

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Click to view Benchmark Results

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a native Parquet reader failure where sparse reads of repeated columns could seek past currently materialized rep/def page metadata even though more rep/def batches were already preloaded.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

Click to view Breaking Changes

Breaking Changes:
- Description of the breaking change.
- Possible solutions or workarounds.
- Any other relevant information.

Repeated Parquet columns keep two streams of state in sync: the value pages that contain the actual leaf values, and the repetition/definition levels that describe how those leaf values map back to top-level rows. For chunk-level rep/defs, PageReader materializes per-page leaf counts in numLeavesInPage_ and uses pageIndex_ to look up the count for the current non-top-level data page. Sparse selective reads can make the value reader reach a later physical page while numLeavesInPage_ only contains metadata for the pages decoded from earlier rep/def batches. That does not necessarily mean the file or seek is invalid: preloadRepDefs() may already have staged more rep/def batches in preloadedRepDefs_, but those batches have not yet been decoded into numLeavesInPage_. The old code checked pageIndex_ against numLeavesInPage_ immediately and could fail with: Seeking past known repdefs for non top level column page N This was too eager when preloadedRepDefs_ was non-empty. Before enforcing the existing bounds check, decode pending rep/def batches until either the current page index is covered or there are no pending batches left. The original check remains in place, so truly inconsistent state still fails; the reader only recovers when the needed metadata was already preloaded. Add a focused PageReader regression test that constructs this exact state: one known page in numLeavesInPage_, the next page requested by pageIndex_, and one pending encoded rep/def batch. Without the lazy load, the test throws the same "Seeking past known repdefs" error. With the fix, the pending batch is materialized and the page row info is populated. Also add an end-to-end reader scenario with ARRAY<VARCHAR>, small Parquet pages, batched rep/def settings, and a sparse filter to exercise the same reader path through public APIs. Validation: cmake --build _build/Release --target bolt_dwio_parquet_reader_test _build/Release/bolt/dwio/parquet/tests/reader/bolt_dwio_parquet_reader_test \ --gtest_filter=ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck:ParquetReaderTest.sparseSkipAcrossRepeatedPagesWithBatchedRepDefs

JinyuanZhang617 · 2026-06-09T08:00:45Z

+    // exactly when only the already-consumed rep/def page counts are present in
+    // numLeavesInPage_. If more preloaded rep/def batches are available, decode
+    // them here instead of failing the scan immediately.
+    while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&


Nice fix — I hit the exact same crash in production and arrived at a similar conclusion.

I considered an alternative placement: doing the loadMoreRepDefs() loop in seekToPage(),
right before readPageHeader() (i.e., the caller side), rather than inside setPageRowInfo()
(the callee/crash site). I ended up going with that approach in #635

Trade-off I'd like to discuss:

Your approach (callee-side): More defensively robust — protects the CHECK_LT regardless
of how setPageRowInfo is reached. If a future code path forgets to preload, this still
saves the day.

Caller-side approach: Keeps setPageRowInfo() as a pure metadata lookup with no I/O side-effects. The invariant becomes: "seekToPage is the page lifecycle owner and must guarantee repdef coverage before entering the next page." This preserves the fail-fast semantics of BOLT_CHECK_LT for genuinely broken invariants.

My concern with the callee-side fix is that if a future path forgets to preload and there happen to be
leftover batches, the callee-side fix would silently mask the bug.

Not a blocker — both approaches fix the production crash correctly. Just wanted to flag
this trade-off for the discussion. Happy to hear your thoughts, and feel free to check
#635 for the alternative implementation.

JinyuanZhang617 · 2026-06-09T08:02:11Z

  EXPECT_EQ(0, rowReader->next(1000, result));
 }

+TEST_F(ParquetReaderTest, sparseSkipAcrossRepeatedPagesWithBatchedRepDefs) {


FYI — I ran sparseSkipAcrossRepeatedPagesWithBatchedRepDefs on main without this patch
and it passes. The test doesn't seem to trigger the crash path on an unpatched build.

oarap marked this pull request as ready for review June 3, 2026 20:52

JinyuanZhang617 reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(parquet): lazily load pending repdefs during page seek#625

fix(parquet): lazily load pending repdefs during page seek#625
oarap wants to merge 1 commit into
bytedance:mainfrom
oarap:parquet-repdef-lazy-load

oarap commented Jun 2, 2026 •

edited

Loading

Uh oh!

JinyuanZhang617 Jun 9, 2026

Uh oh!

JinyuanZhang617 Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

oarap commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Type of Change

Description

Performance Impact

Release Note

Checklist (For Author)

Breaking Changes

Uh oh!

JinyuanZhang617 Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

JinyuanZhang617 Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

oarap commented Jun 2, 2026 •

edited

Loading