Skip to content

fix(parquet): lazily load pending repdefs during page seek#625

Open
oarap wants to merge 1 commit into
bytedance:mainfrom
oarap:parquet-repdef-lazy-load
Open

fix(parquet): lazily load pending repdefs during page seek#625
oarap wants to merge 1 commit into
bytedance:mainfrom
oarap:parquet-repdef-lazy-load

Conversation

@oarap

@oarap oarap commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

Issue Number: close #624

Repeated Parquet columns keep two streams of state in sync:

  1. the value pages that contain the actual leaf values, and
  2. the repetition/definition levels that describe how those leaf values map back to top-level rows.

For chunk-level repetition/definition levels, PageReader materializes per-page leaf counts in numLeavesInPage_ and uses pageIndex_ to look up the count for the current non-top-level data page.

Sparse selective reads can make the value reader reach a later physical page while numLeavesInPage_ only contains metadata for pages decoded from earlier rep/def batches. This does not necessarily mean the file
or seek is invalid: preloadRepDefs() may already have staged more rep/def batches in preloadedRepDefs_, but those batches have not yet been decoded into numLeavesInPage_.

The old code checked pageIndex_ against numLeavesInPage_ immediately and could fail with:

Seeking past known repdefs for non top level column page N

This PR fixes that by lazily decoding pending rep/def batches before enforcing the existing bounds check.

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

This PR updates PageReader::setPageRowInfo() for non-top-level repeated columns.

Before this change, PageReader advanced pageIndex_ and immediately checked whether the current page was already covered by numLeavesInPage_.

That was too eager when preloadedRepDefs_ was non-empty. In that case, the metadata may already be available but still encoded in a pending rep/def batch.

The new logic decodes pending rep/def batches until either:

  1. the current pageIndex_ is covered by numLeavesInPage_, or
  2. there are no pending rep/def batches left.

The original bounds check remains in place, so truly inconsistent state still fails. The reader only recovers when the needed metadata was already preloaded.

The fix is:

  while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&
         !preloadedRepDefs_.empty()) {
    loadMoreRepDefs();
  }

This PR also adds regression coverage:

  1. A focused PageReader regression test that constructs the exact problematic state:
    • one known page in numLeavesInPage_,
    • the next page requested by pageIndex_,
    • one pending encoded rep/def batch in preloadedRepDefs_.
    Without the lazy load, this test throws the same Seeking past known repdefs error. With the fix, the pending batch is materialized and the page row info is populated.
  2. An end-to-end reader scenario with ARRAY<VARCHAR>, small Parquet pages, batched rep/def settings, and a sparse filter to exercise the same reader path through public APIs.

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a native Parquet reader failure where sparse reads of repeated columns could seek past currently materialized rep/def page metadata even though more rep/def batches were already preloaded.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

Repeated Parquet columns keep two streams of state in sync: the value
pages that contain the actual leaf values, and the repetition/definition
levels that describe how those leaf values map back to top-level rows.
For chunk-level rep/defs, PageReader materializes per-page leaf counts in
numLeavesInPage_ and uses pageIndex_ to look up the count for the current
non-top-level data page.

Sparse selective reads can make the value reader reach a later physical
page while numLeavesInPage_ only contains metadata for the pages decoded
from earlier rep/def batches. That does not necessarily mean the file or
seek is invalid: preloadRepDefs() may already have staged more rep/def
batches in preloadedRepDefs_, but those batches have not yet been decoded
into numLeavesInPage_.

The old code checked pageIndex_ against numLeavesInPage_ immediately and
could fail with:

  Seeking past known repdefs for non top level column page N

This was too eager when preloadedRepDefs_ was non-empty. Before enforcing
the existing bounds check, decode pending rep/def batches until either the
current page index is covered or there are no pending batches left. The
original check remains in place, so truly inconsistent state still fails;
the reader only recovers when the needed metadata was already preloaded.

Add a focused PageReader regression test that constructs this exact state:
one known page in numLeavesInPage_, the next page requested by pageIndex_,
and one pending encoded rep/def batch. Without the lazy load, the test
throws the same "Seeking past known repdefs" error. With the fix, the
pending batch is materialized and the page row info is populated.

Also add an end-to-end reader scenario with ARRAY<VARCHAR>, small Parquet
pages, batched rep/def settings, and a sparse filter to exercise the same
reader path through public APIs.

Validation:
  cmake --build _build/Release --target bolt_dwio_parquet_reader_test
  _build/Release/bolt/dwio/parquet/tests/reader/bolt_dwio_parquet_reader_test \
    --gtest_filter=ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck:ParquetReaderTest.sparseSkipAcrossRepeatedPagesWithBatchedRepDefs
@oarap oarap marked this pull request as ready for review June 3, 2026 20:52
// exactly when only the already-consumed rep/def page counts are present in
// numLeavesInPage_. If more preloaded rep/def batches are available, decode
// them here instead of failing the scan immediately.
while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix — I hit the exact same crash in production and arrived at a similar conclusion.

I considered an alternative placement: doing the loadMoreRepDefs() loop in seekToPage(),
right before readPageHeader() (i.e., the caller side), rather than inside setPageRowInfo()
(the callee/crash site). I ended up going with that approach in #635

Trade-off I'd like to discuss:

Your approach (callee-side): More defensively robust — protects the CHECK_LT regardless
of how setPageRowInfo is reached. If a future code path forgets to preload, this still
saves the day.

Caller-side approach: Keeps setPageRowInfo() as a pure metadata lookup with no I/O side-effects. The invariant becomes: "seekToPage is the page lifecycle owner and must guarantee repdef coverage before entering the next page." This preserves the fail-fast semantics of BOLT_CHECK_LT for genuinely broken invariants.

My concern with the callee-side fix is that if a future path forgets to preload and there happen to be
leftover batches, the callee-side fix would silently mask the bug.

Not a blocker — both approaches fix the production crash correctly. Just wanted to flag
this trade-off for the discussion. Happy to hear your thoughts, and feel free to check
#635 for the alternative implementation.

EXPECT_EQ(0, rowReader->next(1000, result));
}

TEST_F(ParquetReaderTest, sparseSkipAcrossRepeatedPagesWithBatchedRepDefs) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI — I ran sparseSkipAcrossRepeatedPagesWithBatchedRepDefs on main without this patch
and it passes. The test doesn't seem to trigger the crash path on an unpatched build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Parquet reader can fail seeking past known repdefs for repeated columns

2 participants