
Batch validator catch-up via an aggregated MissingCrossChainUpdates error #6573

Open: ma2bd wants to merge 7 commits into main from ma2bd/batch-cross-chain-updates-main
@ma2bd ma2bd commented Jun 30, 2026

Motivation

Port of #6556 (which targets testnet_conway) to main.

A client that owns a busy hub chain (one receiving cross-chain messages from many sender chains) periodically froze for minutes and tripped the chain-idle liveness probe, restarting the process. Root cause: when a validator is behind on the sender chains a block consumes, it rejects the proposal one missing sender at a time (MissingCrossChainUpdate), and the client brings it up to date with one blocking round-trip per sender. Accepting a single high-fan-out block therefore takes ~N sequential propose → sync-one-sender → retry round-trips per validator, and because per-chain block production is a single task gated on a quorum, this serializes into multi-minute stalls (observed process_inbox p99 ≈ 8.7 min during a freeze).

Proposal

Aggregate the error. Replace the per-sender MissingCrossChainUpdate { chain_id, origin, height } with an aggregated MissingCrossChainUpdates { chain_id, bundles: Vec<(ChainId, BlockHeight)> }:

  • When validating a proposal, the validator collects every missing cross-chain bundle instead of bailing on the first, and returns them together.
  • The client (both the validator-push path in updater.rs and the local-node pull path in client/mod.rs) fetches the whole set in one batch and retries once, instead of one rejection/round-trip per sender.
  • Unlike the testnet_conway PR (#6556), this version makes a clean break: there is no supports_aggregated_missing capability flag and no gRPC-boundary downgrade to the legacy per-sender error. The legacy MissingCrossChainUpdate variant is replaced outright, so this requires a coordinated validator/client deployment.
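
For illustration, the collect-everything behavior can be sketched as follows. This is a minimal sketch and not the code in this PR: `ChainId` and `BlockHeight` are simplified to integers, `check_incoming_bundles` and its `received` set are hypothetical stand-ins for the validator's inbox check, and only the shape of the `MissingCrossChainUpdates` variant is taken from the description above.

```rust
use std::collections::BTreeSet;

// Simplified stand-ins for the real types (assumption: the real ones are richer).
type ChainId = u32;
type BlockHeight = u64;

#[derive(Debug, PartialEq)]
enum NodeError {
    // Shape taken from the PR description: one error listing every missing bundle.
    MissingCrossChainUpdates {
        chain_id: ChainId,
        bundles: Vec<(ChainId, BlockHeight)>,
    },
}

/// Hypothetical helper: `consumed` lists the (origin, height) bundles a proposal
/// consumes, `received` is what this validator already holds.
fn check_incoming_bundles(
    chain_id: ChainId,
    consumed: &[(ChainId, BlockHeight)],
    received: &BTreeSet<(ChainId, BlockHeight)>,
) -> Result<(), NodeError> {
    // Collect every missing bundle instead of returning on the first one.
    let missing_bundles: Vec<(ChainId, BlockHeight)> = consumed
        .iter()
        .copied()
        .filter(|bundle| !received.contains(bundle))
        .collect();
    if missing_bundles.is_empty() {
        Ok(())
    } else {
        Err(NodeError::MissingCrossChainUpdates {
            chain_id,
            bundles: missing_bundles,
        })
    }
}
```

With this shape, a client facing a validator that is behind on N senders learns about all N in a single rejection, which is what allows one batched fetch and one retry.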

Local-node pull loop. In the proposal catch-up loop that brings the local node up to date, drop the dead if !bundles.is_empty() guard, deduplicate the reported bundles to the highest missing height per origin (download_sender_block_with_sending_ancestors walks each origin's message-bearing blocks back from the given height, so the max subsumes the lower ones), and download the independent origins concurrently, bounded by max_joined_tasks.
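
The deduplication step can be sketched as below. This is an illustrative sketch under simplified types, not the code in client/mod.rs: `highest_missing_height_per_origin` is a hypothetical name, and the concurrent download bounded by max_joined_tasks is deliberately left out.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real types.
type ChainId = u32;
type BlockHeight = u64;

/// Hypothetical helper: keep only the highest reported height for each origin.
/// Downloading from that height backwards covers the lower reported heights.
fn highest_missing_height_per_origin(
    bundles: &[(ChainId, BlockHeight)],
) -> BTreeMap<ChainId, BlockHeight> {
    let mut highest = BTreeMap::new();
    for &(origin, height) in bundles {
        highest
            .entry(origin)
            .and_modify(|current: &mut BlockHeight| *current = (*current).max(height))
            .or_insert(height);
    }
    highest
}
```

Each entry of the resulting map corresponds to one independent download, which is why the origins can then be fetched concurrently.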

Scope is proposals only, matching #6556. Confirmed certificates don't have the same pathology (missing bundles are tolerated), so that path is left unchanged.

On the validator-push catch-up

The push path keeps using send_chain_info_up_to_heights. It computes a nominally contiguous fill range [validator_next_height .. target) but reads those certificates from the client's local storage (read_certificates_by_heights.flatten()), which silently drops any height the client doesn't hold. For a hub that merely receives from a sender, local storage only ever contains the message-bearing sender blocks (fetched via the received-log sync), never the non-message blocks in between — so in practice only those go out, executing the message-bearing prefix and preprocessing the gap block. (This is incidental to the hub not tracking the sender; it is not a gap-skip guarantee of the function.)
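
To make the "silently drops" point concrete, here is a toy sketch of the pattern. The storage is modelled as a plain map and `read_by_heights` and `certificates_to_send` are hypothetical stand-ins; only the read-then-flatten shape is taken from the description above.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

type BlockHeight = u64;

/// Hypothetical stand-in for a local certificate store lookup:
/// one `Option` per requested height, `None` when the height is not held locally.
fn read_by_heights(
    storage: &BTreeMap<BlockHeight, String>,
    heights: Range<BlockHeight>,
) -> Vec<Option<String>> {
    heights.map(|height| storage.get(&height).cloned()).collect()
}

/// The nominal range is contiguous, but flattening the `Option`s drops every
/// height the client does not hold, without reporting an error.
fn certificates_to_send(
    storage: &BTreeMap<BlockHeight, String>,
    validator_next_height: BlockHeight,
    target: BlockHeight,
) -> Vec<String> {
    read_by_heights(storage, validator_next_height..target)
        .into_iter()
        .flatten()
        .collect()
}
```

If the store holds only heights 0 and 2, a request for the range 0..3 yields just those two certificates, which matches the sparse-in-practice behavior described above.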

An experiment to make this explicitly sparse by sending only the specifically-reported heights (send_chain_info_at_heights) was reverted: it derives the set from the error's reported heights (just the top gap block) rather than from local storage, so it drops the message-bearing ancestor block the gap block depends on. A gap block's bundle can only be scheduled once its previous_message_block has been executed on the validator (#4181), so sending only the top reported height leaves the bundle undeliverable and the retry fails. test_proposal_catch_up_with_sender_gap (added here) reproduces this at the memory level and locks in the correct behavior.
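
The dependency can be illustrated with a toy model. This is a sketch under strong simplifications and not the validator's actual scheduling logic: `SenderBlock` and `schedulable_heights` are hypothetical, and the only rule modelled is that a message-bearing block becomes usable once its previous_message_block has been processed.

```rust
use std::collections::BTreeSet;

type BlockHeight = u64;

/// Toy model of a message-bearing sender block and its link to the previous one.
struct SenderBlock {
    height: BlockHeight,
    previous_message_block: Option<BlockHeight>,
}

/// Hypothetical helper: returns the heights whose bundles can be scheduled on the
/// validator after it receives `sent`, given the heights it had already processed.
fn schedulable_heights(
    already_processed: &BTreeSet<BlockHeight>,
    sent: &[SenderBlock],
) -> BTreeSet<BlockHeight> {
    let mut processed = already_processed.clone();
    let mut ordered: Vec<&SenderBlock> = sent.iter().collect();
    ordered.sort_by_key(|block| block.height);
    for block in ordered {
        // A block is usable only if it has no message-bearing ancestor,
        // or that ancestor has already been processed.
        let ready = match block.previous_message_block {
            None => true,
            Some(previous) => processed.contains(&previous),
        };
        if ready {
            processed.insert(block.height);
        }
    }
    processed
}
```

In this model, sending only the block at height 2 (whose previous_message_block is 0) to a validator that has processed nothing leaves it unusable, while sending heights 0 and 2 together makes both available. That is the failure mode of the reverted sparse push.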

Test Plan

  • test_proposal_batches_missing_cross_chain_update_catch_up drives the aggregated push path end to end (a proposal consuming bundles from three senders, to a validator behind on all of them).
  • test_handle_block_proposal_reports_all_missing_bundles asserts a proposal consuming missing bundles from two senders is rejected with a single MissingCrossChainUpdates listing both.
  • test_proposal_catch_up_with_sender_gap (new) reproduces the gap case: a sender that messages the recipient at heights 0 and 2 but not 1, consumed across two recipient blocks, with a lagging validator that must be caught up across the gap.
  • Broader cross-chain/inbox suite passes (test_cross_chain_message_chunking_end_to_end, test_prepare_chain_with_cross_chain_messages, test_propose_block_with_messages_and_blobs, test_handle_block_proposal_sparse_chain).
  • The linera-rpc wire-format snapshot is updated for the reshaped NodeError variant.
  • cargo clippy (incl. --features server) and cargo +nightly fmt are clean.

Release Plan

Already backported to testnet with #6556.

@ma2bd ma2bd marked this pull request as ready for review June 30, 2026 19:01
@ma2bd ma2bd force-pushed the ma2bd/batch-cross-chain-updates-main branch from 02d4cf5 to 15fdc5f Compare July 2, 2026 23:55
Comment thread linera-chain/src/chain.rs
}
);
if must_be_present && !was_present {
missing_bundles.push((origin, bundle.height));

What could happen here is that a previous message from the same origin is "negative" in the inbox (due to a confirmed block consuming it). Currently, this older message is not reported as missing. (It would use more network data but could arguably make the work of [future] VM-free clients easier.)

Conversely, we are not trying to normalize the reported missing messages to include only the maximum height from each origin. (It would be a small optimization, in line with the rest of the code.)

ma2bd added a commit that referenced this pull request Jul 3, 2026
## Motivation

Follow-up to #6556 on `testnet_conway` (which already batches validator
catch-up via the aggregated `MissingCrossChainUpdates` error). This
tightens the **local-node** side of the proposal catch-up loop, which
still downloaded the reported missing sender blocks one origin at a
time.

## Proposal

In the proposal loop that brings the local node up to date after a
`MissingCrossChainUpdates` (`client/mod.rs`):

- Drop the dead `if !bundles.is_empty()` guard (the error is only
produced with a non-empty set).
- Deduplicate the reported bundles to the **highest missing height per
origin**. `download_sender_block_with_sending_ancestors` walks each
origin's message-bearing blocks back from the given height, so the max
subsumes the lower ones.
- Download the independent origins **concurrently**, bounded by
`max_joined_tasks`, instead of sequentially.

### What this PR deliberately does *not* do

An earlier revision also tried to make the **validator-push** catch-up
"sparse" — replacing `send_chain_info_up_to_heights` with
`send_chain_info_at_heights` so it would send only the
specifically-reported sender heights. That was **reverted** because it
breaks the "leave gaps on the validator side" design (#4181), as caught
by `test_update_validator_sender_gaps`:

- `send_chain_info_up_to_heights` sources the blocks to send from the
client's **local storage** (all message-bearing blocks up to the
target); for a hub that only receives from a sender, that's already
exactly the message-bearing blocks and nothing else, so it's already
sparse in practice.
- `send_chain_info_at_heights` sourced them from the **error's reported
heights**, which — by design — omit any message-bearing **ancestor**
whose bundle was consumed by an earlier block. A gap block's bundle can
only be scheduled once its `previous_message_block` has been executed on
the validator, so dropping the ancestor leaves the bundle undeliverable
and the retry fails.

The push path is therefore left on `send_chain_info_up_to_heights`, and
a memory-level regression test
(`test_proposal_catch_up_with_sender_gap`) is added to lock this in.

## Test Plan

- `test_proposal_catch_up_with_sender_gap` (new): a sender that messages
the recipient at heights 0 and 2 but not 1, consumed across two
recipient blocks, with a lagging validator that must be caught up across
the gap. Fails with the reverted sparse push, passes with
`send_chain_info_up_to_heights`.
- `test_proposal_batches_missing_dependency_catch_up` and
`test_handle_block_proposal_with_incoming_bundles` (aggregated-error
coverage) pass, as does the broader cross-chain/inbox suite.
- `cargo clippy -p linera-core --lib --tests` and `cargo +nightly fmt`
are clean.

## Release Plan

- These changes should be backported to the latest `testnet` branch,
then
    - be released in a new SDK,
    - be released in a validator hotfix.

## Links

- Follow-up to #6556. Main counterpart: #6573.
- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)