Batch validator catch-up via an aggregated MissingCrossChainUpdates error #6573
Open
ma2bd wants to merge 7 commits into
…ated MissingCrossChainUpdates error.
…s instead of whole chains.
…breaks gap delivery (test_update_validator_sender_gaps).
…, leaving gaps intentionally.
02d4cf5 to
15fdc5f
Compare
ma2bd
commented
Jul 3, 2026
        }
    );
    if must_be_present && !was_present {
        missing_bundles.push((origin, bundle.height));
What could happen here is that a previous message from the same origin is "negative" in the inbox (due to a confirmed block consuming it). Currently, this older message is not reported as missing. (It would use more network data but could arguably make the work of [future] VM-free clients easier.)
Conversely, we are not trying to normalize the reported missing messages to include only the maximum height from each origin. (That would be a small optimization in line with the rest of the code.)
ma2bd
added a commit
that referenced
this pull request
Jul 3, 2026
## Motivation

Follow-up to #6556 on `testnet_conway` (which already batches validator catch-up via the aggregated `MissingCrossChainUpdates` error). This tightens the **local-node** side of the proposal catch-up loop, which still downloaded the reported missing sender blocks one origin at a time.

## Proposal

In the proposal loop that brings the local node up to date after a `MissingCrossChainUpdates` (`client/mod.rs`):

- Drop the dead `if !bundles.is_empty()` guard (the error is only produced with a non-empty set).
- Deduplicate the reported bundles to the **highest missing height per origin**. `download_sender_block_with_sending_ancestors` walks each origin's message-bearing blocks back from the given height, so the max subsumes the lower ones.
- Download the independent origins **concurrently**, bounded by `max_joined_tasks`, instead of sequentially.

### What this PR deliberately does *not* do

An earlier revision also tried to make the **validator-push** catch-up "sparse", replacing `send_chain_info_up_to_heights` with `send_chain_info_at_heights` so it would send only the specifically-reported sender heights. That was **reverted** because it breaks the "leave gaps on the validator side" design (#4181), as caught by `test_update_validator_sender_gaps`:

- `send_chain_info_up_to_heights` sources the blocks to send from the client's **local storage** (all message-bearing blocks up to the target); for a hub that only receives from a sender, that's already exactly the message-bearing blocks and nothing else, so it's already sparse in practice.
- `send_chain_info_at_heights` sourced them from the **error's reported heights**, which by design omit any message-bearing **ancestor** whose bundle was consumed by an earlier block. A gap block's bundle can only be scheduled once its `previous_message_block` has been executed on the validator, so dropping the ancestor leaves the bundle undeliverable and the retry fails.

The push path is therefore left on `send_chain_info_up_to_heights`, and a memory-level regression test (`test_proposal_catch_up_with_sender_gap`) is added to lock this in.

## Test Plan

- `test_proposal_catch_up_with_sender_gap` (new): a sender that messages the recipient at heights 0 and 2 but not 1, consumed across two recipient blocks, with a lagging validator that must be caught up across the gap. Fails with the reverted sparse push, passes with `send_chain_info_up_to_heights`.
- `test_proposal_batches_missing_dependency_catch_up` and `test_handle_block_proposal_with_incoming_bundles` (aggregated-error coverage) pass, as does the broader cross-chain/inbox suite.
- `cargo clippy -p linera-core --lib --tests` and `cargo +nightly fmt` are clean.

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
    - be released in a new SDK,
    - be released in a validator hotfix.

## Links

- Follow-up to #6556. Main counterpart: #6573.
- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
## Motivation

Port of #6556 (which targets `testnet_conway`) to `main`.

A client that owns a busy hub chain (one receiving cross-chain messages from many sender chains) periodically froze for minutes and tripped the chain-idle liveness probe, restarting the process. Root cause: when a validator is behind on the sender chains a block consumes, it rejects the proposal one missing sender at a time (`MissingCrossChainUpdate`), and the client brings it up to date with one blocking round-trip per sender. Accepting a single high-fan-out block therefore takes ~N sequential `propose → sync-one-sender → retry` round-trips per validator, and because per-chain block production is a single task gated on a quorum, this serializes into multi-minute stalls (observed `process_inbox` p99 ≈ 8.7 min during a freeze).

## Proposal

**Aggregate the error.** Replace the per-sender `MissingCrossChainUpdate { chain_id, origin, height }` with an aggregated `MissingCrossChainUpdates { chain_id, bundles: Vec<(ChainId, BlockHeight)> }`:

- … (`updater.rs` and the local-node pull path in `client/mod.rs`) fetches the whole set in one batch and retries once, instead of one rejection/round-trip per sender.
- … `testnet_conway` PR (Batch validator catch-up via an aggregated MissingCrossChainUpdates error #6556), this version makes a clean break: no `supports_aggregated_missing` capability flag and no gRPC-boundary downgrade to the legacy per-sender error. The legacy `MissingCrossChainUpdate` variant is replaced outright, so this requires a coordinated validator/client deployment.

**Local-node pull loop.** In the proposal catch-up loop that brings the local node up to date, drop the dead `if !bundles.is_empty()` guard, deduplicate the reported bundles to the highest missing height per origin (`download_sender_block_with_sending_ancestors` walks each origin's message-bearing blocks back from the given height, so the max subsumes the lower ones), and download the independent origins concurrently, bounded by `max_joined_tasks`.

Scope is proposals only, matching #6556. Confirmed certificates don't have the same pathology (missing bundles are tolerated), so that path is left unchanged.
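
The deduplication step of the local-node pull loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `client/mod.rs`: `ChainId` and `BlockHeight` are simplified stand-in newtypes, and `highest_missing_per_origin` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

// Stand-in types for illustration only; the real `ChainId` and `BlockHeight`
// in the codebase are richer than these newtypes.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ChainId(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BlockHeight(u64);

/// Hypothetical helper: collapse the reported `(origin, height)` pairs
/// to the highest missing height for each origin.
fn highest_missing_per_origin(
    bundles: &[(ChainId, BlockHeight)],
) -> BTreeMap<ChainId, BlockHeight> {
    let mut highest = BTreeMap::new();
    for &(origin, height) in bundles {
        highest
            .entry(origin)
            // Keep the larger of the stored and the newly seen height.
            .and_modify(|h: &mut BlockHeight| *h = (*h).max(height))
            .or_insert(height);
    }
    highest
}

fn main() {
    let bundles = vec![
        (ChainId(1), BlockHeight(2)),
        (ChainId(2), BlockHeight(5)),
        (ChainId(1), BlockHeight(7)),
    ];
    // One download target per origin instead of one per reported bundle.
    for (origin, height) in &highest_missing_per_origin(&bundles) {
        println!("download {:?} up to {:?}", origin, height);
    }
}
```

Each resulting `(origin, height)` pair is an independent download target, which is what allows the origins to be fetched concurrently.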
### On the validator-push catch-up

The push path keeps using `send_chain_info_up_to_heights`. It computes a nominally contiguous fill range `[validator_next_height .. target)` but reads those certificates from the client's local storage (`read_certificates_by_heights` → `.flatten()`), which silently drops any height the client doesn't hold. For a hub that merely receives from a sender, local storage only ever contains the message-bearing sender blocks (fetched via the received-log sync), never the non-message blocks in between, so in practice only those go out, executing the message-bearing prefix and preprocessing the gap block. (This is incidental to the hub not tracking the sender; it is not a gap-skip guarantee of the function.)

An experiment to make this explicitly sparse by sending only the specifically-reported heights (`send_chain_info_at_heights`) was reverted: it derives the set from the error's reported heights (just the top gap block) rather than from local storage, so it drops the message-bearing ancestor block the gap block depends on. A gap block's bundle can only be scheduled once its `previous_message_block` has been executed on the validator (#4181), so sending only the top reported height leaves the bundle undeliverable and the retry fails. `test_proposal_catch_up_with_sender_gap` (added here) reproduces this at the memory level and locks in the correct behavior.

## Test Plan

- `test_proposal_batches_missing_cross_chain_update_catch_up` drives the aggregated push path end to end (a proposal consuming bundles from three senders, to a validator behind on all of them).
- `test_handle_block_proposal_reports_all_missing_bundles` asserts that a proposal consuming missing bundles from two senders is rejected with a single `MissingCrossChainUpdates` listing both.
- `test_proposal_catch_up_with_sender_gap` (new) reproduces the gap case: a sender that messages the recipient at heights 0 and 2 but not 1, consumed across two recipient blocks, with a lagging validator that must be caught up across the gap.
- … (`test_cross_chain_message_chunking_end_to_end`, `test_prepare_chain_with_cross_chain_messages`, `test_propose_block_with_messages_and_blobs`, `test_handle_block_proposal_sparse_chain`).
- The `linera-rpc` wire-format snapshot is updated for the reshaped `NodeError` variant.
- `cargo clippy` (incl. `--features server`) and `cargo +nightly fmt` are clean.

## Release Plan

Already backported to testnet with #6556.
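
As a supplementary note on the validator-push discussion above, the difference between the two push strategies can be illustrated with a toy model. Everything here is an assumption made for illustration: `SenderBlock`, `read_held_blocks`, and `deliver` are hypothetical names, heights are plain `u64` values, and the validator is reduced to a set of executed heights. It is a sketch of the reasoning, not of the real `send_chain_info_*` functions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy message-bearing sender block. `previous_message_block` is the height
/// of the previous block that sent a message to the same recipient, if any.
#[derive(Clone, Copy, Debug)]
struct SenderBlock {
    height: u64,
    previous_message_block: Option<u64>,
}

/// Reads the requested heights from local storage, silently skipping heights
/// that are not held locally (mirroring the `.flatten()` step in the text).
fn read_held_blocks(
    storage: &BTreeMap<u64, SenderBlock>,
    heights: impl Iterator<Item = u64>,
) -> Vec<SenderBlock> {
    heights.map(|h| storage.get(&h).copied()).flatten().collect()
}

/// Toy validator: a block's bundle is deliverable only once its
/// `previous_message_block` has been executed. Returns the delivered heights.
fn deliver(executed: &mut BTreeSet<u64>, blocks: &[SenderBlock]) -> Vec<u64> {
    let mut delivered = Vec::new();
    for block in blocks {
        let ready = block
            .previous_message_block
            .map_or(true, |prev| executed.contains(&prev));
        if ready {
            executed.insert(block.height);
            delivered.push(block.height);
        }
    }
    delivered
}

fn main() {
    // The sender messages the recipient at heights 0 and 2 but not 1,
    // so the hub's local storage holds only blocks 0 and 2.
    let mut storage = BTreeMap::new();
    storage.insert(0, SenderBlock { height: 0, previous_message_block: None });
    storage.insert(2, SenderBlock { height: 2, previous_message_block: Some(0) });

    // "Up to heights": request the whole range; height 1 is dropped on read.
    let up_to = read_held_blocks(&storage, 0..3);
    let mut executed = BTreeSet::new();
    println!("up-to push delivers {:?}", deliver(&mut executed, &up_to));

    // "At heights": send only the reported top height 2, without its ancestor.
    let at = read_held_blocks(&storage, std::iter::once(2u64));
    let mut executed = BTreeSet::new();
    println!("at-heights push delivers {:?}", deliver(&mut executed, &at));
}
```

In this model the range-based push delivers both message-bearing blocks, while the push restricted to the reported height delivers nothing, because the ancestor at height 0 was never executed on the toy validator.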