Add observability for message and block rejections (backport of #6473)#6565
Open
ndr-ds wants to merge 2 commits into
Open
Add observability for message and block rejections (backport of #6473)#6565ndr-ds wants to merge 2 commits into
ndr-ds wants to merge 2 commits into
Conversation
We have almost no observability when messages or blocks are rejected. Most rejection paths are silent (no log, no metric), so rejection rates can't be measured reliably and silent rejections are invisible in production. Closes linera-io#6459. Add logging at every currently-silent rejection decision site, plus rate-correct Prometheus counters, reusing the `error_type()` labelling convention established in linera-io#5951. - **Message rejections:** logs at the policy decision (`IncomingBundle::apply_policy`, INFO), the closed-chain auto-reject (`prepare_chain_info_response`, INFO), and the validator-side execution of a rejection (`BlockExecutionTracker`, DEBUG — see Log levels). The execution-failure path already logged. No new message counter — the rejection rate stays on the existing commit-time pair `rejected_bundle_count` / `incoming_bundle_count`; the reason (which can't be recovered from a committed block) lives in the logs. - **Validator proposal rejections:** `block_proposals_received_total` (unconditional) + `block_proposals_rejected_total{error_type}` at the worker's `handle_block_proposal`, plus a DEBUG log (see Log levels). `error_type` reuses `WorkerError::error_type()`, surfacing `WorkerError::InvalidOwner`, `ChainError::WrongRound`, etc. - **Client staging failures:** `block_staging_failures_total{error_type}` + INFO log in `ChainClient::execute_block`, reusing `execute_block_latency`'s count as the denominator. Adds `error_type()` to the client `Error` and `LocalNodeError`, delegating into `WorkerError` / `ChainError` exactly like `WorkerError::error_type` from Each rate lives within one observation point + process role, so numerator and denominator emit from the same site and metric names can't be summed across points. The three low-volume decision sites are INFO; the two high-volume sites (proposal rejected, message rejected) are DEBUG. Rationale from prod (`testnet-conway`, `rate(linera_server_request_error{method_name="handle_block_proposal"}[1h])`): - Proposal rejections run at **~10/sec cluster-wide** (~1.7/sec per shard), but **~92% is routine consensus churn** — `UnexpectedBlockHeight` (4.0/s), `InvalidTimestamp` (3.7/s), `MissingCrossChainUpdate` (1.6/s). The genuinely adversarial rejections (`InvalidOwner`, `InvalidSigner`, `WrongRound`, …) sum to **<0.05/sec**. This is why kept these at DEBUG. - The message-rejection site fires **per message × per executing node** and re-runs during confirmed-block execution and catch-up sync, so a syncing node would replay the entire rejection history into Loki as a burst. At INFO these two would emit ~10/sec of almost-entirely-liveness noise, contradicting the issue's own goal of making rejection data usable. DEBUG keeps them opt-in. The per-`reason` rates are fully preserved in the metrics (`block_proposals_rejected_total{error_type}`, etc.) — the logs are just the detail layer. - `cargo clippy --all-targets --all-features -- -D warnings` — clean. - `cargo clippy --no-default-features -- -D warnings` (linera-core, linera-chain) — clean. - `cargo doc --all-features` — clean. - New unit tests covering `error_type()` delegation and prefix fallback (`LocalNodeError -> WorkerError`, `Error -> LocalNodeError`, the `ChainClientError::` / `LocalNodeError::` fallbacks) — pass. - **NOT yet verified end-to-end on a running network.** Metric registration/increments and log emission are validated by inspection and compilation only; they follow the existing metric/log patterns in these files. A local-net run confirming the counters appear on `/metrics` is still pending. - These changes follow the usual release cycle (observability only; no protocol or storage format change). Recommend backporting to the latest `testnet` branch, since that is where the rejection-visibility gap is felt operationally — though this is not a bug fix. - Closes linera-io#6459 - Builds on linera-io#5951 (`error_type` labelling for server/proxy error metrics) - [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
afck
reviewed
Jun 29, 2026
| debug!( | ||
| chain_id = %self.chain_id, | ||
| origin = %incoming_bundle.origin, | ||
| "Rejecting incoming message" |
Contributor
There was a problem hiding this comment.
This is now printed whenever we execute a block containing a rejected message.
The other log, in data_types, has similar wording but is where we make the decision to reject a message.
There is at least one more place where we make that decision, I think, which should probably also be logged: When executing a message fails during staging.
Contributor
There was a problem hiding this comment.
I see you removed this one, but I think there are still other places where we decide to choose Action::Reject, and we should either log at all or none of those places?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Backport of #6473 to
testnet_conway. We have almost no observability when messages orblocks are rejected. This adds logging and rate-correct Prometheus counters at the
previously-silent rejection sites. The metrics are most useful on the deployed network
(
testnet_conway), where all the rejection traffic that justified the change lives.Proposal
Cherry-pick of #6473 (its squash commit), reusing the
error_type()labelling conventionfrom #5966 (the
testnet_conwaybackport of #5951).apply_policy, INFO), theclosed-chain auto-reject (
prepare_chain_info_response, INFO), and the validator-sideexecution of a rejection (
block_tracker, DEBUG). No new message counter.block_proposals_received_total+block_proposals_rejected_total{error_type}athandle_block_proposal, plus a DEBUG log.block_staging_failures_total{error_type}+ INFO log inexecute_block.The two highest-volume sites (proposal rejected, message rejected) are DEBUG, not INFO:
on conway they run at ~10/sec, ~92% routine consensus churn (
UnexpectedBlockHeight,InvalidTimestamp,MissingCrossChainUpdate); genuinely adversarial rejections are<0.05/sec. The three low-volume sites are INFO. Per-reason rates live in the metrics.
Test Plan
main: conway lacks main's
impl From<ExecutionError> for LocalNodeError(kept conway'slayout, added only
error_type()), and refactored the closed-chain loop to useself.chain/origins_and_inboxes(kept conway's structure, added theis_closedcapture for the closed-chain log).
cargo clippy -p linera-core -p linera-chain --features metrics -- -D warnings— clean.error_type()delegation unit tests — pass.cargo fmt— no changes.Release Plan
testnetbranch;the change reaches validators on the next deploy after merge.
Links
error_typelabelling from [testnet] Add error_type label to server and proxy error metrics #5966 (originally Add error_type label to server and proxy error metrics #5951 onmain)