Add observability for message and block rejections (backport of #6473) by ndr-ds · Pull Request #6565 · linera-io/linera-protocol

ndr-ds · 2026-06-26T19:03:02Z

Motivation

Backport of #6473 to testnet_conway. We have almost no observability when messages or
blocks are rejected. This adds logging and rate-correct Prometheus counters at the
previously-silent rejection sites. The metrics are most useful on the deployed network
(testnet_conway), where all the rejection traffic that justified the change lives.

Proposal

Cherry-pick of #6473 (its squash commit), reusing the error_type() labelling convention
from #5966 (the testnet_conway backport of #5951).

Message rejections: logs at the policy decision (apply_policy, INFO), the
closed-chain auto-reject (prepare_chain_info_response, INFO), and the validator-side
execution of a rejection (block_tracker, DEBUG). No new message counter.
Validator proposal rejections: block_proposals_received_total +
block_proposals_rejected_total{error_type} at handle_block_proposal, plus a DEBUG log.
Client staging failures: block_staging_failures_total{error_type} + INFO log in
execute_block.
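As a rough illustration of the counter pairing listed above, the sketch below uses a plain in-memory map as a stand-in for a Prometheus counter family. `LabelledCounter`, `ProposalMetrics`, and `observe` are hypothetical names rather than the PR's code; only the metric names and the `error_type` label values come from the description. The point is that the received counter is incremented unconditionally at the same site that increments the rejected counter, so the two can be divided safely.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a Prometheus counter family with one label
/// (hypothetical; the PR registers real Prometheus counters instead).
#[derive(Default)]
struct LabelledCounter {
    counts: BTreeMap<String, u64>,
}

impl LabelledCounter {
    /// Increments the series for the given label value.
    fn inc(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Current value of one labelled series (0 if never incremented).
    fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }

    /// Sum over all label values.
    fn total(&self) -> u64 {
        self.counts.values().sum()
    }
}

/// Hypothetical metrics for a single observation site, mirroring
/// block_proposals_received_total and block_proposals_rejected_total{error_type}.
#[derive(Default)]
struct ProposalMetrics {
    received_total: u64,
    rejected_total: LabelledCounter,
}

impl ProposalMetrics {
    /// Records one handled proposal; `Err` carries the error_type label.
    fn observe(&mut self, outcome: Result<(), &str>) {
        // Denominator: counted for every proposal, accepted or not.
        self.received_total += 1;
        // Numerator: counted only on rejection, at the same site.
        if let Err(error_type) = outcome {
            self.rejected_total.inc(error_type);
        }
    }

    /// Fraction of received proposals that were rejected.
    fn rejection_ratio(&self) -> f64 {
        if self.received_total == 0 {
            return 0.0;
        }
        self.rejected_total.total() as f64 / self.received_total as f64
    }
}

fn main() {
    let mut metrics = ProposalMetrics::default();
    metrics.observe(Ok(()));
    metrics.observe(Err("ChainError::WrongRound"));
    metrics.observe(Err("WorkerError::InvalidOwner"));
    metrics.observe(Ok(()));
    println!("received: {}", metrics.received_total);
    println!(
        "rejected (WrongRound): {}",
        metrics.rejected_total.get("ChainError::WrongRound")
    );
    println!("rejection ratio: {}", metrics.rejection_ratio());
}
```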

The two highest-volume sites (proposal rejected, message rejected) are DEBUG, not INFO:
on conway they run at ~10/sec, ~92% routine consensus churn (UnexpectedBlockHeight,
InvalidTimestamp, MissingCrossChainUpdate); genuinely adversarial rejections are
<0.05/sec. The three low-volume sites are INFO. Per-reason rates live in the metrics.
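To make "per-reason rates live in the metrics" concrete, here is a minimal sketch of how a per-second rate per `error_type` can be derived from two readings of a labelled counter. The function name `per_reason_rate`, the `Sample` alias, and all sample values are invented for illustration, and the calculation is a simplification of what a Prometheus `rate(...)` query does (it assumes no counter reset inside the window).

```rust
use std::collections::BTreeMap;

/// Counter readings keyed by `error_type` label at one scrape instant.
type Sample = BTreeMap<&'static str, u64>;

/// Per-second increase of each labelled counter between two scrapes.
/// Simplification: assumes the counters did not reset inside the window.
fn per_reason_rate(
    earlier: &Sample,
    later: &Sample,
    window_secs: u64,
) -> BTreeMap<&'static str, f64> {
    later
        .iter()
        .map(|(label, &now)| {
            // A label missing from the earlier scrape is treated as starting at 0.
            let before = earlier.get(label).copied().unwrap_or(0);
            (*label, now.saturating_sub(before) as f64 / window_secs as f64)
        })
        .collect()
}

fn main() {
    // Hypothetical readings taken one hour apart; the values are invented.
    let earlier: Sample =
        BTreeMap::from([("UnexpectedBlockHeight", 1_000), ("WrongRound", 10)]);
    let later: Sample =
        BTreeMap::from([("UnexpectedBlockHeight", 15_400), ("WrongRound", 46)]);

    for (reason, rate) in per_reason_rate(&earlier, &later, 3600) {
        println!("{reason}: {rate:.2}/s");
    }
}
```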

Test Plan

Cherry-picked #6473's squash commit. Conflicts resolved where conway has diverged from
main: conway lacks main's impl From<ExecutionError> for LocalNodeError (kept conway's
layout, added only error_type()), and refactored the closed-chain loop to use
self.chain / origins_and_inboxes (kept conway's structure, added the is_closed
capture for the closed-chain log).
cargo clippy -p linera-core -p linera-chain --features metrics -- -D warnings — clean.
error_type() delegation unit tests — pass.
cargo fmt — no changes.
Full CI on this PR.

Release Plan

Observability only; no protocol or storage format change. This is the testnet branch;
the change reaches validators on the next deploy after merge.

Links

Backport of #6473 (Add observability for message and block rejections)
error_type labelling from #5966 ([testnet] Add error_type label to server and proxy error metrics), originally #5951 on main

Motivation

We have almost no observability when messages or blocks are rejected. Most rejection paths are silent (no log, no metric), so rejection rates can't be measured reliably and silent rejections are invisible in production. Closes linera-io#6459.

Proposal

Add logging at every currently-silent rejection decision site, plus rate-correct Prometheus counters, reusing the `error_type()` labelling convention established in linera-io#5951.

- **Message rejections:** logs at the policy decision (`IncomingBundle::apply_policy`, INFO), the closed-chain auto-reject (`prepare_chain_info_response`, INFO), and the validator-side execution of a rejection (`BlockExecutionTracker`, DEBUG; see Log levels). The execution-failure path already logged. No new message counter: the rejection rate stays on the existing commit-time pair `rejected_bundle_count` / `incoming_bundle_count`; the reason (which can't be recovered from a committed block) lives in the logs.
- **Validator proposal rejections:** `block_proposals_received_total` (unconditional) + `block_proposals_rejected_total{error_type}` at the worker's `handle_block_proposal`, plus a DEBUG log (see Log levels). `error_type` reuses `WorkerError::error_type()`, surfacing `WorkerError::InvalidOwner`, `ChainError::WrongRound`, etc.
- **Client staging failures:** `block_staging_failures_total{error_type}` + INFO log in `ChainClient::execute_block`, reusing `execute_block_latency`'s count as the denominator. Adds `error_type()` to the client `Error` and `LocalNodeError`, delegating into `WorkerError` / `ChainError` exactly like `WorkerError::error_type`.

Each rate lives within one observation point + process role, so numerator and denominator emit from the same site and metric names can't be summed across points.

Log levels

The three low-volume decision sites are INFO; the two high-volume sites (proposal rejected, message rejected) are DEBUG. Rationale from prod (`testnet-conway`, `rate(linera_server_request_error{method_name="handle_block_proposal"}[1h])`):

- Proposal rejections run at **~10/sec cluster-wide** (~1.7/sec per shard), but **~92% is routine consensus churn**: `UnexpectedBlockHeight` (4.0/s), `InvalidTimestamp` (3.7/s), `MissingCrossChainUpdate` (1.6/s). The genuinely adversarial rejections (`InvalidOwner`, `InvalidSigner`, `WrongRound`, …) sum to **<0.05/sec**. This is why these are kept at DEBUG.
- The message-rejection site fires **per message × per executing node** and re-runs during confirmed-block execution and catch-up sync, so a syncing node would replay the entire rejection history into Loki as a burst.

At INFO these two would emit ~10/sec of almost-entirely-liveness noise, contradicting the issue's own goal of making rejection data usable. DEBUG keeps them opt-in. The per-`reason` rates are fully preserved in the metrics (`block_proposals_rejected_total{error_type}`, etc.); the logs are just the detail layer.

Test Plan

- `cargo clippy --all-targets --all-features -- -D warnings`: clean.
- `cargo clippy --no-default-features -- -D warnings` (linera-core, linera-chain): clean.
- `cargo doc --all-features`: clean.
- New unit tests covering `error_type()` delegation and prefix fallback (`LocalNodeError -> WorkerError`, `Error -> LocalNodeError`, the `ChainClientError::` / `LocalNodeError::` fallbacks): pass.
- **NOT yet verified end-to-end on a running network.** Metric registration/increments and log emission are validated by inspection and compilation only; they follow the existing metric/log patterns in these files. A local-net run confirming the counters appear on `/metrics` is still pending.

Release Plan

- These changes follow the usual release cycle (observability only; no protocol or storage format change). Recommend backporting to the latest `testnet` branch, since that is where the rejection-visibility gap is felt operationally, though this is not a bug fix.

Links

- Closes linera-io#6459
- Builds on linera-io#5951 (`error_type` labelling for server/proxy error metrics)
- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
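The `error_type()` delegation and prefix fallback described in the proposal can be sketched roughly as follows. The enums here are tiny, simplified stand-ins: `ClientError` stands for the client `Error`, and the variants `InactiveChain` and `Other` are hypothetical. Only the label strings `WorkerError::InvalidOwner` and `ChainError::WrongRound` and the `ChainClientError::` / `LocalNodeError::` prefixes come from the text above. The actual implementation may derive variant names differently; this sketch leans on `Debug`, which only yields a clean name for fieldless variants.

```rust
/// Simplified stand-in for the chain-level error (real enum has more variants).
#[derive(Debug)]
enum ChainError {
    WrongRound,
}

/// Simplified stand-in for the worker error.
#[derive(Debug)]
enum WorkerError {
    InvalidOwner,
    Chain(ChainError),
}

/// Simplified stand-in; `InactiveChain` is a hypothetical variant.
#[derive(Debug)]
enum LocalNodeError {
    Worker(WorkerError),
    InactiveChain,
}

/// Stand-in for the client `Error`; `Other` is a hypothetical variant.
#[derive(Debug)]
enum ClientError {
    LocalNode(LocalNodeError),
    Other,
}

impl ChainError {
    fn error_type(&self) -> String {
        // Debug prints just the variant name for fieldless variants.
        format!("ChainError::{:?}", self)
    }
}

impl WorkerError {
    fn error_type(&self) -> String {
        match self {
            // Delegate so the label names the innermost cause.
            WorkerError::Chain(inner) => inner.error_type(),
            other => format!("WorkerError::{:?}", other),
        }
    }
}

impl LocalNodeError {
    fn error_type(&self) -> String {
        match self {
            WorkerError_ @ LocalNodeError::Worker(_) => match WorkerError_ {
                LocalNodeError::Worker(inner) => inner.error_type(),
                // Unreachable in practice; falls back to the prefixed label.
                other => format!("LocalNodeError::{:?}", other),
            },
            // Prefix fallback for variants with nothing to delegate to.
            other => format!("LocalNodeError::{:?}", other),
        }
    }
}

impl ClientError {
    fn error_type(&self) -> String {
        match self {
            ClientError::LocalNode(inner) => inner.error_type(),
            // Prefix fallback named in the test plan.
            other => format!("ChainClientError::{:?}", other),
        }
    }
}

fn main() {
    let nested = ClientError::LocalNode(LocalNodeError::Worker(WorkerError::Chain(
        ChainError::WrongRound,
    )));
    println!("{}", nested.error_type());
    println!("{}", ClientError::Other.error_type());
}
```

Delegating to the innermost error keeps label cardinality tied to the underlying cause, so the same `ChainError::WrongRound` label appears whether the failure is observed at the worker or surfaced through the client.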

afck · 2026-06-29T08:11:32Z

+                debug!(
+                    chain_id = %self.chain_id,
+                    origin = %incoming_bundle.origin,
+                    "Rejecting incoming message"


This is now printed whenever we execute a block containing a rejected message.

The other log, in data_types, has similar wording but is where we make the decision to reject a message.

There is at least one more place where we make that decision, I think, which should probably also be logged: When executing a message fails during staging.

I see you removed this one, but I think there are still other places where we decide to choose Action::Reject, and we should log at either all of those places or none of them?

ndr-ds marked this pull request as ready for review June 26, 2026 23:13

afck reviewed Jun 29, 2026

Drop redundant execution-site rejection log in block_tracker

b90db49

ndr-ds requested review from afck, deuszx and ma2bd and removed request for afck July 1, 2026 23:40

ndr-ds wants to merge 2 commits into linera-io:testnet_conway from ndr-ds:ndr-ds/rejection-observability-conway

afck Jul 2, 2026

2 participants
