Skip to content

Add observability for message and block rejections (backport of #6473)#6565

Open
ndr-ds wants to merge 2 commits into
linera-io:testnet_conwayfrom
ndr-ds:ndr-ds/rejection-observability-conway
Open

Add observability for message and block rejections (backport of #6473)#6565
ndr-ds wants to merge 2 commits into
linera-io:testnet_conwayfrom
ndr-ds:ndr-ds/rejection-observability-conway

Conversation

@ndr-ds

@ndr-ds ndr-ds commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Motivation

Backport of #6473 to testnet_conway. We have almost no observability when messages or
blocks are rejected. This adds logging and rate-correct Prometheus counters at the
previously-silent rejection sites. The metrics are most useful on the deployed network
(testnet_conway), where all the rejection traffic that justified the change lives.

Proposal

Cherry-pick of #6473 (its squash commit), reusing the error_type() labelling convention
from #5966 (the testnet_conway backport of #5951).

  • Message rejections: logs at the policy decision (apply_policy, INFO), the
    closed-chain auto-reject (prepare_chain_info_response, INFO), and the validator-side
    execution of a rejection (block_tracker, DEBUG). No new message counter.
  • Validator proposal rejections: block_proposals_received_total +
    block_proposals_rejected_total{error_type} at handle_block_proposal, plus a DEBUG log.
  • Client staging failures: block_staging_failures_total{error_type} + INFO log in
    execute_block.

The two highest-volume sites (proposal rejected, message rejected) are DEBUG, not INFO:
on conway they run at ~10/sec, ~92% routine consensus churn (UnexpectedBlockHeight,
InvalidTimestamp, MissingCrossChainUpdate); genuinely adversarial rejections are
<0.05/sec. The three low-volume sites are INFO. Per-reason rates live in the metrics.

Test Plan

  • Cherry-picked Add observability for message and block rejections #6473's squash commit. Conflicts resolved where conway has diverged from
    main: conway lacks main's impl From<ExecutionError> for LocalNodeError (kept conway's
    layout, added only error_type()), and refactored the closed-chain loop to use
    self.chain / origins_and_inboxes (kept conway's structure, added the is_closed
    capture for the closed-chain log).
  • cargo clippy -p linera-core -p linera-chain --features metrics -- -D warnings — clean.
  • error_type() delegation unit tests — pass.
  • cargo fmt — no changes.
  • Full CI on this PR.

Release Plan

  • Observability only; no protocol or storage format change. This is the testnet branch;
    the change reaches validators on the next deploy after merge.

Links

We have almost no observability when messages or blocks are rejected.
Most rejection
paths are silent (no log, no metric), so rejection rates can't be
measured reliably and
silent rejections are invisible in production. Closes linera-io#6459.

Add logging at every currently-silent rejection decision site, plus
rate-correct
Prometheus counters, reusing the `error_type()` labelling convention
established in linera-io#5951.

- **Message rejections:** logs at the policy decision
(`IncomingBundle::apply_policy`,
INFO), the closed-chain auto-reject (`prepare_chain_info_response`,
INFO), and the
validator-side execution of a rejection (`BlockExecutionTracker`, DEBUG
— see Log levels).
The execution-failure path already logged. No new message counter — the
rejection rate
stays on the existing commit-time pair `rejected_bundle_count` /
`incoming_bundle_count`;
the reason (which can't be recovered from a committed block) lives in
the logs.
- **Validator proposal rejections:** `block_proposals_received_total`
(unconditional) +
`block_proposals_rejected_total{error_type}` at the worker's
`handle_block_proposal`,
plus a DEBUG log (see Log levels). `error_type` reuses
`WorkerError::error_type()`,
  surfacing `WorkerError::InvalidOwner`, `ChainError::WrongRound`, etc.
- **Client staging failures:**
`block_staging_failures_total{error_type}` + INFO log in
`ChainClient::execute_block`, reusing `execute_block_latency`'s count as
the denominator.
Adds `error_type()` to the client `Error` and `LocalNodeError`,
delegating into
`WorkerError` / `ChainError` exactly like `WorkerError::error_type` from

Each rate lives within one observation point + process role, so
numerator and denominator
emit from the same site and metric names can't be summed across points.

The three low-volume decision sites are INFO; the two high-volume sites
(proposal rejected,
message rejected) are DEBUG. Rationale from prod (`testnet-conway`,

`rate(linera_server_request_error{method_name="handle_block_proposal"}[1h])`):

- Proposal rejections run at **~10/sec cluster-wide** (~1.7/sec per
shard), but **~92% is
routine consensus churn** — `UnexpectedBlockHeight` (4.0/s),
`InvalidTimestamp` (3.7/s),
`MissingCrossChainUpdate` (1.6/s). The genuinely adversarial rejections
(`InvalidOwner`,
`InvalidSigner`, `WrongRound`, …) sum to **<0.05/sec**. This is why
  kept these at DEBUG.
- The message-rejection site fires **per message × per executing node**
and re-runs during
confirmed-block execution and catch-up sync, so a syncing node would
replay the entire
  rejection history into Loki as a burst.

At INFO these two would emit ~10/sec of almost-entirely-liveness noise,
contradicting the
issue's own goal of making rejection data usable. DEBUG keeps them
opt-in. The per-`reason`
rates are fully preserved in the metrics
(`block_proposals_rejected_total{error_type}`,
etc.) — the logs are just the detail layer.

- `cargo clippy --all-targets --all-features -- -D warnings` — clean.
- `cargo clippy --no-default-features -- -D warnings` (linera-core,
linera-chain) — clean.
- `cargo doc --all-features` — clean.
- New unit tests covering `error_type()` delegation and prefix fallback
(`LocalNodeError -> WorkerError`, `Error -> LocalNodeError`, the
`ChainClientError::` /
  `LocalNodeError::` fallbacks) — pass.
- **NOT yet verified end-to-end on a running network.** Metric
registration/increments and
log emission are validated by inspection and compilation only; they
follow the existing
metric/log patterns in these files. A local-net run confirming the
counters appear on
  `/metrics` is still pending.

- These changes follow the usual release cycle (observability only; no
protocol or storage
format change). Recommend backporting to the latest `testnet` branch,
since that is where
the rejection-visibility gap is felt operationally — though this is not
a bug fix.

- Closes linera-io#6459
- Builds on linera-io#5951 (`error_type` labelling for server/proxy error
metrics)
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
@ndr-ds ndr-ds marked this pull request as ready for review June 26, 2026 23:13
Comment thread linera-chain/src/block_tracker.rs Outdated
debug!(
chain_id = %self.chain_id,
origin = %incoming_bundle.origin,
"Rejecting incoming message"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now printed whenever we execute a block containing a rejected message.

The other log, in data_types, has similar wording but is where we make the decision to reject a message.

There is at least one more place where we make that decision, I think, which should probably also be logged: When executing a message fails during staging.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you removed this one, but I think there are still other places where we decide to choose Action::Reject, and we should either log at all or none of those places?

@ndr-ds ndr-ds requested review from afck, deuszx and ma2bd and removed request for afck July 1, 2026 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants