
Improve failure observability for the incident-resolution agent #120

Open
elina-chertova wants to merge 3 commits into master from logging/agent-observability


@elina-chertova


Logging-only changes so portal failures can be diagnosed from pod logs alone — primarily for the automated incident-resolution agent that reads them (one behavioral exception noted below).

  • worker-query failures: one warn per failure with peer_id and failure kind (timeout / transport_error / invalid_response / read_error); previously metrics-only
  • query_worker span: records peer_id, dataset, chunk, block range and request_id, so every per-query event is attributable
  • Stream got interrupted: now carries dataset, request_id, stream_index
  • Stream finished: now carries request_id, joinable with the access-log line of the same request
  • panic hook: panics are logged through tracing (message + location) before the default handler runs; task panics used to die as bare stderr text outside the JSON pipeline
  • head-lookup task: panic no longer fails the whole stream response via bare unwrap — it is logged and the best-effort head header is omitted (the one behavioral change)
  • blockchain-state fetch: consecutive_failures counter, escalates warn -> error after 5 in a row
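For the panic-hook item, a minimal std-only sketch of the pattern might look like the following. It is not the code from this PR: `panic_message` and `install_panic_hook` are hypothetical names, and the `eprintln!` call stands in for the `tracing` event the description mentions. The sketch extracts the message and location, emits one line, and then chains to whatever hook was installed before.

```rust
use std::any::Any;
use std::panic;

/// Extract a readable message from a panic payload.
/// `panic!` payloads are normally `&'static str` or `String`;
/// anything else gets a placeholder.
pub fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "<non-string panic payload>".to_string()
    }
}

/// Install a hook that logs message + location, then runs the previous hook.
/// Hypothetical helper; intended to be called once, early in `main`.
pub fn install_panic_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        let message = panic_message(info.payload());
        let location = info
            .location()
            .map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "<unknown>".to_string());
        // Stand-in for a `tracing::error!` event carrying both fields.
        eprintln!("panic message={:?} location={}", message, location);
        // Keep the default behaviour (stderr report, backtrace handling).
        previous(info);
    }));
}
```

Chaining to the previous hook keeps the default stderr output, so nothing that operators already rely on disappears; the structured line is purely additive.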

Known risk: warn volume scales with the worker-failure rate during a large incident; if it proves noisy, rate-limit at the call site rather than dropping the events.
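One possible shape for such a call-site limiter is sketched below. This is an illustration, not part of the PR: `LogLimiter` and `check` are hypothetical names, and the single global interval is an assumption (a real version might key on peer_id or failure kind). Suppressed occurrences are counted and reported with the next emitted line, so the volume stays visible even when individual lines are skipped.

```rust
use std::time::{Duration, Instant};

/// Hypothetical call-site limiter: at most one log line per `interval`,
/// plus a count of occurrences suppressed since the last emitted line.
pub struct LogLimiter {
    interval: Duration,
    last_emit: Option<Instant>,
    suppressed: u64,
}

impl LogLimiter {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_emit: None, suppressed: 0 }
    }

    /// Returns `Some(n)` when the caller should log now, where `n` is the
    /// number of occurrences suppressed since the previous emitted line.
    /// Returns `None` when this occurrence should only be counted.
    pub fn check(&mut self, now: Instant) -> Option<u64> {
        let due = match self.last_emit {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= self.interval,
        };
        if due {
            self.last_emit = Some(now);
            // Hand the suppressed count to the caller and reset it.
            Some(std::mem::take(&mut self.suppressed))
        } else {
            self.suppressed += 1;
            None
        }
    }
}
```

At the call site, the warn would be emitted only when `check(Instant::now())` returns `Some(n)`, with `n` attached as an extra field.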

Verified with cargo check.

The agent has to diagnose portal alerts from pod logs alone. Several
failure classes left no usable trace there:

- worker-query failures were metrics-only: emit one warn per failure
  with peer_id and failure kind (timeout / transport_error /
  invalid_response / read_error)
- the query_worker span records peer_id, dataset, chunk, block range
  and request_id, so every per-query event is attributable
- "Stream got interrupted" carries dataset, request_id and stream_index
- "Stream finished" carries request_id so it can be joined with the
  access-log line of the same request
- panics are logged through tracing (message + location) before the
  default handler runs; task panics used to die as bare stderr text
- the head-lookup task panic no longer fails the whole stream response
  via bare unwrap: it is logged and the head header is omitted
- blockchain-state fetch failures carry a consecutive_failures counter
  and escalate warn -> error after 5 in a row
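
The consecutive_failures escalation in the last item could be sketched roughly as follows. This is a std-only illustration, not the PR's code: `FailureStreak` and `Severity` are hypothetical names, the returned severity stands in for choosing between a warn and an error event, and treating the 5th consecutive failure as the first one logged at error is an assumed reading of "after 5 in a row".

```rust
/// Severity chosen for a fetch-failure log line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Warn,
    Error,
}

/// Assumed threshold: the 5th consecutive failure is the first logged at error.
pub const ESCALATE_AFTER: u32 = 5;

/// Tracks consecutive failures of a periodic fetch (hypothetical helper).
#[derive(Debug, Default)]
pub struct FailureStreak {
    consecutive_failures: u32,
}

impl FailureStreak {
    /// Record a failure; returns the updated counter and the severity to log at.
    pub fn record_failure(&mut self) -> (u32, Severity) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        let severity = if self.consecutive_failures >= ESCALATE_AFTER {
            Severity::Error
        } else {
            Severity::Warn
        };
        (self.consecutive_failures, severity)
    }

    /// Any successful fetch resets the streak.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```

Returning the counter alongside the severity lets the caller attach consecutive_failures to the log line in both the warn and the error case.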
@elina-chertova elina-chertova marked this pull request as ready for review June 11, 2026 08:08
@kalabukdima

Contributor

request-id is already included in the "Stream got interrupted" and "Stream finished" messages. dataset is missing but can be recovered by joining on request-id.

Worker failures may be too frequent even on the normal path. Please run a real portal on mainnet, stream some data from the "arbitrum-one" dataset, and check how many log lines it emits.
