Improve failure observability for the incident-resolution agent by elina-chertova · Pull Request #120 · subsquid/sqd-portal

elina-chertova · 2026-06-11T07:26:25Z

Logging-only changes so portal failures can be diagnosed from pod logs alone — primarily for the automated incident-resolution agent that reads them (one behavioral exception noted below).

- worker-query failures: one warn per failure with peer_id and failure kind (timeout / transport_error / invalid_response / read_error); previously metrics-only
- query_worker span: records peer_id, dataset, chunk, block range and request_id, so every per-query event is attributable
- Stream got interrupted: now carries dataset, request_id, stream_index
- Stream finished: now carries request_id, joinable with the access-log line of the same request
- panic hook: panics are logged through tracing (message + location) before the default handler runs; task panics used to die as bare stderr text outside the JSON pipeline
- head-lookup task: panic no longer fails the whole stream response via bare unwrap: it is logged and the best-effort head header is omitted (the one behavioral change)
- blockchain-state fetch: consecutive_failures counter, escalates warn → error after 5 in a row
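
The escalation rule in the last item can be sketched as a small counter. This is a minimal sketch, not the PR's code: `FailureStreak`, `Severity` and `ESCALATE_AFTER` are hypothetical names, `Severity` stands in for the `tracing` levels, and the sketch assumes that "after 5 in a row" means the fifth consecutive failure is the first one logged at error.

```rust
/// Level chosen for a failed fetch (stand-in for the tracing levels).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Warn,
    Error,
}

/// Assumption: the 5th failure in a row is the first one logged at error.
pub const ESCALATE_AFTER: u32 = 5;

/// Tracks consecutive fetch failures. Hypothetical helper, not the portal's type.
#[derive(Debug, Default)]
pub struct FailureStreak {
    consecutive_failures: u32,
}

impl FailureStreak {
    /// Records a failed attempt and returns the streak length together with
    /// the level the caller should log at.
    pub fn on_failure(&mut self) -> (u32, Severity) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        let severity = if self.consecutive_failures >= ESCALATE_AFTER {
            Severity::Error
        } else {
            Severity::Warn
        };
        (self.consecutive_failures, severity)
    }

    /// A successful fetch ends the streak, so the next failure is a warn again.
    pub fn on_success(&mut self) {
        self.consecutive_failures = 0;
    }
}
```

A caller would invoke `on_failure()` in the error branch of the fetch loop, attach the returned count to the log line as `consecutive_failures`, and call `on_success()` after every successful fetch.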

Known risk: warn volume scales with the worker-failure rate during a large incident; if it proves noisy, rate-limit at the call site rather than dropping the events.
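
One way to rate-limit at the call site without losing events is to emit at most one line per interval and carry the number of suppressed events in the next emitted line. The sketch below assumes that approach; `LogThrottle` and `check` are hypothetical names, and the caller passes the current time in so the behavior stays deterministic.

```rust
use std::time::{Duration, Instant};

/// Call-site limiter: at most one log line per `interval`, plus a count of
/// the events folded into that line. Hypothetical helper.
pub struct LogThrottle {
    interval: Duration,
    last_emit: Option<Instant>,
    suppressed: u64,
}

impl LogThrottle {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_emit: None, suppressed: 0 }
    }

    /// Returns `Some(n)` when the caller should log now, where `n` is the
    /// number of events suppressed since the previous emitted line.
    /// Returns `None` when this event should only be counted.
    pub fn check(&mut self, now: Instant) -> Option<u64> {
        let due = match self.last_emit {
            None => true,
            Some(prev) => now.saturating_duration_since(prev) >= self.interval,
        };
        if due {
            self.last_emit = Some(now);
            // Hand the counter to the caller and reset it to zero.
            Some(std::mem::take(&mut self.suppressed))
        } else {
            self.suppressed += 1;
            None
        }
    }
}
```

At the warn site, the caller would log only when `check(Instant::now())` returns `Some(n)` and include `n` as a field, so a reader of the pod logs can still see how many failures occurred between lines.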

Verified with cargo check.

The agent has to diagnose portal alerts from pod logs alone. Several failure classes left no usable trace there:

- worker-query failures were metrics-only: emit one warn per failure with peer_id and failure kind (timeout / transport_error / invalid_response / read_error)
- the query_worker span records peer_id, dataset, chunk, block range and request_id, so every per-query event is attributable
- "Stream got interrupted" carries dataset, request_id and stream_index
- "Stream finished" carries request_id so it can be joined with the access-log line of the same request
- panics are logged through tracing (message + location) before the default handler runs; task panics used to die as bare stderr text
- the head-lookup task panic no longer fails the whole stream response via bare unwrap: it is logged and the head header is omitted
- blockchain-state fetch failures carry a consecutive_failures counter and escalate warn -> error after 5 in a row
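
The panic-hook item can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the PR's implementation: `install_logging_panic_hook` is a hypothetical name, and the `sink` vector stands in for the structured logger (the PR logs through `tracing`). The hook records the message and location, then delegates to the previously installed hook, so the default stderr output still appears.

```rust
use std::panic;
use std::sync::{Arc, Mutex};

/// Installs a hook that records "message + location" for every panic and then
/// calls the hook that was installed before it (the default one unless it was
/// already replaced). `sink` is a stand-in for the structured logger.
pub fn install_logging_panic_hook(sink: Arc<Mutex<Vec<String>>>) {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Panic payloads are usually `&'static str` or `String`.
        let payload = info.payload();
        let message = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "<non-string panic payload>".to_string()
        };
        let location = info
            .location()
            .map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "<unknown>".to_string());
        // Skip recording if the sink mutex is poisoned rather than panic again.
        if let Ok(mut lines) = sink.lock() {
            lines.push(format!("panic message={message:?} location={location}"));
        }
        previous(info);
    }));
}
```

Because the hook is process-wide, it also fires for panics inside spawned threads or tasks, which is the case the description calls out as previously invisible to the JSON pipeline.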

kalabukdima · 2026-07-01T09:30:12Z

request-id is already included in the "Stream got interrupted" and "Stream finished" messages. dataset is missing, but it can be found by correlating on request-id.

Worker failures may be too frequent even on the normal path. Please run a real portal on mainnet, stream some data from the "arbitrum-one" dataset, and check how many log lines it emits.

elina-chertova assigned mo4islona Jun 11, 2026

elina-chertova marked this pull request as ready for review June 11, 2026 08:08

elina-chertova added 2 commits July 2, 2026 11:02

Aggregate worker failure logging

575fb82

Revert "Aggregate worker failure logging"

816728c

This reverts commit 575fb82.

elina-chertova wants to merge 3 commits into master from logging/agent-observability
