Improve failure observability for the incident-resolution agent #120
Open
elina-chertova wants to merge 3 commits into
The agent has to diagnose portal alerts from pod logs alone. Several failure classes left no usable trace there:

- worker-query failures were metrics-only: emit one warn per failure with peer_id and failure kind (timeout / transport_error / invalid_response / read_error)
- the query_worker span records peer_id, dataset, chunk, block range and request_id, so every per-query event is attributable
- "Stream got interrupted" carries dataset, request_id and stream_index
- "Stream finished" carries request_id so it can be joined with the access-log line of the same request
- panics are logged through tracing (message + location) before the default handler runs; task panics used to die as bare stderr text
- the head-lookup task panic no longer fails the whole stream response via bare unwrap: it is logged and the head header is omitted
- blockchain-state fetch failures carry a consecutive_failures counter and escalate warn -> error after 5 in a row
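For illustration, the panic-logging item above can be sketched with the standard library's panic hook API. This is a minimal sketch, not the code in this PR: `panic_message` and `install_logging_panic_hook` are hypothetical names, and `eprintln!` stands in for the structured tracing event so the example needs no external crates.

```rust
use std::any::Any;
use std::panic;

/// Extracts a readable message from a panic payload.
/// `panic!` payloads are normally `&'static str` or `String`; anything else
/// (for example a value passed to `std::panic::panic_any`) gets a placeholder.
fn panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        *s
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.as_str()
    } else {
        "<non-string panic payload>"
    }
}

/// Hypothetical helper: installs a hook that logs message + location first,
/// then delegates to whatever hook was installed before (by default, the
/// standard handler that prints to stderr).
fn install_logging_panic_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        let location = info
            .location()
            .map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "<unknown>".to_string());
        // Stand-in for a structured log event (assumption: the real code
        // would emit this through the tracing pipeline instead).
        eprintln!(
            "level=error event=panic message={:?} location={}",
            panic_message(info.payload()),
            location
        );
        previous(info);
    }));
}
```

Because the hook is process-wide, it also fires for panics inside spawned tasks and threads, which is what lets them show up in the same log stream as everything else.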
Contributor
Worker failures may be too frequent even on the normal path. Please run a real portal on mainnet, stream some data from the "arbitrum-one" dataset, and check how many logs it emits.
This reverts commit 575fb82.
Logging-only changes so portal failures can be diagnosed from pod logs alone — primarily for the automated incident-resolution agent that reads them (one behavioral exception noted below).
- Worker-query failures: one warn per failure with peer_id and failure kind (timeout / transport_error / invalid_response / read_error); previously metrics-only
- query_worker span: records peer_id, dataset, chunk, block range and request_id, so every per-query event is attributable
- Stream got interrupted: now carries dataset, request_id, stream_index
- Stream finished: now carries request_id, joinable with the access-log line of the same request
- Panics: logged through tracing (message + location) before the default handler runs; task panics used to die as bare stderr text outside the JSON pipeline
- Head-lookup task panic: no longer fails the whole stream response via bare unwrap; it is logged and the best-effort head header is omitted (the one behavioral change)
- Blockchain-state fetch failures: carry a consecutive_failures counter, escalating warn → error after 5 in a row

Known risk: warn volume scales with the worker-failure rate during a large incident; if it proves noisy, rate-limit at the call site rather than dropping the events.
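If rate-limiting at the call site does become necessary, one possible shape is a small fixed-window limiter that counts what it suppresses, so the next emitted warn can report how many events were skipped. This is a sketch under assumptions, not part of this PR: `WarnLimiter`, `Decision`, the window length and the per-window budget are all hypothetical, and the caller passes in the current time.

```rust
use std::time::{Duration, Instant};

/// Outcome of asking the limiter whether one warn event may be logged.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Log the event; `suppressed_before` events were skipped since the
    /// last emitted one and can be attached as a field on this log line.
    Emit { suppressed_before: u64 },
    /// Skip the log line; the event is still counted.
    Suppress,
}

/// Hypothetical fixed-window limiter: at most `max_per_window` emitted
/// events per `window`, with a running count of suppressed events.
struct WarnLimiter {
    window: Duration,
    max_per_window: u32,
    window_start: Instant,
    emitted_in_window: u32,
    suppressed: u64,
}

impl WarnLimiter {
    fn new(window: Duration, max_per_window: u32, now: Instant) -> Self {
        Self {
            window,
            max_per_window,
            window_start: now,
            emitted_in_window: 0,
            suppressed: 0,
        }
    }

    /// Decides whether the event observed at `now` should be logged.
    fn check(&mut self, now: Instant) -> Decision {
        // Start a fresh window once the current one has elapsed.
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.emitted_in_window = 0;
        }
        if self.emitted_in_window < self.max_per_window {
            self.emitted_in_window += 1;
            // Hand the suppressed count to the caller and reset it.
            Decision::Emit {
                suppressed_before: std::mem::take(&mut self.suppressed),
            }
        } else {
            self.suppressed += 1;
            Decision::Suppress
        }
    }
}
```

Taking `now` as a parameter keeps the limiter deterministic under test. Carrying the suppressed count forward means a burst is summarized on a later log line instead of vanishing silently; a burst that ends without a later emitted event would need a separate flush, which this sketch omits.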
Verified with cargo check.