Track service-oracle execution time only when the service runs, not on replay #6517
Draft
ndr-ds wants to merge 1 commit into
Motivation
When a validator re-executes a certified block, it replays the recorded oracle responses, so
the computed outcome must be a deterministic function of the block and those responses: the
chain worker compares the re-computed `(messages, state_hash)` against the certified one and
raises `CorruptedChainState` on any mismatch.

`run_service_oracle_query` (`linera-execution/src/runtime.rs`) breaks this for blocks that
call a service as an oracle (`query_service`, used by pm-oracle):

- `execution_state_actor.rs` runs the service inside `TransactionTracker::oracle()`, whose
  closure is skipped on replay.
- The runtime nevertheless times the whole round trip with `Instant::now()…elapsed()` and
  feeds it to `ResourceController::track_service_oracle_execution`, which accumulates it and
  hard-aborts with `MaximumServiceOracleExecutionTimeExceeded` once the running total crosses
  `policy.maximum_service_oracle_execution_ms`.
- Validators re-executing the same block measure different wall-clock times (host load,
  scheduling), so one can cross the limit and abort while the other completes → different
  `(messages, state_hash)` → `CorruptedChainState`.

Wall-clock leaking into the block outcome is a determinism violation, and a candidate root
cause for the recent per-validator-disjoint "Corrupted chain state" surge on testnet_conway
(the affected chains are PM chains, and pm-oracle calls `query_service`).

Proposal
Measure the service-oracle execution time where the service actually runs, inside the
actor's `oracle()` closure, which is skipped during replay, and report it back to the runtime
to track. On replay the closure does not run, so the reported time is `Duration::ZERO` and
nothing accumulates. The per-call `deadline` (used only by the proposer during validation) is
unchanged. Wall-clock time is now only ever consumed by the proposer at validation time, never
compared across replaying validators.
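
For reviewers who want the shape of the change without opening the diff, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `OracleMode`, `oracle`, `Tracker`, and `ExecutionTimeExceeded` are hypothetical stand-ins for `TransactionTracker::oracle()`, the resource tracker, and `MaximumServiceOracleExecutionTimeExceeded`, and the exclusive comparison against the limit is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for how an oracle call is resolved.
pub enum OracleMode {
    /// Proposer: run the service and record its response.
    Execute,
    /// Validator: reuse the response recorded in the certified block.
    Replay(Vec<u8>),
}

/// Resolves an oracle call. `run_service` is invoked only in `Execute` mode,
/// so the clock is read only when real service work happens.
pub fn oracle<F>(mode: OracleMode, run_service: F) -> (Vec<u8>, Duration)
where
    F: FnOnce() -> Vec<u8>,
{
    match mode {
        OracleMode::Execute => {
            let start = Instant::now();
            let response = run_service();
            (response, start.elapsed())
        }
        // The closure is skipped, so there is nothing to time.
        OracleMode::Replay(recorded) => (recorded, Duration::ZERO),
    }
}

/// Hypothetical error mirroring `MaximumServiceOracleExecutionTimeExceeded`.
#[derive(Debug, PartialEq)]
pub struct ExecutionTimeExceeded;

/// Simplified tracker: accumulates reported time and aborts past the limit.
pub struct Tracker {
    pub service_oracle_execution: Duration,
    pub maximum: Duration,
}

impl Tracker {
    pub fn new(maximum: Duration) -> Self {
        Tracker { service_oracle_execution: Duration::ZERO, maximum }
    }

    pub fn track_service_oracle_execution(
        &mut self,
        reported: Duration,
    ) -> Result<(), ExecutionTimeExceeded> {
        self.service_oracle_execution =
            self.service_oracle_execution.saturating_add(reported);
        // Assumption: the limit is exclusive; the real comparison may differ.
        if self.service_oracle_execution > self.maximum {
            return Err(ExecutionTimeExceeded);
        }
        Ok(())
    }
}
```

In this sketch, calling `oracle(OracleMode::Replay(bytes), ...)` never invokes the closure and always reports `Duration::ZERO`, so feeding the result to `track_service_oracle_execution` leaves the running total untouched on every replaying validator, regardless of host load.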
Test Plan
- `test_query_service_does_not_track_execution_time_on_replay`
  (`contract_runtime_apis.rs`): replays a `query_service` call and asserts
  `controller.tracker.service_oracle_execution == Duration::ZERO`. It fails on the old code
  (which tracks the round-trip `elapsed()`, always > 0) and passes with the fix.
- `cargo clippy -p linera-execution` is clean.

Release Plan
These changes should be backported to the latest `testnet` branch, then […]

(Consensus determinism: affects certified-block re-execution.)
Links
- […] `JournalingError::must_reload_view`), different mechanism.

Open questions for reviewers

- […] limit? (I believe not: the response is certified and no real service work runs.)
- […] `OracleResponse` so it is certified and replayed deterministically, rather than
  discarded on replay. This PR takes the smaller change (discard on replay); happy to switch
  if you prefer the recorded-time approach.
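
To help weigh the two options, here is a rough sketch of what the recorded-time approach could look like. Everything in it is hypothetical: `RecordedServiceResponse`, `execute_and_record`, and `replay` are illustrative names, and the real `OracleResponse` type is not modeled here. The only point is that the proposer reads the clock once and every replaying validator consumes the same certified number.

```rust
use std::time::{Duration, Instant};

/// Hypothetical recorded response: the service's bytes plus the execution time
/// the proposer measured, stored as whole milliseconds so it round-trips exactly.
#[derive(Clone, Debug, PartialEq)]
pub struct RecordedServiceResponse {
    pub bytes: Vec<u8>,
    pub execution_ms: u64,
}

/// Proposer side: run the service, read the clock once, and record both results.
pub fn execute_and_record<F>(run_service: F) -> RecordedServiceResponse
where
    F: FnOnce() -> Vec<u8>,
{
    let start = Instant::now();
    let bytes = run_service();
    // Truncating cast: a u64 of milliseconds is far beyond any realistic run time.
    let execution_ms = start.elapsed().as_millis() as u64;
    RecordedServiceResponse { bytes, execution_ms }
}

/// Validator side: no clock is read; the certified duration is reused as-is,
/// so every validator accumulates the same value.
pub fn replay(recorded: &RecordedServiceResponse) -> (Vec<u8>, Duration) {
    (
        recorded.bytes.clone(),
        Duration::from_millis(recorded.execution_ms),
    )
}
```

Compared with discarding the time on replay, this variant keeps the limit enforceable during re-execution, at the cost of changing what gets recorded and certified in the block.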