Track service-oracle execution time only when the service runs, not on replay #6517

Draft
ndr-ds wants to merge 1 commit into main from ndr-ds/fix-service-oracle-replay-determinism

@ndr-ds ndr-ds commented Jun 19, 2026

Contributor

Draft — flagged for execution-team review. This changes consensus-relevant
execution (service-oracle execution-time accounting). Please sanity-check the reasoning
before this is considered for merge/backport.

Motivation

When a validator re-executes a certified block it replays the recorded oracle responses, so
the computed outcome must be a deterministic function of the block and those responses — the
chain worker compares the re-computed (messages, state_hash) against the certified one and
raises CorruptedChainState on any mismatch.

run_service_oracle_query (linera-execution/src/runtime.rs) breaks this for blocks that
call a service as an oracle (query_service, used by pm-oracle):

  • The service response is correctly memoized and replayed — execution_state_actor.rs
    runs the service inside TransactionTracker::oracle(), whose closure is skipped on replay.
  • But the runtime times the whole request with Instant::now()…elapsed() and feeds it to
    ResourceController::track_service_oracle_execution, which accumulates it and hard-aborts
    with MaximumServiceOracleExecutionTimeExceeded once the running total crosses
    policy.maximum_service_oracle_execution_ms.
  • That timing runs on every execution, including replay. Two honest validators
    re-executing the same block measure different wall-clock times (host load, scheduling), so
    one can cross the limit and abort while the other completes → different (messages, state_hash) → CorruptedChainState.
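
To make the failure mode concrete, here is a minimal sketch of the accumulate-and-abort logic described in the bullets above. It is not the code in linera-execution: `Tracker`, `replay_block`, and the strict `>` comparison are simplified assumptions made for illustration, and the per-call durations are passed in as plain values to stand in for what each validator would measure with its own clock.

```rust
use std::time::Duration;

/// Hypothetical, simplified stand-in for the resource tracker described above.
#[derive(Debug, Default)]
struct Tracker {
    service_oracle_execution: Duration,
}

#[derive(Debug, PartialEq, Eq)]
enum ExecutionError {
    MaximumServiceOracleExecutionTimeExceeded,
}

impl Tracker {
    /// Adds `elapsed` to the running total and aborts once the total exceeds `maximum`.
    /// (Assumption: the real check may use a different comparison.)
    fn track_service_oracle_execution(
        &mut self,
        elapsed: Duration,
        maximum: Duration,
    ) -> Result<(), ExecutionError> {
        self.service_oracle_execution = self.service_oracle_execution.saturating_add(elapsed);
        if self.service_oracle_execution > maximum {
            return Err(ExecutionError::MaximumServiceOracleExecutionTimeExceeded);
        }
        Ok(())
    }
}

/// Re-executes one block whose service-oracle calls took `measured` wall-clock
/// time on this particular validator.
fn replay_block(measured: &[Duration], maximum: Duration) -> Result<(), ExecutionError> {
    let mut tracker = Tracker::default();
    for elapsed in measured {
        tracker.track_service_oracle_execution(*elapsed, maximum)?;
    }
    Ok(())
}
```

With a 10 ms limit, a validator that measures 3 ms + 4 ms completes the block, while one that measures 3 ms + 9 ms for the same block aborts. The block and the recorded responses are identical in both cases; only the local clock readings differ.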

Wall-clock leaking into the block outcome is a determinism violation, and a candidate root
cause for the recent per-validator-disjoint "Corrupted chain state" surge on testnet_conway
(the affected chains are PM chains, and pm-oracle calls query_service).

Proposal

Measure the service-oracle execution time where the service actually runs — inside the
actor's oracle() closure, which is skipped during replay — and report it back to the runtime
to track. On replay the closure does not run, so the reported time is Duration::ZERO and
nothing accumulates. The per-call deadline (used only by the proposer during validation) is
unchanged. Wall-clock time is now only ever consumed by the proposer at validation time, never
compared across replaying validators.
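
A rough sketch of the shape of the change, under stated assumptions: `OracleTracker` below is a hypothetical, simplified stand-in for TransactionTracker, and its `oracle` signature (returning the response together with the measured time) is invented for illustration rather than taken from the diff. The point it shows is that the clock is read only on the branch where the service actually runs.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a tracker that records or replays oracle responses.
struct OracleTracker {
    /// `Some` while replaying a certified block, `None` on fresh execution.
    replaying: Option<VecDeque<Vec<u8>>>,
    /// Responses recorded during fresh execution.
    recorded: Vec<Vec<u8>>,
}

impl OracleTracker {
    fn fresh() -> Self {
        OracleTracker { replaying: None, recorded: Vec::new() }
    }

    fn replay(responses: Vec<Vec<u8>>) -> Self {
        OracleTracker { replaying: Some(responses.into()), recorded: Vec::new() }
    }

    /// Returns the oracle response and the time spent running the service.
    /// On replay the closure is skipped, so the reported time is `Duration::ZERO`.
    fn oracle<F>(&mut self, run_service: F) -> (Vec<u8>, Duration)
    where
        F: FnOnce() -> Vec<u8>,
    {
        if let Some(responses) = self.replaying.as_mut() {
            let response = responses
                .pop_front()
                .expect("missing recorded oracle response");
            return (response, Duration::ZERO);
        }
        // Fresh execution: time only the actual service run.
        let start = Instant::now();
        let response = run_service();
        let elapsed = start.elapsed();
        self.recorded.push(response.clone());
        (response, elapsed)
    }
}
```

The caller would then pass the returned duration to the resource controller. On replay that value is always zero, so the accumulated total cannot differ between validators.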

Test Plan

  • New regression test test_query_service_does_not_track_execution_time_on_replay
    (contract_runtime_apis.rs): replays a query_service call and asserts
    controller.tracker.service_oracle_execution == Duration::ZERO. It fails on the old code
    (which tracks the round-trip elapsed(), always > 0) and passes with the fix.
  • cargo clippy -p linera-execution clean.

Release Plan

  • These changes should be backported to the latest testnet branch, then be released in a
    validator hotfix. (Consensus determinism: affects certified-block re-execution.)

Links

Open questions for reviewers

  • Is there any case where a replaying validator should still enforce a service-oracle time
    limit? (I believe not — the response is certified and no real service work runs.)
  • Alternative considered: record the measured execution time inside the OracleResponse so it
    is certified and replayed deterministically, rather than discarded on replay. This PR takes
    the smaller change (discard on replay); happy to switch if you prefer the recorded-time
    approach.
  • reviewer checklist
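
For comparison, the recorded-time alternative could look roughly like the sketch below. Everything here is an assumption: `RecordedServiceResponse` and its fields are hypothetical and do not reflect the actual OracleResponse layout, and storing whole milliseconds is just one way to keep the certified value exact across serialization.

```rust
use std::time::Duration;

/// Hypothetical shape of a service response that carries its measured time.
/// The real `OracleResponse` type is not shown in this PR.
#[derive(Clone, Debug, PartialEq, Eq)]
struct RecordedServiceResponse {
    bytes: Vec<u8>,
    /// Measured once by the proposer and certified along with the response.
    execution_ms: u64,
}

/// Sums the certified times. Every validator reads the same recorded values,
/// so the total is identical on every replay.
fn certified_execution_time(responses: &[RecordedServiceResponse]) -> Duration {
    let total_ms = responses
        .iter()
        .fold(0u64, |acc, r| acc.saturating_add(r.execution_ms));
    Duration::from_millis(total_ms)
}

/// Deterministic limit check based only on certified data.
fn exceeds_limit(responses: &[RecordedServiceResponse], maximum_ms: u64) -> bool {
    certified_execution_time(responses) > Duration::from_millis(maximum_ms)
}
```

Under this variant a replaying validator could still enforce the limit, because the check depends only on certified data. The cost is a change to the recorded response format, which is why this PR takes the smaller route.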
