Classify write-batch failures as must_reload_view in RocksDB and storage-service backends #6515

Draft
ndr-ds wants to merge 1 commit into main from ndr-ds/require-explicit-must-reload-view


@ndr-ds ndr-ds commented Jun 19, 2026

Contributor

Motivation

KeyValueStoreError::must_reload_view() tells the chain worker whether a storage
error may have left the in-memory view inconsistent with durable storage (that is,
a write that may or may not have landed). When it returns true, the worker is
evicted and reloaded from storage. Misclassifying a write-uncertainty error as
benign causes a silent desync between the in-memory view and storage, which was
the root cause of the recent "Corrupted chain state" incident, fixed for the
journaling layer in #6507 / #6508.

An audit of the rest of the storage stack found two backends that never override
the trait's default must_reload_view() == false, so an ambiguous write is silently
treated as safe:

  • RocksDB (rocks_db.rs): a write_batch failure from db.write() surfaces as
    the generic RocksDb variant → false. RocksDB is the storage backend used by the
    workers, which is exactly where the recent payout divergence appeared, so this
    is a production silent-desync vector.
  • storage-service (common.rs): a write_batch is sent over gRPC. A
    DEADLINE_EXCEEDED/UNAVAILABLE/transport failure that occurs after the server
    has committed the batch but before the client receives the ack surfaces as
    GrpcError/TransportError → false.
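
To illustrate how an error type ends up on the default path, here is a minimal sketch. It models only the must_reload_view() method; the trait body, the SharedVariantError type, and the should_evict helper are simplified stand-ins invented for illustration, not the actual linera-views definitions.

```rust
use std::fmt;

/// Simplified stand-in for the real trait; only the method discussed here is modeled.
trait KeyValueStoreError: fmt::Debug {
    /// Default used by any backend that does not override it.
    fn must_reload_view(&self) -> bool {
        false
    }
}

/// Hypothetical backend error with one shared variant for reads and writes.
#[derive(Debug)]
enum SharedVariantError {
    Backend(String),
}

// No override: every error, including a failed batch write, reports `false`.
impl KeyValueStoreError for SharedVariantError {}

/// Hypothetical worker-side decision: evict the cached view only when the
/// error says the in-memory state may no longer match storage.
fn should_evict(error: &dyn KeyValueStoreError) -> bool {
    error.must_reload_view()
}
```

In this sketch, passing a SharedVariantError::Backend that came from a failed batch write to should_evict returns false, so the worker would keep its in-memory view even though the write outcome is unknown.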

Proposal

  • RocksDB: add a dedicated WriteBatchError(rocksdb::Error) variant (mirroring
    ScyllaDB's WriteBatchExecutionError), map db.write() failures to it, and return
    must_reload_view() == true only for it. Read failures keep using the shared
    RocksDb variant and are unaffected (no reload churn on reads).
  • storage-service: override must_reload_view() to return true for GrpcError
    and TransportError. These variants can also surface on read RPCs (there is no
    dedicated read/write error split), so this conservatively errs toward reloading;
    documented inline.
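
The two classifications proposed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the enum names are hypothetical, the real RocksDB variants wrap rocksdb::Error (a String is used so the sketch compiles on its own), and KeyTooLong is an invented example of a purely local error. Whether RocksDB writes can actually be ambiguous is not established by this sketch.

```rust
/// Hypothetical stand-in for the RocksDB backend error.
#[derive(Debug)]
enum RocksDbSketchError {
    /// Shared variant, still used for read failures.
    RocksDb(String),
    /// Dedicated variant for `db.write()` failures.
    WriteBatchError(String),
}

impl RocksDbSketchError {
    /// Only the dedicated write-batch variant asks for a reload.
    fn must_reload_view(&self) -> bool {
        matches!(self, Self::WriteBatchError(_))
    }
}

/// Hypothetical stand-in for the storage-service client error.
#[derive(Debug)]
enum StorageServiceSketchError {
    GrpcError(String),
    TransportError(String),
    /// Invented example of an error raised locally, before any RPC is sent.
    KeyTooLong,
}

impl StorageServiceSketchError {
    /// gRPC and transport failures may hide a committed write, so both ask
    /// for a reload, even when they happen to come from a read RPC.
    fn must_reload_view(&self) -> bool {
        matches!(self, Self::GrpcError(_) | Self::TransportError(_))
    }
}
```

The asymmetry is deliberate in the proposal: the RocksDB sketch can separate reads from writes by variant, while the storage-service sketch cannot, so it accepts some unnecessary reloads on failed reads.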

Test Plan

  • cargo clippy -p linera-views --features rocksdb and
    cargo clippy -p linera-storage-service --all-targets are clean.
  • No unit test for the RocksDB variant: rocksdb::Error has no public constructor, so
    it cannot be instantiated in a test. The classification is a trivial matches!
    mirroring the proven ScyllaDB pattern; the abstract delegation is covered by the
    journaling test in #6508 ("Fix JournalingError::must_reload_view to delegate to
    the inner store error"), and worker-level behavior will be covered by a follow-up
    fault-injection test.

Release Plan

  • These changes should be backported to the latest testnet branch, then
    • be released in a validator/worker hotfix.

    (RocksDB is the worker storage backend, so the workers benefit directly.)

Links

@github-actions


Instruction Count Benchmark Results

Baseline: 18f37fef50

Deterministic metrics: reproducible across runs (34 benchmarks)

| Benchmark | Instructions | Total R+W |
| --- | --- | --- |
| **BucketQueueView** | | |
| delete_500_from_1000 | 19,046 (${\color{green}\textbf{-1.33\%}}$) | 28,369 (${\color{green}\textbf{-1.23\%}}$) |
| front_100_from_1000 | 5,562 (${\color{green}\textbf{-4.09\%}}$) | 8,149 (${\color{green}\textbf{-3.89\%}}$) |
| pre_save_1000 | 37,633 (-0.44%) | 52,950 (-0.42%) |
| push_1000 | 22,785 (${\color{green}\textbf{-1.03\%}}$) | 31,511 (${\color{green}\textbf{-1.04\%}}$) |
| **Cold Load** | | |
| load_1000 | 690,687 (+0.00%) | 1,006,158 (+0.00%) |
| **CollectionView** | | |
| indices_100 | 187,863 (-0.13%) | 260,638 (-0.13%) |
| load_all_100_from_storage | 613,708 (No change) | 864,896 (No change) |
| load_all_100_in_memory | 336,779 (-0.04%) | 472,140 (-0.04%) |
| pre_save_100 | 257,827 (-0.09%) | 356,749 (-0.09%) |
| try_load_10_from_100 | 100,905 (-0.05%) | 142,657 (-0.06%) |
| **MapView** | | |
| contains_key_10_from_100 | 51,675 (-0.49%) | 72,952 (-0.48%) |
| contains_key_10_from_1000 | 351,616 (No change) | 496,805 (No change) |
| get_10_from_100 | 54,241 (-0.44%) | 76,799 (-0.43%) |
| get_10_from_1000 | 354,211 (No change) | 500,696 (No change) |
| get_100_missing_from_1000 | 597,185 (No change) | 829,538 (No change) |
| indices_100 | 99,211 (-0.26%) | 136,584 (-0.26%) |
| indices_1000 | 940,579 (No change) | 1,311,328 (No change) |
| insert_100 | 253,057 (No change) | 345,961 (No change) |
| insert_1000 | 2,906,810 (No change) | 3,899,149 (No change) |
| post_save_1000 | 1,020,722 (No change) | 1,472,717 (No change) |
| pre_save_100 | 318,364 (No change) | 443,912 (No change) |
| pre_save_1000 | 3,240,326 (-0.00%) | 4,574,338 (-0.00%) |
| remove_500_from_1000 | 1,183,893 (-0.00%) | 1,635,788 (-0.00%) |
| **QueueView** | | |
| delete_500_from_1000 | 11,157 (+0.17%) | 13,231 (+0.18%) |
| front_100_from_1000 | 7,581 (${\color{green}\textbf{-3.27\%}}$) | 11,687 (${\color{green}\textbf{-2.94\%}}$) |
| pre_save_1000 | 1,105,525 (No change) | 1,580,027 (No change) |
| push_1000 | 22,971 (+0.08%) | 31,773 (+0.08%) |
| **ReentrantCollectionView** | | |
| contains_key_10_from_100 | 141,912 (-0.17%) | 201,636 (-0.16%) |
| indices_100 | 232,852 (-0.11%) | 325,773 (-0.11%) |
| load_all_100_from_storage | 772,525 (+0.00%) | 1,088,763 (+0.00%) |
| load_all_100_in_memory | 412,689 (-0.00%) | 568,689 (-0.00%) |
| pre_save_100 | 341,624 (-0.04%) | 475,950 (-0.04%) |
| **RegisterView** | | |
| get_set_100 | 83,830 (No change) | 118,850 (No change) |
| pre_save | 5,160 (${\color{green}\textbf{-4.39\%}}$) | 7,637 (${\color{green}\textbf{-4.14\%}}$) |

Regression threshold: 1%. ${\color{red}\textbf{red}}$ = regression, ${\color{green}\textbf{green}}$ = improvement.

Cache-dependent metrics: expect fluctuations between runs (34 benchmarks)

| Benchmark | L1 Hits | LLC Hits | RAM Hits | Est. Cycles |
| --- | --- | --- | --- | --- |
| **BucketQueueView** | | | | |
| delete_500_from_1000 | 28,179 (${\color{green}\textbf{-1.21\%}}$) | 35 (${\color{green}\textbf{-5.41\%}}$) | 155 (${\color{green}\textbf{-3.73\%}}$) | 33,779 (${\color{green}\textbf{-1.65\%}}$) |
| front_100_from_1000 | 7,986 (${\color{green}\textbf{-3.96\%}}$) | 35 (${\color{red}\textbf{+12.90\%}}$) | 128 (${\color{green}\textbf{-3.76\%}}$) | 12,641 (${\color{green}\textbf{-3.69\%}}$) |
| pre_save_1000 | 52,588 (-0.43%) | 67 (${\color{red}\textbf{+3.08\%}}$) | 295 (-0.34%) | 63,248 (-0.39%) |
| push_1000 | 31,316 (${\color{green}\textbf{-1.02\%}}$) | 45 (${\color{green}\textbf{-2.17\%}}$) | 150 (${\color{green}\textbf{-3.23\%}}$) | 36,791 (${\color{green}\textbf{-1.35\%}}$) |
| **Cold Load** | | | | |
| load_1000 | 997,623 (+0.00%) | 8,364 (+0.07%) | 171 (+0.59%) | 1,045,428 (+0.01%) |
| **CollectionView** | | | | |
| indices_100 | 259,384 (-0.13%) | 873 (+0.23%) | 381 (${\color{green}\textbf{-1.30\%}}$) | 277,084 (-0.18%) |
| load_all_100_from_storage | 860,457 (No change) | 3,782 (No change) | 657 (No change) | 902,362 (No change) |
| load_all_100_in_memory | 470,010 (-0.05%) | 1,392 (+0.87%) | 738 (No change) | 502,800 (-0.03%) |
| pre_save_100 | 354,826 (-0.09%) | 1,340 (+0.07%) | 583 (-0.85%) | 381,931 (-0.13%) |
| try_load_10_from_100 | 141,797 (-0.06%) | 632 (-0.32%) | 228 (No change) | 152,937 (-0.06%) |
| **MapView** | | | | |
| contains_key_10_from_100 | 72,673 (-0.46%) | 85 (${\color{green}\textbf{-9.57\%}}$) | 194 (${\color{green}\textbf{-3.00\%}}$) | 79,888 (-0.74%) |
| contains_key_10_from_1000 | 493,628 (-0.00%) | 2,978 (+0.20%) | 199 (No change) | 515,483 (+0.00%) |
| get_10_from_100 | 76,516 (-0.42%) | 81 (${\color{green}\textbf{-2.41\%}}$) | 202 (${\color{green}\textbf{-2.42\%}}$) | 83,991 (-0.60%) |
| get_10_from_1000 | 497,506 (-0.00%) | 2,982 (+0.03%) | 208 (No change) | 519,696 (+0.00%) |
| get_100_missing_from_1000 | 826,329 (-0.00%) | 2,988 (+0.03%) | 221 (No change) | 849,004 (+0.00%) |
| indices_100 | 135,947 (-0.25%) | 245 (${\color{green}\textbf{-2.00\%}}$) | 392 (${\color{green}\textbf{-1.51\%}}$) | 150,892 (-0.38%) |
| indices_1000 | 1,303,669 (+0.00%) | 6,477 (-0.02%) | 1,182 (No change) | 1,377,424 (-0.00%) |
| insert_100 | 345,233 (+0.00%) | 92 (${\color{green}\textbf{-1.08\%}}$) | 636 (No change) | 367,953 (-0.00%) |
| insert_1000 | 3,892,087 (No change) | 3,096 (No change) | 3,966 (No change) | 4,046,377 (No change) |
| post_save_1000 | 1,461,330 (-0.00%) | 11,213 (+0.04%) | 174 (No change) | 1,523,485 (+0.00%) |
| pre_save_100 | 442,542 (-0.00%) | 764 (+0.26%) | 606 (No change) | 467,572 (+0.00%) |
| pre_save_1000 | 4,560,439 (-0.00%) | 10,091 (No change) | 3,808 (-0.03%) | 4,744,174 (-0.00%) |
| remove_500_from_1000 | 1,631,401 (-0.00%) | 4,215 (No change) | 172 (-0.58%) | 1,658,496 (-0.00%) |
| **QueueView** | | | | |
| delete_500_from_1000 | 13,062 (+0.18%) | 37 (${\color{green}\textbf{-2.63\%}}$) | 132 (+0.76%) | 17,867 (+0.30%) |
| front_100_from_1000 | 11,498 (${\color{green}\textbf{-2.94\%}}$) | 36 (No change) | 153 (${\color{green}\textbf{-3.77\%}}$) | 17,033 (${\color{green}\textbf{-3.17\%}}$) |
| pre_save_1000 | 1,575,327 (-0.00%) | 2,744 (+0.07%) | 1,956 (No change) | 1,657,507 (+0.00%) |
| push_1000 | 31,567 (+0.05%) | 52 (${\color{red}\textbf{+13.04\%}}$) | 154 (+0.65%) | 37,217 (+0.22%) |
| **ReentrantCollectionView** | | | | |
| contains_key_10_from_100 | 200,416 (-0.16%) | 1,027 (+0.20%) | 193 (${\color{green}\textbf{-2.53\%}}$) | 212,306 (-0.23%) |
| indices_100 | 324,216 (-0.11%) | 1,202 (-0.25%) | 355 (${\color{green}\textbf{-1.66\%}}$) | 342,651 (-0.17%) |
| load_all_100_from_storage | 1,081,662 (-0.00%) | 6,700 (+0.37%) | 401 (+0.25%) | 1,129,197 (+0.01%) |
| load_all_100_in_memory | 566,587 (-0.00%) | 1,573 (+0.32%) | 529 (-0.19%) | 592,967 (-0.01%) |
| pre_save_100 | 472,873 (-0.05%) | 2,396 (+0.55%) | 681 (No change) | 508,688 (-0.03%) |
| **RegisterView** | | | | |
| get_set_100 | 118,636 (No change) | 39 (No change) | 175 (No change) | 124,956 (No change) |
| pre_save | 7,448 (${\color{green}\textbf{-4.22\%}}$) | 39 (${\color{red}\textbf{+8.33\%}}$) | 150 (${\color{green}\textbf{-3.23\%}}$) | 12,893 (${\color{green}\textbf{-3.65\%}}$) |

Cache metrics fluctuate because anything that changes the virtual memory layout
shifts which data lands on which cache lines, changing the L1/LLC/RAM distribution.
Probable causes: ASLR (even across identical binaries), executable binary size changes,
shared library size changes, and even filename length differences.

Cachegrind simulates a two-level cache (L1 + LLC) auto-detected from the host CPU.
Est. Cycles = L1 hits + 5 × LLC hits + 35 × RAM hits.
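
As a quick check of the formula, the sketch below recomputes Est. Cycles from the three hit counts. The function name estimated_cycles is made up for illustration; the weights 5 and 35 are the ones stated above.

```rust
/// Estimated cycles from Cachegrind hit counts, using the weights stated
/// in this report: 1 per L1 hit, 5 per LLC hit, 35 per RAM hit.
fn estimated_cycles(l1_hits: u64, llc_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + 5 * llc_hits + 35 * ram_hits
}
```

For the RegisterView get_set_100 row, 118,636 + 5 × 39 + 35 × 175 = 118,636 + 195 + 6,125 = 124,956, which matches the reported Est. Cycles value.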

Runner cache sizes: L1d cache: 96 KiB (2 instances); L1i cache: 64 KiB (2 instances); L2 cache: 2.5 MiB (2 instances); L3 cache: 48 MiB (1 instance)

Comment on lines +716 to +718
/// RocksDB error while writing a batch. Unlike [`RocksDbStoreInternalError::RocksDb`]
/// (which also covers read failures), this error means a batch write may or may not
/// have been applied, so the in-memory view must be reloaded from storage.

@ma2bd ma2bd Jun 19, 2026

Contributor

Because it runs in the same process, I don't think rocksdb has ambiguous writes. Please link the doc if you found something.

Contributor Author

Sorry, this was supposed to be a draft. A lot of this still needs to be verified.

Contributor

Np. (Probably) just revert this part

// error) the server may or may not have applied the batch, so the in-memory view
// must be reloaded from storage. These variants can also surface on read RPCs,
// where a reload is unnecessary but harmless; we err on the side of reloading.
matches!(self, Self::GrpcError(_) | Self::TransportError(_))

Contributor

this makes sense!

@ndr-ds ndr-ds marked this pull request as draft June 19, 2026 20:19