06-01-22 Solana Mainnet Beta Outage Report
At approximately 16:30 UTC on Wednesday, June 1, Solana’s Mainnet Beta cluster ceased producing blocks as a result of stalled consensus, caused by a bug in the durable nonce transactions feature. Over the next four and a half hours, validator operators worked to identify the point of furthest progress and collectively restarted the network, with durable nonce transactions temporarily disabled as part of the restart. Block production resumed at 21:00 UTC the same day, and network operators continued to restore client services over the next several hours.
Note: This issue was unrelated to improvements in 1.10 and 1.11. You can read more about these forthcoming improvements here.
What caused the outage?
A runtime bug in the durable nonce transactions feature allowed a failed durable nonce transaction, under a specific set of circumstances, to be processed twice. This led to nondeterminism: when a validator processed the transaction a second time, some nodes rejected the resulting block while others accepted it. Critically, more than 33% of validators accepted the block, but that number fell short of the 66% required to reconcile the nondeterminism.
How are nonce transactions supposed to work, and how are they different from normal transactions?
Solana utilizes parallel processing of non-overlapping transactions to greatly improve throughput. Networks that process transactions serially can use an incrementing nonce; Solana uses a different method to ensure transactions are not processed twice. For normal transactions, which make up over 99.99% of transactions on the Solana blockchain, the network utilizes a recent block hash and maintains a record of processed transactions within that window to ensure that duplicates are not processed.
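The recent-blockhash check described above can be sketched as a toy model. This is an illustration only, not Solana's implementation: the class name, the window size, and the string hashes and signatures are all hypothetical.

```python
from collections import deque


class RecentBlockhashDeduper:
    """Toy model: a transaction is accepted only if it references a recent
    blockhash and its signature has not been seen within that window."""

    def __init__(self, window_size):
        self.window_size = window_size  # hypothetical window length, in blocks
        self.recent = deque()           # recent blockhashes, oldest first
        self.seen = {}                  # blockhash -> signatures already processed

    def register_blockhash(self, blockhash):
        self.recent.append(blockhash)
        self.seen[blockhash] = set()
        # Expire the oldest blockhash together with its record of signatures.
        if len(self.recent) > self.window_size:
            expired = self.recent.popleft()
            del self.seen[expired]

    def try_process(self, signature, blockhash):
        if blockhash not in self.seen:
            return "rejected: blockhash not recent"
        if signature in self.seen[blockhash]:
            return "rejected: already processed"
        self.seen[blockhash].add(signature)
        return "processed"


dedup = RecentBlockhashDeduper(window_size=2)
dedup.register_blockhash("hash_a")
first = dedup.try_process("sig_1", "hash_a")   # accepted
second = dedup.try_process("sig_1", "hash_a")  # duplicate inside the window
dedup.register_blockhash("hash_b")
dedup.register_blockhash("hash_c")             # hash_a leaves the window here
third = dedup.try_process("sig_1", "hash_a")   # blockhash has expired
```

In this model a duplicate is caught while its blockhash is in the window, and any transaction referencing an expired blockhash is rejected outright, so the record of processed signatures only needs to cover the window.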
Because durable nonce transactions are designed to not expire, they require a different mechanism to prevent double processing, and are processed serially. Such transactions use an on-chain value specific to each account that is rotated every time a durable nonce transaction is processed. After the value is rotated, the same durable nonce transaction should not be able to be processed again.
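As a rough sketch of the intended behavior, the rotation can be modeled as follows. The names and the hash-based rotation rule are assumptions made for illustration and are not taken from the Solana runtime.

```python
import hashlib


class NonceAccount:
    """Toy model of an account holding a stored nonce value."""

    def __init__(self, seed):
        self.value = hashlib.sha256(seed.encode()).hexdigest()

    def advance(self):
        # Hypothetical rotation rule: derive the next value from the current one.
        self.value = hashlib.sha256(self.value.encode()).hexdigest()


def process_durable_nonce_tx(account, tx_nonce):
    """Accept the transaction only if it references the current nonce value,
    then rotate the value so the same transaction cannot be replayed."""
    if tx_nonce != account.value:
        return False
    account.advance()
    return True


account = NonceAccount("example-seed")
signed_nonce = account.value  # value the transaction was signed against
first_attempt = process_durable_nonce_tx(account, signed_nonce)
replay_attempt = process_durable_nonce_tx(account, signed_nonce)
```

Here `first_attempt` is `True` and `replay_attempt` is `False`, because the stored value no longer matches the one the transaction references.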
The processing of a durable nonce transaction in a specific set of circumstances revealed a bug in the runtime that prevented the network from advancing. This bug required the durable nonce transaction to have failed, and would not have been triggered by a successful transaction. The first pass unfolded as follows:
- The durable nonce transaction was processed while its blockhash was still recent enough for the transaction to be processed as a normal transaction
- Seeing a recent blockhash, the runtime assumed it was processing a normal transaction, not a durable nonce transaction
- This transaction failed and, since it was not processed as a durable nonce transaction, processing did not advance the on-chain nonce value as intended
- Because the failed transaction was successfully added to a block, its transaction fees were paid
- After the durable nonce transaction was processed once and failed, it could still be processed again as a durable nonce transaction, because the nonce value it referenced had not been advanced and was still usable
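The sequence above can be illustrated with a small model. It is a sketch under stated assumptions, not the actual runtime code: the function and field names are hypothetical, the transaction's blockhash field is assumed to double as the nonce value it references, and only the failed-transaction case from the report is modeled.

```python
def process_tx_buggy(tx, recent_blockhashes, nonce_account):
    """Toy model of the misclassification described above (failed case only)."""
    if tx["blockhash"] in recent_blockhashes:
        # The blockhash is still recent, so the transaction is handled as a
        # normal transaction: fees are charged, but the nonce is not advanced.
        return {"path": "normal", "fee_charged": True, "nonce_advanced": False}
    if tx["blockhash"] == nonce_account["value"]:
        # Durable nonce path: rotate the stored value (hypothetical rule).
        nonce_account["value"] = "rotated:" + nonce_account["value"]
        return {"path": "durable_nonce", "fee_charged": True, "nonce_advanced": True}
    return {"path": "rejected", "fee_charged": False, "nonce_advanced": False}


nonce_account = {"value": "hash_x"}
failed_tx = {"blockhash": "hash_x"}  # signed against the current nonce value

# First pass: hash_x is still inside the recent-blockhash window.
first = process_tx_buggy(failed_tx, {"hash_w", "hash_x"}, nonce_account)

# The stored nonce is unchanged, so the same transaction remains usable.
still_usable = failed_tx["blockhash"] == nonce_account["value"]

# Later pass: hash_x has aged out, and the transaction is accepted again.
second = process_tx_buggy(failed_tx, {"hash_y", "hash_z"}, nonce_account)
```

In this model the first pass takes the normal path and leaves the nonce untouched, which is what lets the second pass go through the durable nonce path with the very same transaction.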
After the failed transaction was processed, but before the nonce was used again, the user resubmitted the same transaction for processing. This resubmission activated the bug in the runtime.
- The failed durable nonce transaction was re-submitted to the cluster
- The block producer incorrectly accepted this transaction into the block it was building, because the nonce value on-chain had not been advanced
- When validators validated the block, a portion found that it included a transaction that had already been processed, namely the resubmitted durable nonce transaction
- One set of validators rejected the block, while another set accepted it because the previous instance of the transaction was no longer in their cache of recently processed transactions
- Critically, more than 33% of validators accepted the block, but that number fell short of the 66% required to reconcile the nondeterminism
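The arithmetic behind the stall can be sketched with illustrative numbers. The validator names and weights below are hypothetical and are not the actual figures from the incident. The 66% threshold is the one stated in this report, and weighting votes by stake is an assumption of this model.

```python
SUPERMAJORITY_PERCENT = 66  # threshold as stated in the report

# Hypothetical validators: weight, and whether the first instance of the
# transaction is still in the validator's cache of processed transactions.
validators = [
    {"name": "v1", "stake": 25, "cache_has_first_instance": True},
    {"name": "v2", "stake": 35, "cache_has_first_instance": True},
    {"name": "v3", "stake": 40, "cache_has_first_instance": False},
]


def vote(validator):
    # A validator that still remembers the first instance sees a duplicate
    # and rejects the block; one that has evicted it accepts the block.
    return "reject" if validator["cache_has_first_instance"] else "accept"


def tally(validators):
    total = sum(v["stake"] for v in validators)
    accepted = sum(v["stake"] for v in validators if vote(v) == "accept")
    accept_pct = 100 * accepted // total
    return accept_pct, 100 - accept_pct


accept_pct, reject_pct = tally(validators)
# Consensus stalls when neither outcome reaches the supermajority.
stalled = accept_pct < SUPERMAJORITY_PERCENT and reject_pct < SUPERMAJORITY_PERCENT
```

With these weights, 40% accept and 60% reject, so neither side reaches the threshold and `stalled` is `True`.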
What is being done?
The durable nonce transaction feature was disabled in releases v1.9.28/v1.10.23 to prevent the network from halting if the same situation were to arise again. Durable nonce transactions will not be processed until the mitigation has been applied and the feature has been re-activated in a forthcoming release.