04-30-22 Solana Mainnet Beta Outage Report and Mitigation
At approximately 20:30 UTC on Saturday, April 30, Solana’s Mainnet Beta cluster ceased producing blocks as a result of stalled consensus. Over the next seven hours, validator operators worked to identify the point of furthest progress and collectively coordinated a restart of the network. Block production resumed at 03:30 UTC on Sunday, May 1, and network operators continued to restore client services over the next several hours.
What caused the outage?
An enormous number of inbound transactions (6 million per second) flooded the network, surpassing 100 Gbps of traffic at individual nodes. There is no evidence of a denial-of-service attack; instead, the evidence indicates that bots tried to programmatically win a new NFT being minted using the popular Candy Machine program. Since the mint price had a fixed floor rather than a dynamic Dutch auction, the first user to call the mint received the NFT, which created an economic incentive to send a huge number of transactions in hopes of winning the mint.
Consensus stalled because validators ran out of memory and crashed. The root cause of the high memory usage was that too few votes landed to finalize earlier blocks, which prevented the cleanup of abandoned forks. The number of forks validators had to evaluate exceeded their capacity to do so, even after a reboot, necessitating manual intervention.
What is being done?
Since early January, Solana has suffered intermittent congestion issues resulting from bot activity targeted at NFT mints. The previous outage of Mainnet Beta occurred in September 2021 and lasted for 17 hours. The outage of April 30 shares characteristics with the September outage, but this time the network continued to function even as transaction request volumes reached 10,000% of the September level, reflecting subsequent updates made by the validator community.
The beta release branch, v1.10, which is currently stabilizing on Testnet, includes memory use improvements to prolong the time nodes can endure slow or stalled consensus. Test nodes running v1.10 deployed on Mainnet Beta continued for 2000 additional slots beyond their similarly spec’d v1.9 peers.
Three mitigations are in the works to address the stability and resilience of the network.
- QUIC - Today, Solana uses a custom raw UDP-based protocol to pass transactions between RPC nodes and the current leader. Since UDP is connectionless and lacks both flow control and receipt acknowledgments, there is no meaningful way to discourage or mitigate abusive behavior. To effect control over network traffic, Solana core protocols are being reimplemented atop QUIC, a protocol built by Google and designed for fast asynchronous communication like UDP, but with sessions and flow control like TCP. Once adopted, there will be many more options available to adapt and optimize data ingestion.
- Stake-weighted quality of service (QoS) - Leader network bandwidth has a fixed capacity. To use it effectively, stake-weighting is a must, because it ends the current practice of indiscriminately accepting transactions on a first-come-first-served basis without regard for source. Given that Solana is a PoS network, extending the utility of stake-weighting to transaction quality of service is a natural choice. Under this model, a node with 0.5% stake will have the right to transmit at least 0.5% of the packets to the leader, and neither the rest of the network nor any combination of the remaining stake will be able to fully wash them out. Stake-weighted QoS is being developed in parallel with QUIC today, and it will be more robust in conjunction with QUIC.
- Fee-based execution priority - Once ingested, transactions can still contend to modify shared account data. This contention has been handled on a simple first-come-first-served basis, similar to network data ingestion, leaving users no means to express the urgency of their transactions’ execution. Given that anyone can submit transactions to the network, stake-weighting is not suitable for this prioritization. Instead, a new instruction is being introduced into the Compute Budget program, offering users the ability to specify an arbitrary “additional fee” to be collected upon execution of the transaction and its inclusion in a block. The ratio of this fee to the requested compute units will serve as a transaction’s execution priority weight. Additional fees will be treated identically to the base fee today.
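To make the stake-weighted QoS guarantee concrete, here is a minimal sketch of how a leader might split a fixed packet capacity in proportion to stake. It is an illustration only, not the validator's actual implementation: the function names `packet_budgets` and `admit`, the integer stake units, and the capacity figure are all assumptions made for this example.

```python
from typing import Dict


def packet_budgets(stakes: Dict[str, int], leader_capacity: int) -> Dict[str, int]:
    """Split a leader's packet capacity among nodes in proportion to stake.

    `stakes` maps a (hypothetical) node name to its stake in integer units.
    """
    total = sum(stakes.values())
    if total <= 0:
        raise ValueError("total stake must be positive")
    # Integer arithmetic keeps the shares exact; any remainder is left unused.
    return {node: leader_capacity * stake // total for node, stake in stakes.items()}


def admit(packets_sent: Dict[str, int], budgets: Dict[str, int]) -> Dict[str, int]:
    """Admit at most each node's budget, regardless of what other nodes send."""
    return {node: min(sent, budgets.get(node, 0)) for node, sent in packets_sent.items()}


# A node holding 0.5% of stake versus the remaining 99.5% (illustrative numbers).
stakes = {"small": 5, "large": 995}
budgets = packet_budgets(stakes, leader_capacity=100_000)

# Even if the large node floods the leader, the small node keeps its share.
admitted = admit({"small": 400, "large": 10_000_000}, budgets)
print(budgets)
print(admitted)
```

In this sketch the small node's budget is 0.5% of the leader's capacity, and flooding by the other node only exhausts that node's own budget.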
Fees are coming to Solana
Fees on Solana are not the same as global fee markets on Ethereum. On Solana there is less competition for blockspace than for the ability to write to a specific piece of state, such as a minting contract for an NFT. Solana transactions do not interact with all of the Solana state; instead, they specify which state they need to read and which state they need to write. Reads can overlap and be parallelized; writes cannot. This is generally called the database “hotspot” problem, which state auctions are designed to solve. For example, if there are 10 seconds’ worth of work that needs to be written to the same state, but only 1 second is available, there needs to be some mechanism that prioritizes which work is done and which work fails. Since transactions are packed into blocks by “highest fee/compute unit” priority, the higher-paying transactions will get to write first if validators are acting in their self-interest.
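The hotspot example above can be sketched as a small scheduling exercise. This is a simplified illustration under stated assumptions, not the actual scheduler: the `Tx` type, the `schedule_hotspot` function, and the compute-unit budget for the contended account are hypothetical, and the priority is taken to be the additional fee divided by requested compute units, as described earlier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tx:
    name: str
    additional_fee: int   # hypothetical fee units
    compute_units: int    # requested compute units


def priority(tx: Tx) -> float:
    """Execution priority weight: additional fee per requested compute unit."""
    return tx.additional_fee / tx.compute_units


def schedule_hotspot(txs: List[Tx], budget: int) -> Tuple[List[Tx], List[Tx]]:
    """All txs write the same account; only `budget` compute units fit this block."""
    included, left_out = [], []
    remaining = budget
    for tx in sorted(txs, key=priority, reverse=True):
        if tx.compute_units <= remaining:
            included.append(tx)
            remaining -= tx.compute_units
        else:
            left_out.append(tx)
    return included, left_out


txs = [
    Tx("a", additional_fee=10, compute_units=100),  # priority 0.10
    Tx("b", additional_fee=50, compute_units=100),  # priority 0.50
    Tx("c", additional_fee=30, compute_units=200),  # priority 0.15
]
included, left_out = schedule_hotspot(txs, budget=300)
print([t.name for t in included], [t.name for t in left_out])
```

With more work requested (400 compute units) than the account can absorb (300), the transaction with the lowest fee per compute unit is the one that does not get to write.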
By comparison, Ethereum uses fee markets to allocate a scarce resource, namely blockspace. Solana’s fee prioritization should only impact the specific state, and not the whole block. This creates a system akin to ‘neighborhood fees’ instead of ‘global fees.’ Subsequent transactions that pay a higher fee but can’t fit into this block, because they have hit the maximum limit on writes to an account, are spilled and scheduled for the next block. Other transactions that interact with other accounts can still be added to the same block, even if they pay lower fees.
Fee prioritization work is in progress, and is targeted for the v1.11 release.
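A rough sketch of the ‘neighborhood fees’ idea follows. It assumes a per-account write limit and a whole-block limit, both measured in compute units; these limits, the `MintTx` type, and the `pack_block` function are hypothetical names and values chosen for illustration, not the real block-packing logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MintTx:
    name: str
    fee: int
    compute_units: int
    writes: List[str]  # accounts this transaction needs to write


def pack_block(
    txs: List[MintTx], account_limit: int, block_limit: int
) -> Tuple[List[MintTx], List[MintTx]]:
    """Pack by fee per compute unit, capping writes per account and per block."""
    block, spilled = [], []
    used_per_account: Dict[str, int] = {}
    used_block = 0
    for tx in sorted(txs, key=lambda t: t.fee / t.compute_units, reverse=True):
        over_account = any(
            used_per_account.get(acct, 0) + tx.compute_units > account_limit
            for acct in tx.writes
        )
        if over_account or used_block + tx.compute_units > block_limit:
            spilled.append(tx)  # retried in the next block
            continue
        block.append(tx)
        used_block += tx.compute_units
        for acct in tx.writes:
            used_per_account[acct] = used_per_account.get(acct, 0) + tx.compute_units
    return block, spilled


txs = [
    MintTx("m1", fee=100, compute_units=100, writes=["mint"]),
    MintTx("m2", fee=90, compute_units=100, writes=["mint"]),
    MintTx("m3", fee=80, compute_units=100, writes=["mint"]),
    MintTx("t1", fee=1, compute_units=100, writes=["wallet"]),
]
block, spilled = pack_block(txs, account_limit=200, block_limit=1000)
print([t.name for t in block], [t.name for t in spilled])
```

Here the third mint transaction is spilled even though it pays far more than the unrelated transfer, which still lands in the block because it writes to a different account.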
Editor's note June 16, 2022: This article has been updated with the term stake-weighted quality of service.