9-14 Network Outage Initial Overview
A detailed technical post-mortem and root cause analysis report is in process by the community, and will be released in the coming weeks.
On September 14th, the Solana network was offline for 17 hours. No funds were lost, and the network returned to full functionality in under 24 hours. Solana is designed for adversarial conditions. Over the years the community has developed robust tools and processes for trustless recovery, and has had several practice runs.
The cause of the network stall was, in effect, a denial of service attack. At 12:00 UTC, Grape Protocol launched their IDO on Raydium, and bots generated transactions that flooded the network. These transactions created a memory overflow, which caused many validators to crash, forcing the network to slow down and eventually stall. The network went offline when the validators could not come to agreement on the current state of the blockchain, which prevented the network from confirming new blocks.
At 12:11 UTC, the validator community noticed the transaction spike and network slowdown. The community took steps to help the network recover, but these were unsuccessful. The bot transactions flooded a system known as the forwarder queue, causing the memory used by this queue to grow without limit. The transactions that were encoded into blocks were resource-heavy to process. The combination of the unbounded growth of the forwarder queues and resource-heavy blocks caused block producers to automatically propose a number of forks. The validator processes started to run out of memory and crash, and upon restart the validators were unable to process all the proposed forks in time to catch back up with the rest of the network.
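To illustrate why an unbounded queue is dangerous under a transaction flood, here is a minimal Python sketch of a queue with a fixed capacity. It is not the validator's actual code or the patch that was shipped. The class name BoundedForwardQueue, the capacity value, and the policy of evicting the oldest entry are all assumptions chosen for illustration.

```python
from collections import deque
from typing import Optional


class BoundedForwardQueue:
    """Hypothetical forwarding buffer that never holds more than `capacity` items."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen silently evicts the oldest entry when full,
        # so memory use is capped no matter how fast transactions arrive.
        self._items: deque = deque(maxlen=capacity)
        self.dropped = 0  # number of transactions evicted so far

    def push(self, tx: bytes) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the append below evicts the oldest entry
        self._items.append(tx)

    def pop(self) -> Optional[bytes]:
        # Return the oldest pending transaction, or None if the queue is empty.
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


# Simulate a burst of 10 transactions hitting a queue that holds only 3.
queue = BoundedForwardQueue(capacity=3)
for i in range(10):
    queue.push(bytes([i]))
```

After the burst, the queue still holds only three transactions and has counted seven evictions. The trade-off is that some transactions are dropped under load, in exchange for a hard upper limit on memory.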
Once the situation was diagnosed, the community proposed a hard fork of the network from the last confirmed slot, which requires at least 80% of active stake to come to consensus. Over the next 14 hours, engineers from across the globe worked together to write code to mitigate the issue and to coordinate an upgrade-and-restart of the network among more than 1,000 validators. This effort was led by the community, using the restart framework outlined in the protocol documentation. Validators opted to apply the upgrade locally, verify the ledger, and continue producing blocks. Consensus was reached, and the network was upgraded within two and a half hours of the patch being released. The validator network restored full functionality at 05:30 UTC, under 18 hours after the network stalled.
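The 80% requirement is measured in stake, not in the number of validators. The short Python sketch below shows what such a stake-weighted check could look like. The function name, validator labels, and stake amounts are invented for illustration and do not reflect the real restart tooling.

```python
def restart_threshold_met(
    stake_by_validator: dict[str, int],
    ready: set[str],
    threshold_percent: int = 80,
) -> bool:
    """Return True if the validators in `ready` hold at least
    `threshold_percent` of the total active stake."""
    total_stake = sum(stake_by_validator.values())
    if total_stake == 0:
        return False
    ready_stake = sum(
        stake for name, stake in stake_by_validator.items() if name in ready
    )
    # Integer arithmetic avoids floating-point rounding at the exact boundary.
    return ready_stake * 100 >= threshold_percent * total_stake


# Hypothetical stake distribution: three validators holding 50, 30 and 20 units.
stakes = {"validator-a": 50, "validator-b": 30, "validator-c": 20}

enough = restart_threshold_met(stakes, {"validator-a", "validator-b"})      # 80 of 100
not_enough = restart_threshold_met(stakes, {"validator-a", "validator-c"})  # 70 of 100
```

In this example, two of three validators are ready in both cases, but only the first pair holds enough stake to meet the threshold.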
One of the biggest benefits of blockchains is that, even in a complete liveness failure, whatever the cause, the validators are individually responsible for recovering the state and continuing the chain without relying on a trusted third party. On a decentralized network, each validator works to bring it back and has its work verified by everyone else. This was a coordinated effort by the community, not only in creating a patch, but in getting validators representing 80% of active stake to come to consensus. There’s a big difference between an outage like this happening on a centralized network (like Amazon Web Services) and on a decentralized network like Solana. If AWS crashes, users have to trust Amazon to bring it back to the right state. On any blockchain, the credit and the obligation for restoring network operations are in the hands of the community. Operators across the world worked together to reach a solution and restore functionality.
Thank you to the validator community, engineers, and the whole Solana ecosystem for coming together to fix this problem. On the rare occasions when issues like this happen, it’s disruptive to everyone, and when you need to fix something on a decentralized network, it’s a true community project.