Cloudbreak — Solana’s Horizontally Scaled State Architecture
Solana is the most performant permissionless blockchain in the world. On current iterations of the Solana Testnet, a network of 200 physically distinct nodes supports a sustained throughput of more than 50,000 transactions per second when running with GPUs. Achieving this requires several optimizations and new technologies, and the result is a breakthrough in network capacity that signals a new phase in blockchain development.
There are 8 key innovations that make the Solana network possible:
- Proof of History (PoH) — a clock before consensus;
- Tower BFT — a PoH-optimized version of PBFT;
- Turbine — a block propagation protocol;
- Gulf Stream — a mempool-less transaction forwarding protocol;
- Sealevel — a parallel smart contracts runtime;
- Pipelining — a transaction processing unit for validation optimization;
- Cloudbreak — a horizontally scaled accounts database; and
- Archivers — distributed ledger storage.
In this blog post, we’ll go over Cloudbreak, Solana’s horizontally scaled state architecture.
Overview: RAM, SSDs, and Threads
When scaling a blockchain without sharding, it is not enough to scale computation alone. The memory used to keep track of accounts quickly becomes a bottleneck in both size and access speed. For example, it's generally understood that LevelDB, the local database engine many modern chains use, cannot support more than about 5,000 TPS on a single machine. That's because the virtual machine cannot exploit concurrent read and write access to the account state through the database's abstractions.
A naive solution is to maintain the global state in RAM. However, it's not reasonable to expect consumer-grade machines to have enough RAM to store the global state. The next option is SSDs. While SSDs reduce the cost per byte by 30x or more, they are roughly 1,000x slower than RAM. Below is the datasheet from the latest Samsung SSD, one of the fastest SSDs on the market.
A single-spend transaction needs to read 2 accounts and write to 1. Account keys are cryptographic public keys; they are effectively random and exhibit no data locality. A user's wallet will contain many account addresses, and the bits of each address are completely unrelated to those of any other. Because there is no locality between accounts, it is impossible to place them in memory such that related accounts are likely to be near each other.
With a maximum of 15,000 unique reads per second, a naive single-threaded implementation of an accounts database using a single SSD will support up to 7,500 transactions per second. Modern SSDs support 32 concurrent threads, and can therefore sustain 370,000 reads per second, or roughly 185,000 transactions per second.
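The arithmetic above can be sketched as a quick back-of-the-envelope calculation (the reads-per-second figures come directly from the text; this is purely illustrative):

```python
# Back-of-the-envelope throughput math from the figures above.
READS_PER_TX = 2  # a single-spend transaction reads 2 accounts

# Single-threaded: one SSD at ~15,000 unique reads per second.
single_thread_reads = 15_000
single_thread_tps = single_thread_reads // READS_PER_TX
print(single_thread_tps)  # 7500

# With 32 concurrent threads, the same SSD sustains ~370,000 reads/s.
concurrent_reads = 370_000
concurrent_tps = concurrent_reads // READS_PER_TX
print(concurrent_tps)     # 185000
```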
The guiding design principle at Solana is to design software that gets out of the way of the hardware to allow 100% utilization.
Organizing the database of accounts such that concurrent reads and writes are possible across 32 threads is a challenge. Vanilla open source databases like LevelDB become a bottleneck because they are not optimized for this access pattern in a blockchain setting. Solana does not use a traditional database to solve these problems. Instead, we use several mechanisms provided by the operating system.
First, we leverage memory-mapped files. A memory-mapped file is a file whose bytes are mapped into the virtual address space of a process. Once a file has been mapped, it behaves like any other memory. The kernel may keep some or none of it cached in RAM, but the amount of addressable memory is limited by the size of the disk, not the RAM. Reads and writes are, of course, still bound by the performance of the disk.
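A minimal illustration of the idea, using Python's standard `mmap` module rather than Solana's actual Rust implementation: once mapped, the file's bytes can be read and written like an ordinary in-memory buffer.

```python
import mmap
import tempfile

# Create a scratch file and map it into the process's address space.
# After mapping, the kernel pages bytes in and out of RAM on demand,
# so physical memory use is bounded by the disk, not by RAM.
f = tempfile.TemporaryFile()
f.truncate(4096)                  # reserve one page on disk
mm = mmap.mmap(f.fileno(), 4096)  # map the file into virtual memory

mm[0:5] = b"hello"                # write through the mapping, like a bytearray
data = bytes(mm[0:5])             # read it back like ordinary memory

mm.close()
f.close()
print(data)  # b'hello'
```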
The second important design consideration is that sequential operations are much faster than random operations. This is true not just for SSDs, but for the entire virtual memory stack. CPUs are great at prefetching memory that is accessed sequentially, and operating systems are great at handling sequential page faults. To exploit this behavior we break up the accounts data structure roughly as follows:
- The index of accounts and forks is stored in RAM.
- Accounts are stored in memory-mapped files up to 4MB in size.
- Each memory map only stores accounts from a single proposed fork.
- Maps are randomly distributed across as many SSDs as are available.
- Copy-on-write semantics are used.
- Writes are appended to a random memory map for the same fork.
- The index is updated after each write is completed.
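The layout above can be sketched as a toy model (illustrative Python, not Solana's Rust implementation; all names here are invented, and real storage would be the memory-mapped files described above rather than in-memory lists):

```python
import random
from collections import defaultdict

class AccountsDB:
    """Toy model of the Cloudbreak layout: an in-RAM index maps
    (fork, pubkey) to the append-only segment holding the latest
    account version; writes are copy-on-write appends."""

    def __init__(self, num_ssds=2, maps_per_ssd=4):
        # Per fork: a pool of append-only segments, standing in for
        # memory maps randomly distributed across the available SSDs.
        self.segments_per_fork = num_ssds * maps_per_ssd
        self.storage = defaultdict(
            lambda: [[] for _ in range(self.segments_per_fork)])
        # In-RAM index: (fork, pubkey) -> (segment id, offset).
        self.index = {}

    def store(self, fork, pubkey, account):
        # Copy-on-write: never overwrite in place. Append the new
        # version to a random segment for this fork, then update the
        # index only after the write completes.
        seg_id = random.randrange(self.segments_per_fork)
        segment = self.storage[fork][seg_id]
        segment.append(account)
        self.index[(fork, pubkey)] = (seg_id, len(segment) - 1)

    def load(self, fork, pubkey):
        seg_id, offset = self.index[(fork, pubkey)]
        return self.storage[fork][seg_id][offset]

db = AccountsDB()
db.store(fork=1, pubkey="alice", account={"lamports": 10})
db.store(fork=1, pubkey="alice", account={"lamports": 7})  # new version appended
print(db.load(1, "alice"))  # {'lamports': 7} — index points at the latest copy
```

Because every write is an append to some segment, each individual disk only ever sees sequential writes, while the random choice of segment spreads concurrent writes across all available SSDs.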
Since account updates are copy-on-write and are appended to a random SSD, Solana gets the benefits of sequential writes plus horizontal scaling of writes across many SSDs for concurrent transactions. Reads are still random access, but since any given fork's state updates are spread across many SSDs, the reads end up horizontally scaled as well.
Cloudbreak also performs a form of garbage collection. As forks become finalized beyond rollback and accounts are updated, old invalid accounts are garbage collected, and memory is relinquished.
There’s at least one more great benefit of this architecture: computing the Merkle root of the state updates for any given fork can be done with sequential reads that are horizontally scaled across SSDs. The drawback of this approach is the loss of generality to the data. Since this is a custom data structure, with custom layout, we are unable to use general-purpose database abstractions for querying and manipulating the data. We had to build everything from the ground up. Fortunately, that’s done now.
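A minimal sketch of the Merkle computation, assuming plain SHA-256 pairwise hashing (this is not Solana's actual hashing scheme, just the general shape): each append-only file can be leaf-hashed in one sequential pass, and the results combined into the fork's root.

```python
import hashlib

def merkle_root(leaves):
    """Pairwise-hash a list of leaf hashes up to a single root."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)  # copy so the caller's list isn't mutated
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Leaf hashes come from one sequential read over each account file.
accounts = [b"account-0", b"account-1", b"account-2"]
leaves = [hashlib.sha256(a).digest() for a in accounts]
root = merkle_root(leaves)
print(root.hex())
```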
While the accounts database fits in RAM, we see throughput that matches RAM access speeds and scales with the number of available cores. At 10 million accounts, the database no longer fits in RAM; even then, we still see nearly 1 million reads or writes per second on a single SSD.
Cloudbreak, alongside innovations like Proof of History, Sealevel, and Tower BFT, combines to create the world's first web-scale blockchain. Solana's testnet is live today. You can see it at https://testnet.solana.com. For cost purposes, we are only running a handful of nodes. However, we have scaled it up to over 200 physically distinct nodes (not on shared hardware) across 23 data centers on AWS, GCE, and Azure for benchmarking.
Solana will soon launch a public beta incentivizing validators to run nodes via Tour de SOL — analogous to Cosmos’ Game of Stakes — that challenges the public at large to test the limits of the Solana network while earning tokens for doing so.