Archivers — Solana’s Solution to Petabytes of Blockchain Data Storage

by Solana Foundation

Learn more about 1 of the 8 innovations that make Solana the most performant blockchain in the world.

Solana is the most performant permissionless blockchain in the world. On current iterations of the Solana Testnet, a network of 200 physically distinct nodes supports a sustained throughput of more than 50,000 transactions per second when running with GPUs. Achieving this requires several optimizations and new technologies, and the result is a breakthrough in network capacity that signals a new phase in blockchain development.

There are 8 key innovations that make the Solana network possible.

In this blog post, we’ll explore Archivers, Solana’s distributed ledger store for petabytes of blockchain data. We were introduced to Proof of Replication (PoRep) by Filecoin in 2017. In 2018, we built our own version of PoRep for Solana using a VDF and optimized it for batch verification.

At full capacity, the Solana network will generate data at 1 gigabit per second, which over 365 days comes to roughly 4 petabytes every year. If every node in the network were required to store all that data, network membership would be limited to a centralized few that can maintain that kind of storage capacity. Our Proof of History technology can be leveraged to mitigate this problem by allowing a fast-to-verify implementation of Proof of Replication and enabling a BitTorrent-esque distribution of the ledger across millions of Archiver nodes around the world. Archivers are not consensus participants and have very low hardware requirements.
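As a quick sanity check on that figure, the arithmetic can be sketched in a few lines. This assumes the data rate means one gigabit (10^9 bits) per second and that a petabyte is 10^15 bytes; both are assumptions, chosen because that reading is the one that lands near 4 petabytes.

```python
# Back-of-the-envelope check of the yearly ledger size.
# Assumption: the quoted rate is one gigabit (10**9 bits) per second.
BITS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

bytes_per_year = BITS_PER_SECOND // 8 * SECONDS_PER_YEAR
petabytes_per_year = bytes_per_year / 10**15  # assumption: 1 PB = 10**15 bytes

print(f"{petabytes_per_year:.2f} PB per year")
```

Reading the rate as gigabytes per second instead would give a result eight times larger, so the gigabit interpretation is the one consistent with the quoted total.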

At a high level, the Solana Archiver network functions as follows: Archivers must signal to the network that they have X bytes of space available for storing data. Periodically, the network divides the ledger history into pieces to target some replication rate (currently we’re expecting a target rate around 100x) and fault tolerance (achieved with erasure coding), based on the number of Archiver identities and the total available storage of Archivers. Once Archiver:data assignments are made, each Archiver downloads its respective data from consensus validators. Periodically, Archivers will be challenged to prove they’re storing data, at which point they must complete a proof of replication (PoRep). Archivers are awarded ~3% of inflation for their efforts.
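To make the assignment step concrete, here is a minimal sketch of how ledger pieces might be spread across nodes to hit a target replication rate. The greedy strategy, the function name `assign_segments`, and the node names are all hypothetical; the post does not specify the actual assignment algorithm, and erasure coding is left out entirely.

```python
import heapq

def assign_segments(capacities, num_segments, replication):
    """Greedy sketch: give each ledger segment `replication` distinct nodes,
    always preferring the nodes with the most free slots.

    `capacities` maps a node identity to how many segments it offered to store.
    """
    free = dict(capacities)
    assignment = {}
    for seg in range(num_segments):
        # Nodes that still have room, most free space first (name breaks ties).
        candidates = heapq.nlargest(
            replication,
            (node for node in free if free[node] > 0),
            key=lambda node: (free[node], node),
        )
        if len(candidates) < replication:
            raise ValueError("not enough free storage for the target replication")
        for node in candidates:
            free[node] -= 1
        assignment[seg] = sorted(candidates)
    return assignment

# Hypothetical network: four nodes, four segments, 3x replication.
capacities = {"node-a": 4, "node-b": 3, "node-c": 3, "node-d": 2}
plan = assign_segments(capacities, num_segments=4, replication=3)
for seg, holders in plan.items():
    print(seg, holders)
```

A production scheme would presumably derive the assignment from on-chain randomness rather than from free capacity alone, so that nodes cannot choose which pieces they hold.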

Proof of Replication in Further Depth

The basic idea of Proof of Replication is to encrypt a dataset with a public symmetric key using CBC encryption, and then hash the encrypted dataset. This method is explained in detail in Filecoin’s Proof of Replication Technical Report. Unfortunately, the naive version of this approach is vulnerable to attack.
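The encrypt-then-hash idea can be illustrated with a small sketch. It uses a toy 4-round Feistel network built from SHA-256 as a stand-in for a real block cipher such as AES, and a 32-byte block instead of a realistic size; both are assumptions made so the example runs on the Python standard library alone. Only the CBC chaining and the final hash correspond to the description above.

```python
import hashlib

BLOCK = 32  # toy block size in bytes, chosen only to keep the demo small

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Keyed round function for the toy Feistel network.
    return hashlib.sha256(key + bytes([i]) + half).digest()[: BLOCK // 2]

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    """4-round Feistel network: an invertible toy cipher, NOT real AES."""
    left, right = block[: BLOCK // 2], block[BLOCK // 2 :]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def cbc_encrypt(key: bytes, iv: bytes, data: bytes) -> list[bytes]:
    """CBC chaining: XOR each plaintext block with the previous ciphertext."""
    assert len(iv) == BLOCK and len(data) % BLOCK == 0
    prev, out = iv, []
    for off in range(0, len(data), BLOCK):
        prev = toy_encrypt_block(key, _xor(data[off : off + BLOCK], prev))
        out.append(prev)
    return out

def naive_porep(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Hash the whole encrypted dataset, in order, to form the proof."""
    h = hashlib.sha256()
    for block in cbc_encrypt(key, iv, data):
        h.update(block)
    return h.digest()

ledger = bytes(range(256))  # stand-in for ledger data
print(naive_porep(b"identity-a", bytes(BLOCK), ledger).hex())
```

Because each ciphertext block depends on the one before it, two identities with different keys produce entirely different encrypted copies, which is what ties a proof to one specific replica.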

For example, a dishonest storage node can stream the encryption and delete the data as it is hashed. The simple solution is to force the hash to be done on the reverse of the encryption, or perhaps in a random order. This ensures that all the data is present during the generation of the proof, but it also requires the Validator to have the entirety of the encrypted data present to verify every proof of every identity. The space required to validate becomes (Number of CBC keys) * (data size).

We improve on this approach by randomly sampling the encrypted blocks at a faster pace than the speed of encryption, and recording the hash of those samples into the PoH ledger. Thus the blocks stay in the exact same order for every PoRep, and verification can stream the data and verify all the proofs in a single batch. This way, we can verify multiple proofs concurrently, each one on its own CUDA core.

With the current generation of graphics cards, the Solana network can support up to 1500 replication identities or symmetric keys per GPU card. The total space required for verification is (2 CBC blocks) * (Number of CBC keys), with a core count equal to (Number of CBC keys). A CBC block is expected to be 1MB in size.
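Plugging the quoted numbers into both space formulas shows how much the sampling approach saves. In this sketch the 1 GB slice size is a hypothetical value picked only for comparison; the 1500 keys and 1MB block size come from the text, with a megabyte taken as 10^6 bytes.

```python
MB = 10**6                 # assumption: decimal megabytes
CBC_BLOCK_SIZE = 1 * MB    # expected CBC block size
KEYS_PER_GPU = 1500        # replication identities per GPU card
SLICE_SIZE = 1000 * MB     # hypothetical size of one encrypted ledger slice

# Streaming verification: (2 CBC blocks) * (number of CBC keys)
streaming_bytes = 2 * CBC_BLOCK_SIZE * KEYS_PER_GPU

# Reverse-order or random-order hashing: (number of CBC keys) * (data size)
full_copy_bytes = KEYS_PER_GPU * SLICE_SIZE

print(f"streaming verification: {streaming_bytes / 10**9:.1f} GB")
print(f"full encrypted copies:  {full_copy_bytes / 10**9:.1f} GB")
print(f"reduction factor:       {full_copy_bytes // streaming_bytes}x")
```

The streaming figure is independent of the slice size, so the gap widens as the ledger slices grow.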

Next, we have to construct a game between Validators and Archivers that ensures that Archivers are generating proofs, and that validators are actually verifying PoReps.

To begin producing PoReps of the ledger, the Archiver client does the following:

  1. Clients sign a PoH hash at a regular period
  2. Signature is used as the source of randomness to pick a specific slice of the ledger
  3. Signature is used to create a symmetric CBC key, and the client encrypts the slice of the ledger with that key.
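The three steps above can be sketched as follows. HMAC-SHA256 stands in for the client’s real signature scheme so the example needs only the standard library, and the helper names `sign`, `pick_slice`, and `derive_cbc_key`, as well as the exact derivations, are hypothetical rather than taken from the Solana codebase.

```python
import hashlib
import hmac

def sign(secret: bytes, poh_hash: bytes) -> bytes:
    """Stand-in for a real signature: deterministic and unique per secret."""
    return hmac.new(secret, poh_hash, hashlib.sha256).digest()

def pick_slice(signature: bytes, num_slices: int) -> int:
    """Use the signature as randomness to choose which ledger slice to store."""
    return int.from_bytes(signature[:8], "little") % num_slices

def derive_cbc_key(signature: bytes) -> bytes:
    """Derive a 32-byte symmetric key from the same signature."""
    return hashlib.sha256(b"cbc-key" + signature).digest()

poh_hash = hashlib.sha256(b"example PoH entry").digest()  # placeholder value
for secret in (b"client-1-secret", b"client-2-secret"):
    signature = sign(secret, poh_hash)
    print(pick_slice(signature, num_slices=1024), derive_cbc_key(signature).hex()[:16])
```

Every client signs the same PoH hash, but each signature differs, so the slices and keys end up spread across the clients without any coordination.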

Since each client signs the same PoH hash, the signatures are randomly distributed between all the clients. Clients then continuously sample the encrypted slice:

  1. Clients sign a PoH hash at a regular period
  2. Signature is used as the source of randomness to sample 1 byte per 1MB of the slice.
  3. Samples are hashed with SHA256
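A minimal sketch of that sampling loop is shown below. The way the byte offset is derived from the signature is an assumption, and the demo shrinks the 1MB chunk to 1KB so it runs instantly; only the "one byte per chunk, hashed with SHA256" shape comes from the steps above.

```python
import hashlib

def sample_proof(encrypted: bytes, signature: bytes, chunk: int = 1 << 20) -> bytes:
    """Pick one pseudo-random byte from every `chunk` bytes and hash the samples."""
    h = hashlib.sha256()
    for index, start in enumerate(range(0, len(encrypted), chunk)):
        size = min(chunk, len(encrypted) - start)
        # Hypothetical derivation: the offset depends on signature and chunk index.
        seed = hashlib.sha256(signature + index.to_bytes(8, "little")).digest()
        offset = int.from_bytes(seed[:8], "little") % size
        h.update(encrypted[start + offset : start + offset + 1])
    return h.digest()

encrypted_slice = bytes(range(256)) * 64   # 16 KB stand-in for an encrypted slice
signature = hashlib.sha256(b"signed PoH hash").digest()  # placeholder signature
print(sample_proof(encrypted_slice, signature, chunk=1024).hex())
```

Since the offsets are unknown until the PoH hash is signed, a client that discarded part of the slice cannot predict which bytes it will later need.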

All the clients are forced to sign the same PoH hash value. Since the signature is tied to PoH, the resulting hash of samples is unique to that point in time and to that specific replication.

Validators in turn check the clients’ proofs:

  1. Validator declares how many PoReps it can verify, based on number of GPU cores
  2. Periodically validators will sign a PoH hash
  3. The signature is used to select a slice of the ledger to verify, and a mask to select which samples to verify up to the capacity of the validator
  4. Validator uploads the proofs that failed verification
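The validator side can be sketched in the same spirit. Here the "mask" is modeled by ranking submitted proofs by a hash of the validator’s signature and keeping the first `capacity` of them; that ranking rule, the function name `failed_proofs`, and the sample identities are hypothetical illustrations, not the protocol’s actual selection rule.

```python
import hashlib

def failed_proofs(signature, submitted, recompute, capacity):
    """Check up to `capacity` proofs chosen by the signature; return the failures.

    `submitted` maps a client identity to the proof it posted, and
    `recompute` returns the proof the validator derives for that identity.
    """
    ranked = sorted(
        submitted,
        key=lambda identity: hashlib.sha256(signature + identity.encode()).digest(),
    )
    chosen = ranked[:capacity]  # never verify more than the declared capacity
    return [identity for identity in chosen if submitted[identity] != recompute(identity)]

# Hypothetical data: five clients, one of which posts a bogus proof.
expected = {f"client-{i}": hashlib.sha256(f"slice-{i}".encode()).digest() for i in range(5)}
submitted = dict(expected)
submitted["client-3"] = bytes(32)

validator_signature = hashlib.sha256(b"validator signed PoH hash").digest()
print(failed_proofs(validator_signature, submitted, expected.__getitem__, capacity=5))
```

Because the selection depends on a signature over a PoH hash, clients cannot know ahead of time which of their proofs a given validator will check.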

A client can challenge a Validator over a failed proof, fishing for lazy validators. To prevent grinding attacks, clients must use the same keypair identity continuously. To prevent spam, all the messages in the protocol incur transaction fees. Archivers earn rewards based on the number of successfully submitted proofs. Validators earn a stake-weighted reward for verifying proofs, and fishermen earn a reward by taking a validator’s slashed coins when they publish proof that the validator approved a fake proof.

Learn more about Tour de SOL—Solana’s incentivized testnet event.

Solana’s Archivers, alongside innovations like Proof of History, Sealevel, and Gulf Stream, combine to create the world’s first web-scale blockchain. Solana’s testnet is live today. You can see it at https://testnet.solana.com. For cost purposes, we are only running a handful of nodes. However, we have spun it up on many occasions to over 200 physically distinct nodes (not on shared hardware) across 23 data centers on AWS, GCE, and Azure for benchmarking.

The runtime is functioning today, and developers can deploy code on the testnet now. Developers can build smart contracts in C today, and we are aggressively working on the Rust toolchain. Rust will be the flagship language for Solana smart contract development. The Rust toolchain is publicly available as part of the Solana JavaScript SDK, and we are further iterating on the Software Development Kit.

Solana will soon launch a public beta incentivizing Validators to run nodes via Tour de SOL — analogous to Cosmos’s Game of Stakes — that challenges the public at large to test the limits of the Solana network while earning tokens for doing so.