SF Blockchain Week — Attacks and Security
Hack the blockchain workshop
On October 9th, Solana CEO Anatoly Yakovenko took to the podium to discuss attack vectors and security issues that affect blockchain and decentralized application development.
Check out the video and raw transcript of the talk below:
Solana GitHub: https://github.com/solana-labs
Anatoly: Hey everyone, I am the founder of Solana. It’s a high-performance blockchain. And this is our latest testnet perf release. We can do about 120,000 transactions per second steady-state without blowing up. And it peaks sometimes around 400,000 to 500,000. And these are atomic operations that are cryptographically signed, so each one either happens or it doesn’t. It’s not batching any of the transactions together.
So, how do we make this work? This talk is actually about security, but I can give you guys a brief intro about how everything works. So, the basic premise or basic innovation isn’t in the consensus layer at all. It’s introducing a verifiable delay function into the ledger data structure itself. So, the ledger, like every append-only ledger, just like proof of work, is a forking data structure. So, this data structure gets split up and is created all over the place by everyone that’s trying to compete to <inaudible>. So, when you add a VDF into this data structure, it kind of starts acting like a water clock. So, imagine water dripping and data rising. In a similar way this VDF grows a proof of time passing in every branch and every partition of the data structure before consensus. And this sense of time becomes a synchronous clock for the entire network. So, when you have a synchronous clock, you can solve a lot of distributed systems problems that have been solved over the last 40 years. So, using this synchronous clock we can simplify messaging overhead when we need to actually reach consensus.
Hacking the Blockchain: Attacks & Security
And intuitively you can think of it this way, imagine if I send messages to everybody in this room, a bunch of messages, they all arrived out of order. If you can actually put them back together in the same order and you can trust that order of events. And more so, you can trust the timestamps in those events and get a sense of time in the network, when all those events occurred. You can do that without witnessing them directly by just looking at this, at the timestamps. And you don’t have to talk to each other. And you can come to conclusion about the state of all the active nodes in the network without actually exchanging any information between each other. And then you can optimistically say, “Hey, this ledger looks okay. I’m gonna vote on it.” And then you can observe your vote in the future version of the ledger and keep rolling this kind of, this append-only system. So, given all that, that’s how it works. Basic premise, it’s all open source code. Apache 2.0, so, please steal our code, fix our bugs. And I will be talking about some of those bugs.
So, this topic is about security. So, I spent most of my career working on mobile operating systems. If you guys ever had a CDMA flip phone like a Motorola RAZR, they ran this OS called BREW. I was the core BREW kernel developer. And this was an OS written in C with hand-rolled C++-compatible vtables, running in a single address space without memory protection, but with downloadable external applications by untrusted third parties. So, how do we make this work? With a lot of blood, sweat, and tears, and code signing. So, in the blockchain world, the way I think of this project is that I’m building an operating system. The actual state of the chain, like the state machine, that’s the kernel, that’s what we want to protect. We also want to allow users and applications and everybody to interact with it.
There’s a lot of interesting challenges there. So, we need to protect users. In building an operating system you’re not really just building a kernel; you have to build APIs that are secure. Like the Parity bug: some developer was just experimenting, right, and locked up hundreds of millions of dollars’ worth of value on Ethereum.
Users constantly do the wrong thing. I think if you guys have been paying attention, a miner had a misconfigured mining node that was censoring transactions on Bitcoin, and people were, you know, all up in arms about them trying to attack Bitcoin, but this was just a misconfigured node, right? These are just very simple mistakes that people can make that have huge consequences. And when they’re tied to smart contracts, those consequences have huge financial impacts.
Applications need to be protected from each other and from users. Like the DAO hack, that was the first kind of big hack in Ethereum, or the big bug. And this is unintended code that was really, you know, code is law, right? You build something and then it works in a totally busted way. The other thing is, right, we’re building a distributed system, so nodes themselves are part of the resource we need to protect. Stakes can be slashed. So, any kind of system that has slashing, that’s an attack vector.
Nodes need to actually trust the network. We actually need to boot up and verify that the network we’re talking to is the one we expect. So, there’s attestation. This is a very complicated thing. No one actually has a really good solution for this. Nodes have to be protected from DDoS. There was a…again, a recent Bitcoin bug that could have created a denial of service attack if the miner wanted to spend $80,000. So, if you’re shorting Bitcoin, that might not be a very large amount of money. The network itself, it’s a distributed system. We need to actually protect it from partitions. So, the consensus mechanisms actually have to come back together no matter what we do. We’re building a proof of stake system. As far as I’m concerned, these are all still unproven. So, there’s a lot of challenges in deciding what choices we make, right? And how those will have impact later on and whether things will actually work. There’s a lot of technical problems still to solve.
Also, right, we’re building an economy. I’m not an economist; I’m an operating systems engineer. How do we balance all these things, especially in stake-based systems where there are voters, and voters are typically apathetic? I don’t know if any of you guys own any stock. I have never voted in any corporate election. So, who is going to actually vote in any of these systems is very interesting. I think EOS’s approach, while on the outside it looks quite broken, 21 validators, right, it’s not a decentralized system. I can only keep track of maybe 20 things in my head at a time, so simplifying it to a set that a human can actually manage in their brain I think is an interesting design choice. Because we’re still bound in all our design choices by what humans can do. So, I don’t know if you guys looked at our GitHub or followed any of the instructions for this thing. But we’re a fast-developing project and we’ve been open source from day one. So, the state of the code base is what it is. It moves fast and there’s a bunch of attack vectors.
So, I’m gonna go over some of the theoretical ones and then show you some actual ones. So, this one is a…in my compiler course back in, I don’t know, the year 2000, this will date me, this was something my professor talked about. So, this is called the Ken Thompson hack. In his 1984 ACM Turing Award speech, he talked about the worst virus you could build. And this is a virus injected into a compiler. The compiler knows when it’s compiling the target of the virus. It also knows when it’s compiling itself. So, the source code for the compiler thereafter does not contain any evidence of the virus. So, imagine you have a compiler, you’re targeting our blockchain, we’re building in Rust, there’s only one Rust compiler. All right. Somebody can inject this particular vector. Rust is built with itself. The next version of Rust does not contain this virus, but the output always does. And it would be extremely hard for us to even detect this. And this virus could effectively, you know, ignore a particular public key while it’s doing a cryptographic verification. These kinds of attack vectors were all theoretical in the 80s. But because we live in this distributed economic new world, I think people are gonna try them, so it’ll be exciting to see what happens.
So, some other attack vectors. I think the famous one, from XKCD, is the $5 wrench: simply bashing somebody over the head until they give you the private key. But I think a much simpler one is a bored AWS employee that’s like, all these nodes that are running in AWS are all part of this blockchain, and it looks like the whole blockchain is being hosted in AWS, so I wonder what will happen if I push this button? All the code, or the majority of it, for us at least, is hosted on GitHub. So, a bored employee there could also inject something and we would almost never know. Git does provide some defenses against this, but oftentimes when I do a pull request, I just, you know, do a hard reset because I don’t want to deal with whatever messages it gives me. We have an interesting thing: blockchains, right, are a virtual machine that are all executing the same instructions, probably in the exact same order of operations. So, Rowhammer is an attack where you hit the same memory location over, and over, and over again until you flip a bit in a memory location that’s physically right next to it. So, this attack can actually go through a hardware enclave, depending on how these enclaves are designed. If the physical memory is actually next to each other, and a contract hits the same memory location over and over, it can cause a bit flip in the neighboring row. This could leak over to the state that’s right next to it on all the machines that are doing this verification, right? Think about it: this one attack can leak into and corrupt code in physical memory on all the computers that are supposed to be doing this decentralized verification and validating things. So, when all of them sign the state, they all sign the same corrupted state and this thing goes on and nobody notices until they realize there’s extra money in somebody’s account.
So, I don’t know. Did you guys follow the readme or any of the instructions? If you actually go to my fork, all the answers are kind of there, but you can do that later, kind of go through them. So, the state of where we’re at is we have a very simple gossip implementation for our network. Are you guys familiar with what gossip is? Gossip is a protocol that’s used all over. I think every blockchain probably uses some version of it. For us, it’s UDP-based and written in Rust, so we’re not using libp2p. But our protocol is very simple.
You have a table which is a map of public keys to node structures. You have an index of when this table was updated and how many times it’s been updated. And you have a list of when you updated a particular public key. And you also have a list of when a remote node told you that they updated something. So, when you exchange messages, you simply say, “Hey, remote node, last time I talked to you, you gave me this update index. Give me all the stuff you’ve seen since then.” So, it reduces the overhead of how much data you need to broadcast between nodes to just their latest. And when you roll up to the latest, you can kind of squash all the other changes. This all works, except that these messages should be signed. If they’re not signed, when you ask the remote node, “Hey, give me information about everybody else that you’ve seen,” they could lie to you, right? So, they can simply munge the data to kind of monkey with you, or they can just give you a totally different view of the network that they want you to see. And then you’ll start talking to them. So, this goes back to the network attestation problem.
So, there’s a big TODO there. These messages should be signed and go through the GPU pipeline for spam filtering and stuff like that. So, that’s not implemented. So, what you can do on our testnet right now is you can actually munge this data, point it somewhere else, misconfigure it, and kind of take over the network. This is a very simple attack vector, but it’s kind of fun because you can actually see it working.
So, is that at all visible?
So, really this is all just a way for me to try to get you guys to test our code.
So, what’s hacking? It’s really just testing, right? There’s a bug that’s the attack vector, right? Write a test to demonstrate and then you put it in the fix. So, where’s our attack?
So, in this test, we spin up kind of a network of validators, and this is all within your local machine. So, you can build your own testnet on your own machine and then you can start manipulating the data structures that are distributed in this testnet. So, after you boot up this network, you can create what in my test code I call a Spyna, just a gossip node that connects to it and discovers the network. And then you can start injecting invalid states. So, when you do this, you get to see that the network doesn’t converge, that transactions effectively start failing because the data that’s being replicated is pointing to a dead port.
Is this interesting to you guys?
So, you can configure how many nodes you want to run. Because it’s just running on my local MacBook, 20 is plenty to kind of simulate the effects. And this is all written in Rust, so, we use Rust log which is awesome. And this whole thing is just a simple cargo test. Let’s see if the demo gods are on my side today.
So, right now this is uncommented, so we should actually see this thing run and converge and do the right thing. And all this test is doing is simply sending some signed transactions to a node and verifying that the balances are what you expect. And when you uncomment the attack vector, you inject some invalid data, this data gets propagated through the network through gossip, and you will be unable to see the balances actually propagate. And that’s a very simple way to hose the network.
We can come back to <inaudible>, it’s compiling. Geez.
That’s weird. I swear it worked right before I got here. All right. I updated the compiler, so, yeah, I ran a rustup update.
So, yeah, all these tools are totally automatic. When you run rustup update, you download code that you don’t really know what it is; it’s a compiler that we think we trust. It’s kind of interesting, and it is an awesome language, but there’s only one implementation of Rust. So, it is very easy for somebody to inject a specific attack vector into projects that are using just that compiler. And building a high-performance blockchain, you actually have to make design decisions for performance first, which means complexity. So, when things get complicated, they’re very hard to port. So, I honestly suspect there will only be one implementation of this. And that means that attack vector, right, will always be open, because there will probably only be one implementation of Rust.
So, what this thing is doing is just sending balances around and then verifying them against a node that’s part of this spun up testnet.
Cool. So, it even failed before that.
Okay. That’s really weird. So, let’s see if that thing works, but we can kind of go after the next attack vector. So, this is the code that actually does the attack. All it does is put up a node, connect to the network, and start injecting bad state. What’s cool too is right now, we don’t have consensus implemented. We’re implementing it as we speak. So, our very, very dumb consensus algorithm is to look at the network and count what everybody else thinks is the leader, without any Sybil resistance. So, you can actually inject a bunch of fake nodes and count and say, I’m the leader, right? So, you can easily take over the network. So, this is another interesting way to kind of monkey with the network. Again, this would be less likely if we actually verified that the messages are signed. And a very simple approach after that is to count by stake weight what the network thinks everybody should be pointing to, amongst other approaches, but all those approaches are kind of half-measures. The problem we’re actually trying to solve with security is this understanding that we can’t solve all the problems, right? We live in a world where all the code we write will have bugs, so we actually need to segment all this code and all these bugs from each other. And how do we do this? How do we solve a whole class of attacks? For us, we can actually implement a proof of stake enclave. So, going back to our first slide, if you guys remember it, our ledger is a verifiable delay function. So, we can actually extract the proof of this time passing as a separate data structure from the actual transactions, a very lightweight data structure. The enclave can verify that time is passing in the ledger that’s being presented to it very cheaply and very securely. All it’s doing is just running SHA-256. This enclave can have a very small amount of code, and this is our hypervisor.
This is the smallest thing that actually has security on the line, because when you vote with your stake, you’re potentially creating a slashing condition. So, when you can verify this historical record, you can verify that time is passing. You can actually verify that the validator messages you expect are present in this chain at the moment in time that you expect them to be.
So, when you submit your message, you’re doing a very, very small and simple calculation to say that the network I’m observing is actually behaving well, and I can submit my message and thus minimize the potential for actually being slashed. So, all these bugs like Rowhammer and all these other attack vectors become less likely to actually take over this private key that’s hosted, controls capital, and is actually providing security to the network. So, that’s in my mind very cool. So, even if we have bugs in the gossip layer and bugs in these other approaches, you can still do denial of service, you can still bring down the network, but you can’t force this enclave to actually vote on something invalid such that it’ll get slashed. And that is, I think, a very hard security boundary that we can provide. And what’s interesting is that because we have this verifiable delay function as part of the ledger, you’re not just implementing a signing oracle that you can throw arbitrary data at. This thing can actually verify that the chain it’s signing is valid and succinct and is derived from the rules of consensus. Cool. That was it. You guys have questions? You’re very quiet.
Audience Member 1: Can you explain the [inaudible 00:25:37] behavior?
Anatoly: I’m not familiar with that. But what happened?
Audience Member 1: There was an un-updated state within the state channel which caused a cursor function to allow them to remove the Ethereum and the BoostCoin from the [inaudible 00:25:57] channel.
Anatoly: Oh, interesting.
Audience Member 1: I think that’s similar to DAO.
Anatoly: Yes, there’s a whole other set of problems with doing any kind of cross-chain or state channel solutions, because there’s no way to do attestation between chains, right? So, you have to kind of trust the setup. So, yeah, that’s something I haven’t even thought of yet. Anyone else?
AM 2: Can I say something?
Anatoly: Go ahead.
AM 2: Can you [inaudible 00:26:28]
AM 3: Yes, the problem Solana solves is scalability. What are some security considerations that are gonna come up with [inaudible 00:26:35]? What do we need to think about when scaling dApps?
Anatoly: So, with dApps, I don’t think you need to worry about Ken Thompson effects. I think those are still theoretical, but when there’s money on the line the theory becomes more practical to somebody. For dApps, I think the attack vectors, beyond just the bugs, are centered on the economics. Part of the reason why you want a dApp, right, is you have some token representing a resource that the rules of the dApp derive from, and how people can acquire those tokens and kind of spam you, I think, is the real concern. That’s where the attack vectors are, but also just bugs. So, the way I think of dApps is I think of them as kernel drivers. You want to do the least amount of work in the kernel, because that’s where all the exploits can occur. Go ahead.
AM 4: This may be low-level to address. But I was trying to follow along with your [Inaudible 00:27:50] How to fix that.
Anatoly: Oh, totally.
AM 4: Something about readme with instructions to follow along.
AM 4: But I wasn’t sure [00:28:10].
Anatoly: Yes, so, there was like a Rust update in the middle of me working on this, I guess. So, all you need to do is add "as i64", and that’s a cast. And it’s cool because it’s Rust: it’s a safe cast, so it’ll actually verify that it fits within the i64, or it’ll throw an error, because the compiler enforces it. Go ahead.
AM 4: Thank you.
Anatoly: Oh, yeah, totally. Go ahead.
AM 5: So, when they do the [inaudible 00:28:53] update, is that SGX or…
Anatoly: Can you…it doesn’t matter. I actually worked on some TrustZone stuff back in the day at Qualcomm. You just need some segmentation between the thing that’s doing validation and the thing that’s actually doing the signing. In theory, you could use like a Raspberry Pi. I think we’re working on SGX; that’s kind of the most likely target because it’s available in like a laptop. Yes?
AM 6: [Inaudible 00:29:30].
Anatoly: So, I am an operating systems geek, not a cryptography geek. So, we picked the simplest VDF that works. There’s a lot of very interesting designs, if you guys read up on them. So, our VDF is just a SHA-256 loop. I think the technical term for a VDF has been expanded to imply an asymmetric time between verification and generation. For us, the verification time scales with how many cores you have. So, a modern GPU, like the ones in your phone, can do somewhere between 50 to, you know, thousands of times faster verification. So, for us it was really kind of [inaudible 00:30:21] law. When your, you know, two-year-old Nvidia card can verify a second of history in a quarter millisecond, that was good enough. Go ahead.
AM 7: Do you often think about Rust [Inaudible 00:30:31].
Anatoly: So, like I mentioned, I spent 12 years working in C, and you just can’t do this in C, to have like a structure with a hash map. It’ll just take so much work and so many bugs, and so many macros. I actually started this project in C. For about two weeks I was making good progress, and then I needed to get some cryptography libraries, because I’m not going to implement my own like the <inaudible> curves. So, I started thinking about how I’m going to manage libraries in C, and the source code, and it just quickly spiraled into me thinking about building a build system and a package manager, and I said screw this, I’ll try this in Rust, and in about a weekend I was ahead of where I was. So, Rust has no garbage collector; it can actually be memory efficient and you understand what memory you’re using. That is important any time you’re doing performance work. Our goal is to actually have Rust as the primary smart contracts language. And we’re kind of there. We can build stuff in C, and if you can build the program in Rust so that it takes less than 512 bytes of stack, you can actually use it. But there’s some toolchain work to actually enable the full suite that still needs to be done. Cool. Anyone actually trying to follow the Hackathon? Sweet.
AM 8: You mentioned something about [Inaudible 00:32:27].
Anatoly: Yeah. So, if you look at our GitHub page, there’s a readme that covers everything you need to do to actually run a testnet and a demo and actually ping your live testnet and see your transactions there. And I was hoping everybody would do this before they got here, but I suspected that, you know, it’s a conference, right? Go ahead.
AM 9: Yeah, there is something in the instructions about hosting the [Inaudible 00:33:02].
Anatoly: So, if you run the demo locally you don’t need to do that because use localhost. But if you’re trying to talk…like run a validator node you need to open UDP ports and kind of poke them through the firewall. So, I’d recommend using a cloud service or something like that. Go ahead.
AM 10: [Inaudible 00:33:42].
Anatoly: Some what?
AM 10: [Inaudible 00:33:47].
Anatoly: Oh, to deploy the network, there is a snap, a Linux snap which kind of wraps everything for you. We haven’t had the need to run like a full docker thing but do you… sure.
AM 10: [Inaudible 00:34:09].
Anatoly: No, it’s like an Ubuntu image with Rust. It’s fairly easy to deploy.
AM 10: The apps download the source code.
Anatoly: So, if you run like a validator, we’re using GPUs for all the elliptic curve verification. You actually need access to the GPUs, and the Nvidia docker is very busted. So, it’s kind of like…you can actually run the non-GPU version, it’ll be pure Rust and it will do like 35,000 transactions per second on my MacBook, but that’s no fun. So, if you need GPUs you don’t want to use docker anyways. Go ahead.
AM 11: Is there any chance you can give us [Inaudible 00:34:50].
Anatoly: Sure. It’s a good question. All right. This link will expire in one day. Let’s see.
Spam our discord, that’s where we do all our development. So, please be nice.
Let’s see if this actually worked.
Git reset hard. It’s my favorite command.
AM 12: Anatoly?
Anatoly: Yeah, go.
AM 12: [Inaudible 00:37:26] Just wanna know, what do you see…what are some security concerns, attacks, issues that see that are not happening that you would want to protect or at least start thinking about for the future.
Anatoly: So, like what’s been keeping me up is just key management like, this is like a real problem that…
AM 12: Wallets? Or..
Anatoly: Yeah, it’s just like we’re all bad at it. Like even like companies are bad at it and it’s very expensive to actually secure a key and cold storage or any kind of storage. You have to constantly do proof that that key’s actually still there and how do you…what do you do if it’s not like if there’s like, you know, a neutron hits that device and actually like creates an error.
AM 12: [00:38:15].
Anatoly: Sure, what if it gets lost, like, I mean, right? You’re effectively guarding gold. So, there’s a whole chain of people, and you have to manage the people. I don’t know what this space is gonna do when we actually need to deal with this. I think eventually people will just use something like Coinbase, but have wallets that are storing just their kind of $20 worth of crypto, because that’s what you can afford to lose.
AM 12: [Inaudible 00:38:53].
Anatoly: But I know like, do you…do you do your own key management?
AM 12: I do both.
Anatoly: I guess. So…
AM 12: Some on is used, some on [Inaudible 00:39:08]
Anatoly: Because I worked in hardware, I like don’t trust the hardware wallets. You’d have no idea, right? You have no idea if the self-generated keys are actually generated with a secure random source that’s not just totally hackable, like…
AM 12: That’s definitely an attack on random numbers [Inaudible 00:39:30].
Anatoly: Yeah, that would be the easiest vector, and even the company might not even do it maliciously, right? It’s just accidental, like a misconfigured, you know, whatever entropy they’re using. There’s just so many things that can go wrong in that whole chain that I wouldn’t trust it. So, yeah, splitting your keys and printing out the pieces of paper and storing them in like safe-deposit boxes, probably, I don’t know. But yeah, I don’t have that much crypto, but most of it is in Coinbase, sadly. In terms of dApps, I think there’s a lot of stuff that we haven’t really figured out, especially when communicating between applications. How that’s going to work, and whether those channels are secure, and how error propagation will go back and actually execute back in the state in a robust way. Right now Ethereum is so slow that we put the smallest amount of stuff on it. If we have a fast chain, you’re gonna start putting more and more things on there, and those things are going to get complicated, and those state transitions are gonna get more and more complicated. And it’s gonna become harder and harder to test. So, what do you do when a dApp effectively deadlocks, right? There’s just so many things that get worse as code gets more and more complex. Cool. Thank you guys for coming.