Software Huddle - Distributed Financial Databases with Joran Dirk Greef of TigerBeetle
Episode Date: October 24, 2023. In this episode we spoke with Joran Dirk Greef, who's the co-founder at TigerBeetle. TigerBeetle is a financial transactions database that's focused on correctness and safety while hitting orders of magnitude more performance than other solutions in the space. We touch on various topics like what makes TigerBeetle orders of magnitude more performant, io_uring, the choice of Zig for TigerBeetle, protocol-aware recovery, the VOPR and so on.
Transcript
That's why I love, I've always loved MIT and I've loved Apache because you can build businesses on these licenses.
And it's not zero sum.
Again, it's all about trust.
So Operating Systems: Three Easy Pieces.
For some people, it's like K&R C.
It's like up there.
And it's by Remzi and Andrea Arpaci-Dusseau.
And it's like my favorite computer science book. But then in the world of energy, I've kind of been learning this as we got into Tiger
Beetle is that our energy sources are changing.
And that's fascinating because the sun is more transactional when you switch to solar.
Energy prices change every second, every minute.
And you can be so much more efficient if you are able to transact energy at a much higher resolution.
Hey folks, this is Alex DeBrie. I have a great episode for you today. I spoke with Joran Dirk Greef,
who's the co-founder at Tiger Beetle. You know, Tiger Beetle makes this financial transactions
database that's really focused on not only correctness and safety, but also performance,
right? Getting orders of magnitude more performance than other solutions in the space. I think Joran and his team are really interesting just in the
content they put out there and the way they think about developing new databases is really
interesting. A lot of interesting practices, you know, one's called the VOPR. It's this
distributed simulation testing tool that helps them discover distributed systems bugs, safety
bugs, liveness bugs much quicker than they would
in sort of regular testing scenarios and things like that. So I tell Joran a few times here,
I think he's really living in the future just in terms of how he sees, you know, not only what
database needs are going to be, but also development needs and the tools they're building on. So I hope
you enjoy this show. As always, if you have any suggestions or comments or anything like that,
feel free to reach out to me. I'd love to hear about which guests you'd like to see on next.
Other than that, enjoy the show. Joran, welcome to the show.
Hey, Alex. Thanks for having me. Pleasure to be here with you.
Absolutely. You're the CEO and co-founder at Tiger Beetle, which is a new
database for financial transactions, really focusing on performance and speed.
And I'm super excited to talk to you because I sort of feel like you're living in the future in some way.
You have these sort of opinionated, idiosyncratic views about just how things are changing,
how technologies are changing, and what the implications are for databases and how they should be built and used.
So I'm excited to hear about that today.
I guess for the guests, do you want to just
give them a little bit more background
on you and your history?
Yeah, well, thanks.
So nice to hear that.
I kind of feel like I'm just living in the past
and trying to just get a chance to put it into practice.
Over the last 30 years, we've had so much research,
so much hardware advancement.
And looking back on that...
Yeah, so a bit of background on TigerBeetle.
We started the project in July 2020, so just a little over three years ago.
And to put that in perspective, it's interesting because, again, just the timing of things.
So we'll get into all of this hardware research.
How do you build a distributed database today?
And we're just so lucky that you've got new, safer systems languages, Rust and Zig.
So many things have changed.
We've got io_uring.
But at a high level, the big picture of Tiger Beetle is it's something that's fast, but also small.
So what sort of OLTP database do you use today if you want to track financial transactions?
That was the question as we got into Tiger Beetle.
We were working on an open source payment switch, and this was using a traditional 30-year-old OLTP database, MySQL.
And that experience was interesting.
At the end of the day, we realized we could work on this existing system and optimize it for a year with consultants and all of that cost. And maybe we could have an incremental improvement
and this payment switch would be more efficient
and more scalable.
And then we thought, well, you know,
we could put this year to use and see, well,
how would you build an OLTP database today
for tracking financial transactions?
And that became TigerBeetle.
Instead of being incremental, how could we go further?
Typically these systems do a thousand to 10,000 financial transactions a second,
and that's using several hundred dollars a month of cloud hardware. And we wanted to say, well, how can we do a thousand times that?
How can we do a million a second using the same hardware or less?
And then how can we be much safer given Postgres and MySQL are tried and tested for 30 years?
We must have seen where they fail, where they fail the test.
Is that the case today? Are
there situations where these databases are not as durable as we think or as highly available?
That was the timing and the context and the goal. How can we be three orders of magnitude faster, but with the same unit of scale?
So this was really about asking, how do we optimize our unit of scale?
You know, the cloud is already more than 10 years old today.
And so we've actually had a lot of cloud databases that were also designed, you know, more than 10 years ago.
And, you know, how have things changed?
And I think in the past, it was a lot about,
let's just try and scale. Let's forget about cost. Let's just scale. Let's have a design that scales.
And today, I think the question is different. Today, the question that we wanted to ask was,
how can we scale cost efficiently? But the way you can answer that question then is you
can say, well, let's play with the unit of scale. Sure, we could add a thousand more machines and
hope that we get to a million a second, but that's very cost prohibitive. How can you take the same
machine and with that track a million financial transactions a second? So that was the goal of TigerBeetle.
Absolutely.
And are you building for, I guess, sort of like existing financial use cases, like giant
banks and stuff that have these systems that are processing all those transactions and
they need a way to scale better?
Or is it like, hey, we think there's a future out there that there's just like more financial
transactions or just more use cases whether that's
micropayments or, like, you know, in-game payments, things like that, or even, like, usage-based billing
for SaaS? Like, are you building for that new world? Or is it also like the old world really
needs this, they're having trouble scaling?
Yes. I love, I always love the question where, you know, it's either-or, and then you answer it with yes. So it's OR, not XOR, and both. And we came at it from the context of micropayments. So this was a switch that needed to do a lot of payments very cost efficiently, because the value was like $5 or whatever.
So fees become very important at that scale.
But Tiger Beetle, it solves a lot of the same problems.
They're all kind of interrelated.
If you can build something that can do a million a second, you can help the very large-scale systems, or you can trade that performance for a very small
hardware footprint and be very cheap, and it's very easy to operate. So yeah, so both. But we
did come at it not from the crypto world, I've never been involved in that, but from the micro
payments. That was the impetus for TigerBeetle. Because of where the world is going, and this is kind of
interesting, you know, you touched on it, a future world. So gaming, you know,
like you get all these economies in these massive multiplayer games, and we actually had people
reaching out to us
saying that their database was falling over.
We'd have a call and their pager would literally go off.
I kid you not, this actually happened.
At the end of the call, the pager went off.
It said, our ledger database is falling over
because the game was doing so well.
They were doing so many transactions.
But then in the world of energy,
I've kind of been learning this as we got into
TigerBeetle, is that our energy sources are changing. And that's fascinating, you know, because
the sun is more transactional. When you switch to solar, energy prices change every second, every
minute, and you can be so much more efficient if you are able to transact energy at a much higher resolution.
And it turns out existing databases, they struggle to keep up with this more transactional
energy future. So energy systems are looking for Tiger Beetle and then the traditional fintech
world. And then that too is interesting
because so much of that is moving to new technology now.
And then you get this problem
where a lot of open source OLTP isn't really online.
It's not highly available.
It's single node, and it doesn't fail over to multiple data centers.
You need to use proprietary solutions to handle that.
And what we also thought was,
let's see if we can shrink the footprint of the system
so you can run it on a Raspberry Pi.
What would that unlock?
You know, then that's what I'm really excited about.
All the use cases that we can't imagine
because you can just give people, you give people three new orders of scale or efficiency
and then let them come up with...
And again, usage-based billing.
What was interesting I heard,
there was actually a conference a while back
with people like GitHub and places like that.
They were actually sharing hacks for how they do billing
because billing scale has increased
that they can't actually track it all.
So they have to like try and probabilistically charge people
because they can't afford.
Approximate it, yeah.
Yeah, you know, with like serverless billing,
how do you track, you know,
so many seconds in a day and usage that just explodes?
So if you have something that just very easily thinks in a new order of magnitude, it becomes easier. So that was the other goal, you
know: safety, performance, but also operator experience. Let's just have, like, a little
ant that can do a lot of work, and, you know, you sleep well at night too.
Yep, that's amazing.
and that raspberry pi one it kind of reminds me we're seeing like this trend with like SQLite and
Turso and different things that people are doing of just like, hey,
having lots of little tiny local databases or things like that.
So it's, it's cool to see, I don't know,
all these different changes that are happening, you know,
giant databases in the cloud,
but also small local ones as well. So.
Exactly. I think, yeah, it's definitely, like, I love that, you know,
what people are doing like Turso with SQLite.
And I learned
a lot from, you know, from Glauber Costa
when he wrote about io_uring as well,
that made an impact on TigerBeetle,
the ScyllaDB people,
Redpanda,
Martin Thompson of LMAX.
They all, you know, think similarly
but what I love is when people say,
well, let's just allow people to have 50,000 databases.
What will happen when you just think,
you just play with the experiment and tweak the experiment
instead of being incremental,
because it's just more fun and more efficient.
Yep, absolutely.
So you just spoke at QCon, and I've heard just like a lot of
really great things on it from Twitter, from people that were there and things like that.
I can't wait to see it. I don't want to rehash the whole talk, but I do want to get people sort
of interested. And I think also set up the problem you're looking at and how these technical changes
impact it. So like one thing you say is like the OLTP workload is becoming more contentious.
Can you tell me what you mean by that? Yes. So this was something, I think, that was totally
new for me. What's interesting with OLTP, just a bit of backdrop. Well,
let's start at the beginning and then we'll come back, you know, to QCon.
What we saw, looking at this payment switch, analyzing this payment switch,
where are the bottlenecks, and you see that the switch is really trying to do transactions
in the logical sense, business transactions: who, what, when, where, why, how much. It's business, everyday
business. There's counterparties, there's a transaction that happened in the real world.
And that was all the switch was trying to do. And when you trace the physical SQL queries,
the database queries, for every logical real world transaction, there would be 10 to 20
physical SQL queries. To implement something that was quite standardized, because what the switch
was really trying to do, like every business, is record money. You're doing double-entry
accounting, there's debits and credits, it comes down to that. If the database doesn't do it that way to begin with, next month
the auditors will have to convert it into that format anyway.
And so we saw the switch trying to do debit credit.
All the code was built around the generic general purpose database.
And you had this query amplification, which was costing you,
because if your switch can do a thousand a second,
it's doing 10,000 network requests a second behind that, talking to the database. So that's very
tough to scale, because whenever you scale, you just multiply by 10. And typically all
these systems have this. You chat to other people, you know, in India,
working on the massive nationwide UPI switch,
and it crossed 10 billion real-time payments in August.
They can't use OLTP databases anymore
because it just doesn't handle the scale.
So they reinvent OLTP over a durable log, Kafka
or, you know, Redpanda today. And then they do the
transaction processing logic in Java services above Redis, just because it doesn't keep up.
And this was really the problem: how do you do logical transactions when we've got an impedance
mismatch? So coming to the contentious workload, this is the other interesting thing. People, like, I grew up in the era of, I think we both did, you know, eventual consistency and let's scale, let's go horizontal.
People thought Moore's law was dying 10 years ago.
And so, well, we're going to have to scale out. And the problem though is that the workload that these switches do in most
financial transactions is there's only so many bank accounts. And what I mean is that you could
have like a hundred million customers and you could have a thousand machines and partitioning
your customers across those machines fine. You only have so many bank accounts like 10. So if you have a thousand machines and a
thousand shards, all those shards are going to serialize and bottleneck on the one shard that
has your bank account or your fees account. So this is classical, the way the real world works
is you always have on one side of the equation, you can go horizontal, sure.
But on the contra side, there are only so many bank accounts,
so many energy providers, so much inventory in the game.
You're always limited.
So the horizontal scale-out strategy works very well
for S3 or key value storage.
But this is what we started to ask.
Does it still work well today for transactional workloads?
Like, for example, the switch could also be deployed
where there's only four banks.
So you literally only have four accounts,
so you just can't shard it logically.
You know, you could have four machines,
but they're all going to wait on each other.
So it's actually better just to have one machine and scale up.
And so this is the trouble of the contentious OLTP workload.
But it's also, we're starting to get into, what is OLTP?
Is it the logical real transactions or is it database transactions?
Which came first?
Who got the name from whom?
Yeah, interesting. I'm excited to talk a little bit more about how you're solving that issue and what implications it has. But I want to
talk a little bit about some technical stuff because you've talked about changes in hardware
changes in like core software libraries or languages that are, you know, just
changing what's possible and things like that. So maybe walk me through this. This is like an area I try to read up on and stay up on,
but it's just also like so far removed
from what I do day to day.
So I might ask some dumb questions.
But first of all, like, I hear a lot about io_uring.
What is io_uring?
Why is it such a big game changer when it was released?
Yeah, so huge thanks to Glauber.
He wrote this magnificent post on io_uring in early 2020 that tipped the scales. I knew about io_uring, was thinking about it, but that post really, I mean, you've got it. If you want to read and write from the disk or from the network, what's interesting is it's
actually the same essential syscall. You're reading into a buffer that you provide.
So you give the kernel a buffer and you say, read into this buffer from a file descriptor for me.
Or you say to the kernel, here is a buffer that I have, write from this buffer to a file
descriptor, which can be disk or network. This is pretty much it. Then you've got a problem where
on Linux traditionally, you couldn't do that asynchronously. So what I mean is that if you call that syscall,
your program will block while the kernel goes and does that work,
and then it comes back and tells you it's done.
And then obviously, if you're all familiar with JavaScript
and how Node.js works, there was that async revolution
with libuv under Node.js
where it was like, well, we can...
So Linux didn't really have the proper APIs
to do this asynchronously.
So how do you let your program keep running
while that task is off executing?
While you're waiting on the network
or you're waiting on the disk?
Because you want to keep going so that you can do CPU and IO.
You can pipeline them.
Otherwise, you're wasting CPU.
And Linux didn't really have good APIs for this.
And at the same time, people like Martin Thompson were talking about mechanical sympathy
and context switches.
These happen as you make the syscall.
You're switching context.
You're messing with the flow of the CPU.
And I mean, as programmers, we know if someone interrupts you,
it's expensive.
But the CPU, it gets really expensive.
And this was becoming more of a thing
because CPUs were getting faster.
And IO was getting faster too.
So what you saw is that NVMe is so fast that the context switch is becoming as expensive as just doing IO.
So the interface is starting to cost as much as the work itself.
It's not exactly the same, but within the same order.
And so the interface was becoming a problem.
And up until that point, people, like, you know,
if you understand libuv, you have a user space thread pool.
So you have a control plane,
your program, your code is executing.
If you want to do some async IO,
you drop that into your user space thread pool.
That will then do the blocking
syscall to the kernel.
The problem is, yeah, again, context switches,
you know, going across threads.
So io_uring came out just before and solved all of this
by giving very nice ring buffers.
So you have a submission queue ring buffer from user space
to kernel space to send commands.
And then you have a completion queue
going the other direction from kernel space back to user space
to give you the results of these commands.
And it's just shared memory with the kernel.
And it's pretty much efficient.
So no more context switches, no more syscalls.
You just drop things at the head of a queue,
read things off the tail of a queue or vice versa.
And it's a beautiful design.
And it's very sympathetic to the hardware. This
actually isn't the big performance insight for TigerBeetle, though. I think that's the
interesting thing. This is like a marginal, incremental improvement. So it's the right
interface, it's far more efficient. And the reason why we used io_uring is also because it's just so much easier. Because with the Linux APIs, you're actually doing the same thing, whether you work with network or disk, but the Linux APIs for these were totally different.
So you could do async networking, but not async disk.
But with io_uring, I think this is the biggest contribution, is you get a unified API for doing all async
network, async disk. Perfect, beautiful. It's perfectly efficient. It's very simple.
And you have one interface. So that's what I really liked. We didn't have to go and write
libuv again.
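(For readers who want to see the shape of that API, here is a minimal liburing sketch in C, assuming a hypothetical local file called data.bin. It illustrates the submission-queue and completion-queue pattern described above; it is not TigerBeetle's code, and the same submit-and-reap pattern covers sockets as well as files.)

```c
/* Minimal liburing sketch (illustrative only): queue one read, submit, reap. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    /* Create the submission and completion rings shared with the kernel. */
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];

    /* Describe the read in a submission queue entry: no syscall yet. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* One call submits the whole batch of queued operations. */
    io_uring_submit(&ring);

    /* Later, reap the result from the completion queue. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

(With the kernel-side polling setup flags, even the submit call can avoid a syscall, which is the "no more syscalls" point above.)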
Yeah. Okay. So you mentioned that's not TigerBeetle's main performance secret or key there. What is the key that TigerBeetle is using there?
It's maybe 10% here and there, and you add all of that together, and it's just, it's the right way to do it,
but it's also simpler.
So you've got less code if you use io_uring.
It's just, it's much nicer.
What would be the right thing?
And it would be the right,
no technical debt,
but it would actually just be easier and quicker and safer,
simpler.
So that's kind of why we made all the decisions we did. Yeah. it, but it would actually just be easier and quicker and safer and simpler.
That's kind of why we made all the decisions we did.
What was the big performance
win? The thing
that gave us four orders of magnitude,
not three. We actually got four,
I think. That was this insight,
coming back to this query
impedance mismatch, logical versus physical
transactions.
We thought we have to solve this problem
because we're losing an order of magnitude
if your amplification is a factor of 10
So we thought,
well, the information
that you're really tracking is who,
what, when, where, why, how much.
That is in the data plane,
hot data,
but it's actually small.
You could fit that into 128 bytes.
And then a lot of the big data, variable-length data,
that was in the control plane.
That was like, who are our customers?
What's their favorite color?
And that stuff doesn't change.
It's not in the data plane.
But the transactions that they do, that's in the data plane.
So we realized, well, we can separate these.
Before it was just all mushed together.
We said, well, let's separate.
Let's have this idea of transaction processing for our business transactions.
And then we take our business entities out,
because they're not the business events, they're the business entities. Let's put those in a general purpose, or what we call OLGP,
you know, online general purpose processing. Let's have an OLGP control plane, Postgres or MySQL,
they're great at that. And let's have, like, what if we had a real OLTP built today?
And then what we thought was, well, this is what the switch really needs is financial transactions.
It's double entry debit credit.
So let's pack 10,000 debit credits in one meg.
And then in a single database query, like one network round trip, you've executed 10,000 business transactions in one database.
Because then, again, you're playing.
You're taking what was a problem of query amplification.
You're judoing it, like inverting it and saying, well, I see your 10x and I'm going to make it a fraction, so that in one request we've done 10,000.
So this was the motto, you know, let's do more with less or more with the same.
Because that way it's just so nice.
You know, if you need to do a million a second,
I don't know how many network requests that is, you know, but you can work it out, you know, very easily.
It's not 10,000 a second.
Yeah, yeah, very cool.
Yeah, so 100 a second and you're there.
You know, and we all know, you know,
Node.js can do 100 network requests a second.
Not that we wrote it in Node.js.
Yeah, no.
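(To make the batching idea concrete, here is a rough C sketch. The field names and layout are hypothetical, invented for illustration rather than taken from TigerBeetle's actual wire format, but they show the idea: fixed-size 128-byte double-entry transfers, thousands of them packed into a single request of about a megabyte, so one network round trip carries thousands of business transactions.)

```c
/* Hypothetical 128-byte double-entry transfer, invented for illustration;
 * not TigerBeetle's actual layout. Requires a compiler with __int128. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    unsigned __int128 id;                  /* unique transfer id */
    unsigned __int128 debit_account_id;    /* account to debit */
    unsigned __int128 credit_account_id;   /* account to credit */
    uint64_t amount;                       /* how much: the hot data plane */
    uint64_t timestamp;
    uint64_t user_data;
    uint32_t ledger;
    uint16_t code;
    uint16_t flags;
    uint8_t  reserved[48];                 /* pad to a fixed 128 bytes */
} Transfer;

/* Thousands of transfers ride in one request of about a megabyte, so a
 * single network round trip executes the whole batch instead of issuing
 * 10 to 20 SQL queries per business transaction. */
enum { BATCH_SIZE = 8192 };
typedef struct {
    Transfer transfers[BATCH_SIZE];        /* 8192 * 128 bytes = 1 MiB */
} TransferBatch;

int main(void) {
    _Static_assert(sizeof(Transfer) == 128, "transfer must stay 128 bytes");
    _Static_assert(sizeof(TransferBatch) == 1024 * 1024, "batch is 1 MiB");
    return 0;
}
```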
Speaking of, what are you using?
I see you're using Zig.
Tell me about the choice of Zig for Tiger Beetle.
Yeah, yeah.
So that, yeah, like I said,
so that was the big win,
you know, just fixing the interface.
Kind of like io_uring all over again,
you know, the interface is the problem,
how we talk to the database.
And yeah, just to add a little bit more there,
well, we can come back to this.
I think OLAP, you know,
if you compare OLAP and OLTP,
you get a similar answer.
But on the language: we prototyped TigerBeetle in five hours,
just a quick performance sketch of network protocol,
fsync, checksumming, what could the whole thing do?
And we saw, well, this could do like 400,000 a second,
just fixing this impedance mismatch.
And this also solves the contention problem,
interestingly enough,
because it gets rid of row locks
within the database.
But then
coming to language now, we thought
again, what's the
right thing? Should we invest
in systems languages of the last
30 years, like C or C++?
And obviously
writing a distributed database
like Tiger Beetle, it's a big investment.
So it's going to take time.
We're going to take time to, you know, to build this.
And we're three years in and coming to our first production release.
Which is pretty quick, but still, three years is long.
So we also realized, well, we've got three years of grace.
We don't have to pick something in 2020
that is as widely adopted as Rust was then.
We could actually just pick what we thought
would make the best choice for a database.
And this comes to hardware advances.
Our feeling then was that
context switches are more expensive.
But the other thing was that
CPUs
are really fast. Disks
were becoming really fast with NVMe.
Networks also. You can stripe
the disks if you ever have a
disk bandwidth problem. You can just stripe.
Networks, you know, the pipes are so big. Network latency isn't going to change, but bandwidth is just
increasing. But memory bandwidth, that's the thing, you know. Because if you have
a CPU socket, how do you stripe memory on it, just to increase your memory bandwidth? You can't
really, you know. It's capped.
We thought, well, this is really where, this is going to be our thing.
Numbers every programmer should know.
You've got the different colored columns, like blue, red,
green. I don't know which color
the memory column was. I think it was green,
but we said,
in the past, we used to think of disk seeks
and spinning disk as the problem.
Let's just move into another column. Let's
treat memory bandwidth as the problem.
And then
coming out of that, the decision
was Zig. It could
have been Rust,
but we wanted a single-threaded
control plane, because
we thought, in our context, we don't
actually want multi-threading anymore,
fearless concurrency. We just wanted a simple single-threaded control plane. It's easier.
So the borrow checker could still have helped us with logical data races on a single thread with concurrency, but the memory efficiency sort of won the day, because Zig is obviously great. You can choose your allocator
and you can be very specific around alignment
in the type system.
We wanted to do direct IO
because again, you can save memory copies
to the page cache
because you don't use the page cache.
So it was all about memory,
like zero copy, zero deserialization.
These are again, those cherries,
but they add up.
And so we picked
Zig because it's just so precise.
We didn't see anything
to us that seemed
more efficient for working with memory
like just being really sharp
around memory layout.
But the other thing is Zig fixed all the
problems with C. So C would have been
the other choice because we couldn't choose Rust
because we had to handle allocation failure.
That was like a thing.
We adopted NASA's Power of Ten rules
for safety critical code.
One of the rules is static memory allocation,
which we don't see in the C sense,
you know, in your stack.
We see that as, okay,
you can use dynamic allocation at startup,
but there's an initialization
where you allocate everything you need.
And then at runtime,
there's no more malloc or free.
So everything is statically known,
like in the logical sense of the word.
So all your memory is statically allocated.
You have limits on everything.
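(A toy C sketch of that discipline, under my own assumptions rather than TigerBeetle's actual code: every limit is fixed up front, all memory is acquired once at startup, allocation failure is handled once at that point, and the steady-state loop never calls malloc or free.)

```c
/* Toy sketch of "static allocation" in the logical sense: fixed limits,
 * one allocation pass at startup, no malloc or free at steady state. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MAX_CONNECTIONS 64            /* every limit is decided up front */
#define MAX_TRANSFERS_PER_BATCH 8192
#define TRANSFER_SIZE 128

typedef struct {
    unsigned char *transfer_buffer;   /* sized once at initialization */
    size_t transfer_buffer_size;
} Runtime;

/* All allocation happens here; failure is handled explicitly, once. */
static int runtime_init(Runtime *rt) {
    rt->transfer_buffer_size =
        (size_t)MAX_CONNECTIONS * MAX_TRANSFERS_PER_BATCH * TRANSFER_SIZE;
    rt->transfer_buffer = malloc(rt->transfer_buffer_size);
    if (rt->transfer_buffer == NULL) return -1;
    memset(rt->transfer_buffer, 0, rt->transfer_buffer_size);
    return 0;
}

int main(void) {
    Runtime rt;
    if (runtime_init(&rt) != 0) {
        fprintf(stderr, "not enough memory for configured limits\n");
        return 1;
    }
    /* ... event loop: reuse rt.transfer_buffer forever, never allocate ... */
    free(rt.transfer_buffer);         /* only released at shutdown */
    return 0;
}
```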
But Zig again was better for this
because we could handle memory allocation failure at startup,
and at runtime, you know, we could use the standard library
and know that there's going to be no hidden allocation.
So we actually didn't want abstractions.
You know, abstractions always carry a cost.
They've got a probability of being needed. So the zero-cost thing, I was nervous about that.
We wanted fewer abstractions, a minimum of excellent abstractions is how we put it. And
Zig was very explicit about all this. So those are just some of the reasons, but it just seemed, you know, such a
small language, and comptime is so
powerful. Again,
like, for safety, macros can be really
dangerous.
We didn't want macros, we wanted
comptime, and we wanted
to do shared memory with the kernel,
and we use
a lot of intrusive
data structures in Tiger Beetle because it is simpler and safer in a logical sense.
And Zig is just very natural for doing that.
And the code is very readable.
The community around that switch happened to be JavaScript developers.
And we said, well, hey, Zig reads like TypeScript.
They wouldn't be able to write it, but they could
understand it. So that was quite a
big win.
Very cool. Another thing I hear
you and the TigerBeetle folks
talk about a lot is protocol-aware
recovery.
Maybe just tell me a little bit of background behind that paper,
what the implications are and what that means
for you building a database.
Yeah, thanks, Alex.
This was, again, just being lucky with timing.
Because we thought, well, we can solve the interface.
You've got low latency batching.
You actually get better latency because your system has more throughput,
so there's less queueing.
It's kind of counterintuitive,
but you have a new database interface
and you can just do 10,000 per query.
We've got this really efficient,
memory-efficient language.
We've got IE ring.
But we realized,
like if you're going to say to a payment switch,
okay, we've got a new OLTP database for you.
It's going to be as safe
as something that's 30 years tried and tested.
They're still going to be nervous.
They're going to say, well, see us in 30 years.
It's great, you know, but really it's table stakes.
But we realized it's not enough
to be as safe as Postgres or MySQL.
But then my hobby was always following the FAST conference in Santa Clara. And that was all the storage fault research from around 2008, over the last decade, coming from Wisconsin-Madison.
So I don't know if you've read OSTEP, Operating Systems: Three Easy Pieces.
No, I haven't read that one.
Write it down. I'll link that in the show notes.
Okay, cool.
So Operating Systems: Three Easy Pieces.
For some people, it's like K&R C.
It's like up there.
And it's by Remzi and Andrea
Arpaci-Dusseau. And it's like
my favorite computer science book.
I actually wrote to Remzi. I said, you know, it used to be
K&R and now it's yours.
And I meant it because it's got so many great research papers
at the end of every chapter.
They've got these great examples all the way through
how to understand an operating system in terms of memory,
concurrency, and storage, I believe.
Yeah, maybe also the process model, I'm not sure. But those same authors have sort of
been the vanguard of all the storage fault research coming out of Wisconsin-Madison. So
looking into file systems, all the file system bugs: do file systems give us durability? The answer
is no, they don't. ZFS does, that's sort of the answer. And again, for databases. And there,
it was very interesting because you will know, 2018, there was FsyncGate. So Craig Ringer,
he ran into real data loss actually because of Postgres. There was a latent sector error on the disk, like a temporary IO error,
but he actually lost data
because of a routine fault
that could have been correctly handled by the database.
And the kernel was also to blame,
again, the interface was surprising,
but still, you know, with direct IO
this could have been avoided.
And so that was 2018.
And then it affected almost everybody.
MySQL, I think SQLite, Postgres.
And everybody patched it by crashing the database.
And at startup, they come back and read the write-ahead log and recover from the log.
That's how they solved FsyncGate.
And then in 2020, literally, I think a month before the design of TigerBeetle,
UW-Madison came out with a paper, Can Applications Recover from fsync Failures?
And they looked into FsyncGate and asked the question, did the fix work? And it didn't.
And the reason is because the databases would come back after the crash, recover from the
write-ahead log, but they didn't use direct IO.
So they didn't realize they were actually reading
from the kernel page cache in memory
rather than recovering from what's durably on disk.
So the database is now externalizing decisions to users
and saying, okay, I committed your transaction.
And actually that transaction data was only in the page cache,
never durable.
So still the same problem.
And Postgres, you know, in 2018, they immediately started on direct IO.
It's still not in, you know.
Some of it is in, and I'm not sure how far now, but as of the last few months,
it's still, like, sort of preview mode, you know, getting into production.
So this is, you know, how do you fix FsyncGate?
You really need direct IO.
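(A minimal C sketch of those two lessons, with an assumed file name and block size for illustration: open the data file with O_DIRECT so recovery reads come from the disk rather than the kernel page cache, and treat a failed fsync as fatal, crashing and recovering from the log, instead of retrying and carrying on.)

```c
/* Sketch only: assumed file name, 4 KiB block size, Linux-specific O_DIRECT. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* O_DIRECT: reads and writes bypass the page cache, so what you read
     * back after a crash is what is actually on disk. */
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires block-aligned buffers, lengths, and offsets. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 0, 4096);

    if (write(fd, buf, 4096) != 4096) { perror("write"); abort(); }

    if (fsync(fd) != 0) {
        /* Do not retry: the dirty page may already be gone. Crash, then
         * recover from what is durably on disk. */
        perror("fsync");
        abort();
    }

    close(fd);
    free(buf);
    return 0;
}
```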
But then the same people from Madison, that same year, in 2018, they had this paper, like you mentioned, Protocol-Aware Recovery for Consensus-Based Storage. And this was asking,
so FsyncGate was about a single node database,
like MySQL, Postgres, one machine,
and you have just one sector fail,
and you see the database escalates that into data loss.
And protocol-aware recovery took that storage fault model,
which basically just says disks
do fail. ZFS
has been leading the charge there.
And protocol-aware recovery
took this world of storage research
which was like one discipline, one
domain, and then you had the world
of distributed systems with a whole other
department.
And in the distributed systems world,
you had Paxos and Raft, and they would have
formal proof. And the formal proof
would say, you can't use the
clock. You have to have
logical timestamps.
You can't depend on physical time.
And you have to...
You can't trust the network.
The network has a fault model.
Packets are dropped.
Even if you use TCP, you can't get total order out of TCP.
Otherwise, we wouldn't need consensus.
But TCP connections get dropped, and then all bets are off.
So then comes Paxos and Raft.
And they had formal proofs.
And the model was like, yes, network fails.
Clock can't be trusted.
You know, you can't, you've got to solve this problem without using clocks.
And then processes crash.
And that was it.
And then they said for storage, yes, we're going to depend on storage to write the votes.
You know, each node is going to remember the vote that it makes.
It's going to record that vote to stable storage that doesn't backtrack.
You know, once you make a promise, you've got to keep that promise. Otherwise,
you're going to split-brain. So that's how all these systems worked in the distributed world.
And UW-Madison said, well, it's interesting, you know, that you say storage is stable,
that it's perfect and pristine because it's not actually
the reality. Because users,
again, we've already seen this with
Postgres, with FsyncGate, you get
real data loss. So they ask
the question, well, let's take
a replicated database,
any of the well-known
ones, let's put a
single sector fault
on one machine. So you've got a cluster of five.
We'll just corrupt one sector in the log of one of the machines. And then they found that
that would actually result in global cluster data loss. And that really surprised everybody
because people thought, well, redundancy implies fault tolerance. And
UW-Madison came out with a paper before protocol-aware recovery, and they said, well,
redundancy does not imply fault tolerance. And then protocol-aware recovery was, well,
how do you actually build a distributed database if you assume there is also a storage fault?
If you're going to have a formal proof, is there also a storage fault model? So we took protocol-aware recovery, and there are two big papers behind TigerBeetle.
And there are a few, but there's Viewstamped Replication as our consensus protocol, because that was the only one that actually worked completely in memory.
So I thought, well, this is not only the first consensus protocol, a year before Paxos. To me,
the 2012 paper was easier to understand than Raft, but it also seemed to have good sensibility
that you shouldn't trust the disk. It's just easier to get this right if you don't have to
trust the disk to be perfect. So we started with that. But the second paper that was key was protocol-aware recovery because that
showed us how we could
actually do storage,
stable storage for our consensus
protocol. We obviously don't
want to be in memory. We want to get
it to disk at some point.
How do you do this?
The nice
thing with this, what is really cool,
it's so obvious. People say, well, why not just
run ZFS RAID underneath each node in your cluster? Use local redundancy; ZFS RAID would solve
your storage fault model. The problem is, you know, if you have a cluster of three and you do 3x RAID
on each, you've got a storage overhead of 9x, which is expensive. So again, at scale, you've 10x'd your cost.
And what protocol-aware recovery said is, why do you need local redundancy if you already have replication?
You have global replication. Why are we still writing LSM trees as one component and consensus protocols, you know, in different
departments of, you know, the comp sci building? Let's integrate these worlds. Let's say, you know, if you
find a local storage fault, you are allowed to use your consensus protocol to recover.
But actually most systems don't do this. So TigerBeetle, you know, if it
finds a local storage fault, it'll actually use the consensus protocol to heal itself,
and everything is integrated, so that you get this fault tolerance for just 3x overhead.
Gotcha. And so just so I understand, that's a lot. Yeah, no, this is great. So protocol-aware is
just saying, like, you can't have your storage over here and your sort of consensus system over here, isolated from each other. They really
need to be integrated, because if one fails, like, you need to understand how to recover from that,
you know, in a coherent way.
Exactly. Because what would happen is, you know, the local storage
engines, they would see the corrupt disk sector in their log. They would say, oh, I do have
checksums, yes. And I see the checksum doesn't match. And they would then say, therefore,
the power must have failed while I was writing to my log. And they truncate the log back in time,
which is fine if that was the case. But if it's actually bit rot in the middle of your log and you truncate, you're actually
truncating committed transactions on that local node.
And in the context of consensus, you're truncating your votes.
So now you've just gone into split brain.
So that is the problem.
So this protocol is two things.
You can't be correct as a distributed system
if your storage is not integrated with consensus.
You cannot be correct.
It's not safe if you assume storage faults
and you're not running on ZFS RAID.
So you could solve it with ZFS RAID, but again, efficiency.
The second contribution of the paper was
that you can have higher availability if your local storage can heal itself using global redundancy.
So if you have global redundancy, why not use it to recover from faults?
It's basically distributed RAID.
And that's the protocol-aware recovery paper.
It's like doing distributed RAID, which is fun.
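(A toy C sketch of that recovery idea. The storage and replication helpers here are stand-in stubs invented for illustration, not TigerBeetle's actual design: on a checksum mismatch, the node asks the rest of the cluster for the entry instead of truncating its log.)

```c
/* Toy sketch: checksum the local log entry; on a mismatch, repair from the
 * replicas instead of truncating. Storage and network are stub functions. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t op;            /* log position */
    uint64_t checksum;      /* stored checksum of the body */
    uint8_t  body[64];
} LogEntry;

static uint64_t checksum_of(const uint8_t *body, size_t len) {
    uint64_t sum = 0;       /* toy checksum, stands in for a real hash */
    for (size_t i = 0; i < len; i++) sum = sum * 31 + body[i];
    return sum;
}

/* Stub: pretend the local copy of entry 7 has a corrupt sector. */
static bool read_local_entry(uint64_t op, LogEntry *out) {
    out->op = op;
    memset(out->body, (int)(op & 0xFF), sizeof(out->body));
    out->checksum = checksum_of(out->body, sizeof(out->body));
    if (op == 7) out->body[3] ^= 0xFF;
    return true;
}

/* Stub: in a real system this is a repair request through consensus. */
static bool request_entry_from_replicas(uint64_t op, LogEntry *out) {
    out->op = op;
    memset(out->body, (int)(op & 0xFF), sizeof(out->body));
    out->checksum = checksum_of(out->body, sizeof(out->body));
    return true;
}

static bool recover_entry(uint64_t op, LogEntry *out) {
    if (read_local_entry(op, out) &&
        checksum_of(out->body, sizeof(out->body)) == out->checksum) {
        return true;        /* local copy is intact */
    }
    /* Local storage fault: heal from global redundancy, never truncate. */
    printf("op %llu corrupt locally, repairing from replicas\n",
           (unsigned long long)op);
    return request_entry_from_replicas(op, out);
}

int main(void) {
    LogEntry e;
    for (uint64_t op = 0; op < 10; op++) {
        if (!recover_entry(op, &e)) return 1;
    }
    printf("log recovered without truncation\n");
    return 0;
}
```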
Okay, so that's a bunch of background on, I'd say,
research or tech improvements or things like that. And now just like getting into
what are the idiosyncratic ways about how you build Tiger Beetle and the Tiger style and all
that sort of stuff. And I think sort of flowing directly from that protocol aware recovery,
like, Hey, you're building your own consensus and storage that need to, you know, be tightly integrated and know about each other. And like you're saying,
people want to see a lot of history. They want to see 30 years of, um, testing and reliability on
that. So tell me about the VOPR, and what the VOPR is doing, and how that's sort
of helping you. Okay, cool. So again, it comes back to that question, you know, it's not enough to be as safe as something
that's 30 years tried and tested. We actually realized we'd still have people nervous.
But the good news was, UW-Madison had done this. You had this great research just waiting to be implemented, you know, from the past, since 2008, these amazing
papers. They used to win best paper every year at FAST, and in some years I think they didn't present,
to give other people a chance. Well, that's our joke, not theirs. But it's
amazing research. And ZFS had done this already. And we were like, well, let's just go and
do this because we've got the opportunity.
The timing is perfect.
We're just lucky.
And so we could actually be so much safer, because we had that 30 years of hindsight.
Now we could solve things that, for Postgres, were really difficult.
They're still trying to retrofit direct IO because it was never designed for asynchronous direct IO. But we designed from the very beginning for these things,
io_uring, direct IO. So it was easy for TigerBeetle. It was just the easiest way to do it.
But then again, now you're building your own consensus, building your own storage engine.
And we had to do that because the existing off-the-shelf solutions were not safe. They didn't have a storage fault model.
And we wanted to just do it the right way, but do it quicker.
So just one more example, you know, where these systems don't follow protocol-aware recovery:
what you will find is they have two write-ahead logs.
So RocksDB or LevelDB or whatever the storage engine is, that has its write-ahead
log. And the consensus protocol, Raft, has its write-ahead log. So you're just halving your throughput
through the write-ahead log. And most systems are like this, you know, off-the-shelf Raft
plus storage engine. But if you integrate the two, you only have one write-ahead log. So
it's nicer. When you're building these systems, you feel better
because you know there's only one write-ahead log
and it's efficient.
You haven't halved your throughput.
You haven't doubled your fsyncs.
But how do you test this?
So there again, it's just timing.
So FoundationDB had done a lot of fuzzing
and also simulation testing, kind of like Jepsen, where you do storage fault injection. If you've got a very nice replicated state machine architecture, you have logical ticks and you're just careful,
like message passing on the network,
you're not leaking IPv4 everywhere,
you're not doing multi-threading in your control plane code,
you're just architecting cleanly,
then you can actually shim
the non-deterministic elements
of thread scheduling or timing or IO.
You can shim those,
but again, because you have storage fault models or network
fault models, like you could do message passing and in your interface have a very narrow interface
and say, we'll send a message to a node and receive a message. Messages can be dropped,
reordered, duplicated, whatever. And you actually make that very explicit. Then you have a very
narrow network interface. But the same with storage. You just say, well, the disk is going to forget or write to the wrong place.
It's fine.
Now you've got two very simple interfaces, then you can shim those, and then you start
to run your whole distributed cluster in a virtual simulation where your network and
storage is deterministic, and you can do fault injection, but
in a way that is repeatable.
If you find a bug, you
can replay it again and again and again
from a random seed. You generate a whole
chaotic universe.
It's like chaos engineering, but it's
deterministic.
It just comes from computer games, you know,
like those randomly generated levels. It all comes from a number, and it explodes out.
But if you do that, you've got this reproducibility. So this is kind of
answering the question: we've got to be much safer, but also, these systems take
10 years or 30 years to build. So how do we build it in three years?
And again, here's the answer. Because normally, if you're testing a distributed system and you have a
bug, it can take you... I've had it before with systems where we knew it was there, and it took us a year
to finally find it and fix it. And, you know, it's there, in a distributed system you just can't
reproduce it. You hunt it for a year, and you can't afford that.
So we thought, well, if we have a deterministic simulation,
you know, testing environment,
we don't have that problem anymore.
And then finally, yeah, the second part of the answer is that if you have a simulation, you can speed up time.
You know, if you have abstracted time,
you can then speed it up.
And that's the big difference
between how we test and how Jepson would test
because Jepson can't speed up time.
But with Jepson,
if you want to find a two-year bug,
you have to run it for two years.
With Tiger Beetle,
we can tick time in a while-true loop.
So we worked it out.
Actually, we end up running
about 700 times faster.
So that's the acceleration factor.
In one second you've done
700 seconds. In three
seconds you've done 39 minutes.
In a day
you've done two years.
And that's on one core.
And then what we do is we run 10 of
these, like you've said, it's the VOPR simulator.
We run 10 of these VOPRs just 24/7.
So every day we do 20 years of real-time testing,
because we can actually play with time.
And the name just comes from WarGames.
That's the ultimate inspiration.
Let's simulate.
In WarGames, there was the WOPR, the War Operation Plan Response.
So we had Viewstamped Replication, and we said, well, let's call it the VOPR,
the Viewstamped Operation Replicator.
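(A toy C sketch of the deterministic-simulation idea, with my own assumptions rather than the real VOPR internals: a single seed drives all fault injection, time is just a loop counter, and a failing run replays exactly by passing the same seed again.)

```c
/* Toy sketch: all chaos is derived from one seed; time is a loop counter. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t rng_state;

static uint64_t rng_next(void) {         /* xorshift64: tiny deterministic PRNG */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(int argc, char **argv) {
    uint64_t seed = (argc > 1) ? strtoull(argv[1], NULL, 10) : 42;
    rng_state = seed ? seed : 1;          /* xorshift state must be nonzero */

    for (uint64_t tick = 0; tick < 10000000; tick++) {
        if (rng_next() % 1000 == 0) {
            /* injected network fault: drop or reorder a message this tick */
        }
        if (rng_next() % 100000 == 0) {
            /* injected storage fault: corrupt a sector on one replica */
        }
        /* step_cluster(tick);      advance every node deterministically   */
        /* check_invariants(tick);  e.g. strict serializability, no split  */
        /*                          brain (hypothetical hooks)             */
    }
    printf("seed %llu: run complete and exactly replayable\n",
           (unsigned long long)seed);
    return 0;
}
```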
But the funny thing, Alex, is, like,
if we didn't implement protocol-aware recovery just right, and there were
cases where we were slightly not
as efficient as we could have been,
the VOPR would find those cases
and say, you know, your
cluster has become unavailable or deadlocked
because of the storage fault sooner than
it should have, or it would find correctness
bugs. And it was
interesting how a simulator
found the research. That was quite amazing,
you know. And only if we implemented protocol-aware recovery could we actually handle it correctly
and with, you know, optimal availability. It's amazing how it all feeds back on each other.
And I remember, like a few months ago, the VOPR had found a bug, and you all, the team, just like
live streamed sort of reproducing it, figuring out what's going on, and it's just so interesting to see that all
work. I guess, like, how often is the VOPR finding a bug? Is that a regular thing, or is that, like,
pretty rare for that to happen at this point? What does that look like?
Yeah, at this point, so it runs, we've got it hooked up, you know, them running 24/7. And if they find something, they automatically classify it as either a violation of strict
serializability, so correctness.
It can know that already.
So it checks strict serializability like Jepsen would, except it can check it as things are
running using a recorded history.
It can also check a lot of things
because it controls the world.
It can actually check a given Tiger Beetle node.
It can go and look at the page cache
that we have in user space,
and it can compare that to disk,
and it can check and verify
that the cache is coherent.
So it can do all these really cool checks.
And it doesn't really find bugs anymore
unless we play with the protocol.
So what's very nice now is we've got a way
to climb mountains safely.
So we can do a lot more climbing.
We can really push the protocol and do very nice, simple things that I think people normally wouldn't
do if they didn't have these safety techniques, because you just can't climb as creatively
as you can if you're safe like this.
So it kind of depends.
If we're doing new protocol work, then we do find bugs. Yeah, and we're still doing a lot of work on the VOPR.
There's still a ton of stuff we can do.
But the first time we switched it on,
I think we built Tiger Beetle for a year.
We always planned it for this testing
so that it was architected for it.
Then it took three weeks to create the VOPR,
the first one.
But when we switched it on, like the sensation of actually accelerating time,
like we fixed 30 bugs in three weeks,
and they were all tough distributed systems bugs,
so we could fix five a day.
Because, you know, that live stream you watched,
that was a three-hour stream, so that was a tough bug.
It still took three hours, but it would have taken a
few months otherwise. But just that feeling that you get, you know. And again, like, thanks to
FoundationDB. Yeah, very cool. Well, I mean, I think the VOPR is so interesting, and there are
so many other, like, unique practices. I love the Tiger Style doc you have, I'll link that in
the show notes, and how you think about assertions and the static memory allocation. Like, lots of cool stuff there. We're getting close on time and I do
want to talk, like, some business stuff just a little bit, because I think it's interesting.
Like, one issue you mentioned is just, like, the trust issue when you're building a database:
how long does it take? You know, it takes a lot of building and then showing that trust and
reliability to get people to move over to it. So we talked about that with the VOPR.
I also want to talk to you just about open source.
So Tiger Beetle is Apache 2.
We've seen a lot of databases or tools lately switch to BSL, the business source license or things like that. How do you think about open source and the BSL?
What do you think about that?
Yeah, thanks. So again, like, it's all about building trust, you know, build trust. So
I studied accounting, and my professors, I'm really thankful, because the one thing they would drill
into us, they would ask the question: what do auditors, you know, the Big Four, what do they sell?
And the answer was trust. And I think in software,
so much comes down to trust. At the end of the day, it's about trust. So you can, I think this
is what people miss is the business side. I think as engineers, we retreat and we say, well, we don't
know much about business. So we're going to choose the license that has business in the title,
because obviously a monopoly is the way to build a great business.
But I think you can build amazing businesses that don't have to be monopolies.
It's not zero-sum.
If you understand that business at the end of the day is about trust, how do you build trust?
And I don't think you build trust by saying to your future customers, hey, you should depend on this critical dependency,
but you know what? It's going to be a monopoly. It's going to be one business that can support it.
I don't think you build trust like that. And I think we should be more, as engineers, I would like us to be... I come from a business background. I think
the way I think about open source licenses is I always ask the question, what's good for business?
Is it free?
Is it permissive?
Because if you can't have a lot of entrepreneurs building businesses on this, it's not a great, in my book, business license.
So that's why I love, I've always loved MIT and I've loved Apache because you can build businesses on these
licenses and it's not zero-sum you know again it's all about trust so you know your customers want to
trust you and they do trust you but they also want to trust that if something happens to your
corporate entity well there's another one you know I mean that that is better for business um if you look at if you look at things
you know as a system you know as an ecosystem um and if you want to build a great business you
can't think only you know of your little castle you have to think of the fields beyond it and
you really want to get your cavalry into the field you don't want to be the fencer digging moats
no one wins a war like that you know you actually want to have the best cavalry in the field you don't want to be defensive digging moats no one wins a war like that you
know you actually want to have the best cavalry in the field just go and innovate make a technical
contribution builds trust this idea that we're going to lob four-year-old pencils over the wall
i i don't know i'm happy you know i've got friends that you know that have you know, I've got friends that, you know, that have, you know, thought they needed to do that.
And that's great.
It's a different philosophy, but I don't think, it's not great for business.
You know, I think you'd get far more business if you're just open source.
And we've also, I mean, we have this legacy, you know, we had people, the previous generation fought hard for open source.
And our generation, I see, you know, people are saying, we say, well,
we're engineers, what do we know about this? But actually, I think, I wish people would just say,
you know what, let's just build trust, and yeah, let's just be good.
Yeah. I mean, that trust is such an interesting one. And is there a way that you can sort of more
permanently lock in some of that trust or things like that? Because
like one thing that's difficult is we've seen companies that talked about open source, talked
a good open source game for a while, and built up that trust, it seemed like. And
then, you know, once it got to day two and things like that, now it's like, hey, now we're BSL going
forward. So like, I mean, it's probably hard to donate your database to like the Linux Foundation,
you know, like that wouldn't be great.
But like, are there ways you can sort of like lock in that trust to where it's not just, you know, subject to the whims of the company and whoever happens to be in leadership, you know, in five, 10 years as well?
Yeah, I think that's also good.
So, like, TigerBeetle came out of the payment switch, which was Apache 2.
So we had to be Apache 2 by contract, so that's
nice, you know, and so we are Apache 2. I think the other thing, often what you find
is, when people bait and switch, it's new management, you know, and maybe they don't understand open
source. Because it's always like, how do people play chess? Are they trying to just
capture, you know, material on the board, like just capture a pawn, or are they actually trying to win the game? And that's always my question. Are they
thinking second order? Because everyone is thinking, you know, BSL, first order,
oh, we'll sleep well at night. Actually, if you look at it, like, Redpanda rewrote Kafka, I don't know
how many years in, seven years in. They didn't even want Kafka's source.
They just said, well, things have changed.
We're going to rewrite it, you know?
So BSL wouldn't have given, not that Kafka was BSL,
it actually doesn't give you the protection, the defense you think,
because by the time someone is going to compete with you,
things have changed.
They'll rewrite.
And again, you know, competition happens at the interface,
not at the implementation. So you can license, this is my feeling, that you can license the
implementation; people always compete on the interface. It's going to be, you know, a Kafka
drop-in replacement. So I'm a huge fan, obviously, of Redpanda, but I think that story is
interesting. Yeah, so I think if you want to build trust,
you also don't want to lock people in because it's kind of hard to sell if you're saying
to someone, I'm going to lock you in.
And there's a better way to do it.
You don't need to.
If you have trust, that's actually where the value is, which I think for engineers resonates
with all of us, because we all want to build trust.
And it's just that we had, you know, a little bit of first-order thinking, but actually
second-order thinking is where it's at.
Yeah. You're talking about selling, and I know you're
coming up on, like, the production release of TigerBeetle. Like, what does selling look like for you?
Is it support contracts? Is it a hosted TigerBeetle? Is it something different? What would that look like? Yeah, so it's trust and time, you
know. And the startups don't have time. Open source is too expensive, it's not free for them, you know.
They need someone to manage it for them because they don't have time to set up managed environments.
A lot of work goes into that. And at scale, it's trust. So you can give someone free open source.
They want to know, who do I call?
So it's all about trust.
You don't need to lock them in.
They want to pay you to be there for them
and have a relationship, a real relationship.
So that's sort of how we think about it, on those two.
And that was my own journey. I didn't know any of this, you know, coming into TigerBeetle.
Like, it still gave me goosebumps, you know, what you can do with all these changes around testing
and safety and performance. But I was still figuring out the open source model, and this
is what I learned, you know: people just want trust, and save them time.
Yep, absolutely. Well, Joran, we're coming up on time. I just love this conversation, which
doesn't surprise me. I think you have, like, one of the highest-signal Twitter accounts. TigerBeetle
has a great newsletter, great blog, all sorts of stuff. Like, I just learned
amazing stuff. And I really think it feels like you're living in
the future, truly, like, just some of the insights you have.
So I really appreciate you coming on.
Like where can people find you if they want to, you know,
learn more about you or about Tiger Beetle?
Yeah, so we, like the whole team, we're very accessible.
Just come and jump in, you know, on GitHub.
Love to see you there on Twitter, Slack,
anywhere you find us.
I'm happy to set up calls to chat.
Yeah, and Alex, thanks to you because, yeah, this has been such a pleasure.
Thanks for having me.
Absolutely.
Thanks, Joran, for coming on.
See you.