Software Huddle - Distributed Financial Databases with Joran Dirk Greef of TigerBeetle
Episode Date: October 24, 2023. In this episode we spoke with Joran Dirk Greef, who's the co-founder at TigerBeetle. TigerBeetle is a financial transactions database that's focused on correctness and safety while hitting orders of magnitude more performance than other solutions in the space. We touch on various topics like what makes TigerBeetle orders of magnitude more performant, io_uring, the choice of Zig for TigerBeetle, protocol-aware recovery, the VOPR and so on.
Transcript
That's why I love, I've always loved MIT and I've loved Apache because you can build businesses on these licenses.
And it's not zero sum.
Again, it's all about trust.
So Operating Systems: Three Easy Pieces.
For some people, it's like K&R C.
It's like up there.
And it's by Remzi and Andrea Arpaci-Dusseau.
And it's like my favorite computer science book. But then in the world of energy, I've kind of been learning this as we got into Tiger
Beetle is that our energy sources are changing.
And that's fascinating because the sun is more transactional when you switch to solar.
Energy prices change every second, every minute.
And you can be so much more efficient if you are able to transact energy at a much higher resolution.
Hey folks, this is Alex DeBrie. I have a great episode for you today. I spoke with Joran Dirk Greef,
who's the co-founder at Tiger Beetle. You know, Tiger Beetle makes this financial transactions
database that's really focused on not only correctness and safety, but also performance,
right? Getting orders of magnitude more performance than other solutions in the space. I think Joran and his team are really interesting just in the
content they put out there and the way they think about developing new databases is really
interesting. A lot of interesting practices, you know, one's called the VOPR. It's this
distributed simulation testing tool that helps them discover distributed systems bugs, safety
bugs, liveness bugs much quicker than they would
in sort of regular testing scenarios and things like that. So I tell Joran a few times here,
I think he's really living in the future just in terms of how he sees, you know, not only what
database needs are going to be, but also development needs and the tools they're building on. So I hope
you enjoy this show. As always, if you have any suggestions or comments or anything like that,
feel free to reach out to me. I'd love to hear about which guests you'd like to see on next.
Other than that, enjoy the show. Joran, welcome to the show.
Hey, Alex. Thanks for having me. Pleasure to be here with you.
Absolutely. You're the CEO and co-founder at Tiger Beetle, which is a new
database for financial transactions, really focusing on performance and speed.
And I'm super excited to talk to you because I sort of feel like you're living in the future in some way.
You have these sort of opinionated, idiosyncratic views about just how things are changing,
how technologies are changing, and what the implications are for databases and how they should be built and used.
So I'm excited to hear about that today.
I guess for the guests, do you want to just
give them a little bit more background
on you and your history?
Yeah, well, thanks.
So nice to hear that.
I kind of feel like I'm just living in the past
and trying to just get a chance to put it into practice.
Over the last 30 years, we've had so much research,
so much hardware advancement.
And looking back on that...
Yeah, so a bit of background on TigerBeetle.
We started the project in July 2020, so just a little over three years ago.
And to put that in perspective, it's interesting because, again, just the timing of things.
So we'll get into all of this hardware research.
How do you build a distributed database today?
And we're just so lucky that you've got new, safer systems languages, Rust and Zig.
So many things have changed.
We've got io_uring.
But at a high level, the big picture of Tiger Beetle is it's something that's fast, but also small.
So what sort of OLTP database do you use today if you want to track financial transactions?
That was the question as we got into Tiger Beetle.
We were working on an open source payment switch, and this was using a traditional 30-year-old OLTP database, MySQL.
And that experience was interesting.
At the end of the day, we realized we could work on this existing system and optimize it for a year with consultants and all of that cost. And maybe we could have an incremental improvement
and this payment switch would be more efficient
and more scalable.
And then we thought, well, you know,
we could put this year to use and see, well,
how would you build an OLTP database today
for tracking financial transactions?
And that became TigerBeetle.
Instead of being incremental, how could we go further?
Typically these systems do a thousand to 10,000 financial transactions a second,
and that's using several hundred dollars a month of cloud hardware. And we wanted to say, well, how can we do a thousand times that?
How can we do a million a second using the same hardware or less?
And then how can we be much safer given Postgres and MySQL are tried and tested for 30 years?
We must have seen where they fail, where they fail the test.
Is that the case today? Are
there situations where these databases are not as durable as we think or as highly available?
That was the timing and the context and the goal. How can we be three orders of magnitude faster, but with the same unit of scale?
So this was really about asking, how do we optimize our unit of scale?
You know, the cloud is already more than 10 years old today.
And so we've actually had a lot of cloud databases that were also designed, you know, more than 10 years ago.
And, you know, how have things changed?
And I think in the past, it was a lot about,
let's just try and scale. Let's forget about cost. Let's just scale. Let's have a design that scales.
And today, I think the question is different. Today, the question that we wanted to ask was,
how can we scale cost efficiently? But the way you can answer that question then is you
can say, well, let's play with the unit of scale. Sure, we could add a thousand more machines and
hope that we get to a million a second, but that's very cost prohibitive. How can you take the same
machine and with that track a million financial transactions a second? So that was the goal of TigerBeetle.
Absolutely.
And are you building for, I guess, sort of like existing financial use cases, like giant
banks and stuff that have these systems that are processing all those transactions and
they need a way to scale better?
Or is it like, hey, we think there's a future out there that there's just like more financial
transactions or just more use cases whether that's
micropayments or, like, you know, in-game payments, things like that, or even, like, usage-based billing
for SaaS? Like, are you building for that new world? Or is it also like the old world really
needs this, they're having trouble scaling?
Yes. I love, I always love the question where, you know, it's either-or, and then you answer it with yes. So it's OR, not XOR, and both. And we came at it from the context of micropayments. So this was a switch that needed to do a lot of payments very cost efficiently, because the value was like $5 or whatever.
So fees become very important at that scale.
But Tiger Beetle, it solves a lot of the same problems.
They're all kind of interrelated.
If you can build something that can do a million a second, you can help the very large-scale systems, or you can trade that performance for a very small
hardware footprint and be very cheap, and it's very easy to operate. So yeah, so both. But we
did come at it not from the crypto world, I've never been involved in that, but from the micro
payments. That was the impetus for TigerBeetle. Because of where the world is going, and this is kind of
interesting, you know, you touched on it, a future world. So gaming, you know,
like you get all these economies in these massive multiplayer games, and we actually had people
reaching out to us
saying that their database was falling over.
We'd have a call and their pager would literally go off.
I kid you not, this actually happened.
At the end of the call, the pager went off.
It said, our ledger database is falling over
because the game was doing so well.
They were doing so many transactions.
But then in the world of energy,
I've kind of been learning this as we got into
TigerBeetle, is that our energy sources are changing. And that's fascinating, you know, because
the sun is more transactional. When you switch to solar, energy prices change every second, every
minute, and you can be so much more efficient if you are able to transact energy at a much higher resolution.
And it turns out existing databases, they struggle to keep up with this more transactional
energy future. So energy systems are looking for Tiger Beetle and then the traditional fintech
world. And then that too is interesting
because so much of that is moving to new technology now.
And then you get this problem
where a lot of open source OLTP isn't really online.
It's not highly available.
It's single node, and it doesn't fail over to multiple data centers.
You need to use proprietary solutions to handle that.
And what we also thought was,
let's see if we can shrink the footprint of the system
so you can run it on a Raspberry Pi.
What would that unlock?
You know, then that's what I'm really excited about.
All the use cases that we can't imagine
because you can just give people, you give people three new orders of scale or efficiency
and then let them come up with...
And again, usage-based billing.
What was interesting I heard,
there was actually a conference a while back
with people like GitHub and places like that.
They were actually sharing hacks for how they do billing
because billing scale has increased
that they can't actually track it all.
So they have to like try and probabilistically charge people
because they can't afford.
Approximate it, yeah.
Yeah, you know, with like serverless billing,
how do you track, you know,
so many seconds in a day and usage that just explodes?
So if you have something that just very easily thinks in a new order of magnitude, it becomes easier. So that was the other goal, you
know: safety, performance, but also operator experience. Let's just have, like, a little
ant that can do a lot of work, and, you know, you sleep well at night too.
Yep, that's amazing.
and that raspberry pi one it kind of reminds me we're seeing like this trend with like SQLite and
Turso and different things that people are doing of just like, hey,
having lots of little tiny local databases or things like that.
So it's, it's cool to see, I don't know,
all these different changes that are happening, you know,
giant databases in the cloud,
but also small local ones as well. So.
Exactly. I think, yeah, it's definitely, like, I love that, you know,
what people are doing like Turso with SQLite.
And I learned
a lot from, you know, from Glauber Costa
when he wrote about io_uring as well,
that made an impact on TigerBeetle,
the ScyllaDB people,
Redpanda,
Martin Thompson of LMAX.
They all, you know, think similarly
but what I love is when people say,
well, let's just allow people to have 50,000 databases.
What will happen when you just think,
you just play with the experiment and tweak the experiment
instead of being incremental,
because it's just more fun and more efficient.
Yep, absolutely.
So you just spoke at QCon, and I've heard just like a lot of
really great things on it from Twitter, from people that were there and things like that.
I can't wait to see it. I don't want to rehash the whole talk, but I do want to get people sort
of interested. And I think also set up the problem you're looking at and how these technical changes
impact it. So like one thing you say is like the OLTP workload is becoming more contentious.
Can you tell me what you mean by that? Yes. So this was something, I think, that was totally
new for me. What's interesting with OLTP, just a bit of backdrop. Well,
let's start at the beginning and then we'll come back, you know, to QCon.
What we saw, looking at this payment switch, analyzing this payment switch,
where are the bottlenecks, and you see that the switch is really trying to do transactions
in the logical sense, business transactions: who, what, when, where, why, how much. It's business, everyday
business. There's counterparties, there's a transaction that happened in the real world.
And that was all the switch was trying to do. And when you trace the physical SQL queries,
the database queries, for every logical real world transaction, there would be 10 to 20
physical SQL queries. To implement something that was quite standardized, because what the switch
was really trying to do, like every business, is record money. You're doing double-entry
accounting, there's debits and credits, it comes down to that. If the database doesn't do it that way to begin with, next month
the auditors will have to convert it into that format anyway.
And so we saw the switch trying to do debit credit.
All the code was built around the generic general purpose database.
And you had this query amplification, which was costing you,
because if your switch can do a thousand a second,
it's doing 10,000 network requests a second behind that, talking to the database. So that's very
tough to scale, because whenever you scale, you just multiply by 10. And typically all
these systems have this. You chat to other people, you know, in India,
working on the massive nationwide UPI switch,
and it crossed 10 billion real-time payments in August.
They can't use OLTP databases anymore
because it just doesn't handle the scale.
So they reinvent OLTP over a durable log, Kafka
or, you know, Redpanda today. And then they do the
transaction processing logic in Java services above Redis, just because it doesn't keep up.
And this was really the problem: how do you do logical transactions when we've got an impedance
mismatch? So coming to the contentious workload, this is the other interesting thing. People, like, I grew up in the era of, I think we both did, you know, eventual consistency and let's scale, let's go horizontal.
People thought Moore's law was dying 10 years ago.
And so, well, we're going to have to scale out. And the problem though is that the workload that these switches do in most
financial transactions is there's only so many bank accounts. And what I mean is that you could
have like a hundred million customers and you could have a thousand machines and partitioning
your customers across those machines fine. You only have so many bank accounts like 10. So if you have a thousand machines and a
thousand shards, all those shards are going to serialize and bottleneck on the one shard that
has your bank account or your fees account. So this is classical, the way the real world works
is you always have on one side of the equation, you can go horizontal, sure.
But on the contra side, there are only so many bank accounts,
so many energy providers, so much inventory in the game.
You're always limited.
So the horizontal scale-out strategy works very well
for S3 or key value storage.
But this is what we started to ask.
Does it still work well today for transactional workloads?
Like, for example, the switch could also be deployed
where there's only four banks.
So you literally only have four accounts,
so you just can't shard it logically.
You know, you could have four machines,
but they're all going to wait on each other.
So it's actually better just to have one machine and scale up.
And so this is the trouble of the contentious OLTP workload.
But it's also, we're starting to get into, what is OLTP?
Is it the logical real transactions or is it database transactions?
Which came first?
Who got the name from whom?
Yeah, interesting. I'm excited to talk a little bit more about how you're solving that issue and what implications it has. But I want to
talk a little bit about some technical stuff because you've talked about changes in hardware
changes in like core software libraries or languages that are, you know, just
changing what's possible and things like that. So maybe walk me through this. This is like an area I try to read up on and stay up on,
but it's just also like so far removed
from what I do day to day.
So I might ask some dumb questions.
But first of all, like, I hear a lot about io_uring.
What is io_uring?
Why is it such a big game changer when it was released?
Yeah, so huge thanks to Glauber.
He wrote this magnificent post on io_uring in early 2020 that tipped the scales. I knew about io_uring, was thinking about it, but that post really, I mean, you've got it. If you want to read and write from the disk or from the network, what's interesting is it's
actually the same essential syscall. You're reading into a buffer that you provide.
So you give the kernel a buffer and you say, read into this buffer from a file descriptor for me.
Or you say to the kernel, here is a buffer that I have, write from this buffer to a file
descriptor, which can be disk or network. This is pretty much it. Then you've got a problem where
on Linux traditionally, you couldn't do that asynchronously. So what I mean is that if you call that syscall,
your program will block while the kernel goes and does that work,
and then it comes back and tells you it's done.
And then obviously, if you're all familiar with JavaScript
and how Node.js works, there was that async revolution
with libuv under Node.js
where it was like, well, we can...
So Linux didn't really have the proper APIs
to do this asynchronously.
So how do you let your program keep running
while that task is off executing?
While you're waiting on the network
or you're waiting on the disk?
Because you want to keep going so that you can do CPU and IO.
You can pipeline them.
Otherwise, you're wasting CPU.
And Linux didn't really have good APIs for this.
And at the same time, people like Martin Thompson were talking about mechanical sympathy
and context switches.
These happen as you make the syscall.
You're switching context.
You're messing with the flow of the CPU.
And I mean, as programmers, we know if someone interrupts you,
it's expensive.
But the CPU, it gets really expensive.
And this was becoming more of a thing
because CPUs were getting faster.
And IO was getting faster too.
So what you saw is that NVMe is so fast that the context switch is becoming as expensive as just doing IO.
So the interface is starting to cost as much as the work itself.
It's not exactly the same, but within the same order.
And so the interface was becoming a problem.
And up until that point, people, like, you know,
if you understand libuv, you have a user space thread pool.
So you have a control plane,
your program, your code is executing.
If you want to do some async IO,
you drop that into your user space thread pool.
That will then do the blocking
syscall to the kernel.
The problem is, yeah, again, context switches,
you know, going across threads.
So io_uring came out just before and solved all of this
by giving very nice ring buffers.
So you have a submission queue ring buffer from user space
to kernel space to send commands.
And then you have a completion queue
going the other direction from kernel space back to user space
to give you the results of these commands.
And it's just shared memory with the kernel.
And it's pretty much efficient.
So no more context switches, no more syscalls.
You just drop things at the head of a queue,
read things off the tail of a queue or vice versa.
And it's a beautiful design.
And it's very sympathetic to the hardware. This
actually isn't the big performance insight for TigerBeetle, though. I think that's the
interesting thing. This is like a marginal, incremental improvement. So it's the right
interface, it's far more efficient. And the reason why we used io_uring is also because it's just so much easier. Because with the Linux APIs, you're actually doing the same thing, whether you work with network or disk, but the Linux APIs for these were totally different.
So you could do async networking, but not async disk.
But with io_uring, I think this is the biggest contribution, is you get a unified API for doing all async
network, async disk. Perfect, beautiful. It's perfectly efficient. It's very simple.
And you have one interface. So that's what I really liked. We didn't have to go and write
libuv again.
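(For readers who want to see the shape of that API, here is a minimal liburing sketch in C, assuming a hypothetical local file called data.bin. It illustrates the submission-queue and completion-queue pattern described above; it is not TigerBeetle's code, and the same submit-and-reap pattern covers sockets as well as files.)

```c
/* Minimal liburing sketch (illustrative only): queue one read, submit, reap. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    /* Create the submission and completion rings shared with the kernel. */
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];

    /* Describe the read in a submission queue entry: no syscall yet. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* One call submits the whole batch of queued operations. */
    io_uring_submit(&ring);

    /* Later, reap the result from the completion queue. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

(With the kernel-side polling setup flags, even the submit call can avoid a syscall, which is the "no more syscalls" point above.)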
Yeah. Okay. So you mentioned that's not TigerBeetle's main performance secret or key there. What is the key that TigerBeetle is using there?
It's maybe 10% here and there, and you add all of that together, and it's just, it's the right way to do it,
but it's also simpler.
So you've got less code if you use io_uring.
It's just, it's much nicer.
What would be the right thing?
And it would be the right,
no technical debt,
but it would actually just be easier and quicker and safer,
simpler.
So that's kind of why we made all the decisions we did. Yeah. it, but it would actually just be easier and quicker and safer and simpler.
That's kind of why we made all the decisions we did.
What was the big performance
win? The thing
that gave us four orders of magnitude,
not three. We actually got four,
I think. That was this insight,
coming back to this query
impedance mismatch, logical versus physical
transactions.
We thought we have to solve this problem
because we're losing an order of magnitude
if your amplification is a factor of 10
So we thought,
well, the information
that you're really tracking is who,
what, when, where, why, how much.
That is in the data plane,
hot data,
but it's actually small.
You could fit that into 128 bytes.
And then a lot of the big data, variable-length data,
that was in the control plane.
That was like, who are our customers?
What's their favorite color?
And that stuff doesn't change.
It's not in the data plane.
But the transactions that they do, that's in the data plane.
So we realized, well, we can separate these.
Before it was just all mushed together.
We said, well, let's separate.
Let's have this idea of transaction processing for our business transactions.
And then we take our business entities out,
because they're not the business events, they're the business entities. Let's put those in a general purpose, or what we call OLGP,
you know, online general purpose processing. Let's have an OLGP control plane, Postgres or MySQL,
they're great at that. And let's have, like, what if we had a real OLTP built today?
And then what we thought was, well, this is what the switch really needs is financial transactions.
It's double entry debit credit.
So let's pack 10,000 debit credits in one meg.
And then in a single database query, like one network round trip, you've executed 10,000 business transactions in one database.
Because then, again, you're playing.
You're taking what was a problem of query amplification.
You're judoing it, like inverting it and saying, well, I see your 10x and I'm going to make it a fraction, so that in one request we've done 10,000.
So this was the motto, you know, let's do more with less or more with the same.
Because that way it's just so nice.
You know, if you need to do a million a second,
I don't know how many network requests that is, you know, but you can work it out, you know, very easily.
It's not 10,000 a second.
Yeah, yeah, very cool.
Yeah, so 100 a second and you're there.
You know, and we all know, you know,
Node.js can do 100 network requests a second.
Not that we wrote it in Node.js.
Yeah, no.
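(To make the batching idea concrete, here is a rough C sketch. The field names and layout are hypothetical, invented for illustration rather than taken from TigerBeetle's actual wire format, but they show the idea: fixed-size 128-byte double-entry transfers, thousands of them packed into a single request of about a megabyte, so one network round trip carries thousands of business transactions.)

```c
/* Hypothetical 128-byte double-entry transfer, invented for illustration;
 * not TigerBeetle's actual layout. Requires a compiler with __int128. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    unsigned __int128 id;                  /* unique transfer id */
    unsigned __int128 debit_account_id;    /* account to debit */
    unsigned __int128 credit_account_id;   /* account to credit */
    uint64_t amount;                       /* how much: the hot data plane */
    uint64_t timestamp;
    uint64_t user_data;
    uint32_t ledger;
    uint16_t code;
    uint16_t flags;
    uint8_t  reserved[48];                 /* pad to a fixed 128 bytes */
} Transfer;

/* Thousands of transfers ride in one request of about a megabyte, so a
 * single network round trip executes the whole batch instead of issuing
 * 10 to 20 SQL queries per business transaction. */
enum { BATCH_SIZE = 8192 };
typedef struct {
    Transfer transfers[BATCH_SIZE];        /* 8192 * 128 bytes = 1 MiB */
} TransferBatch;

int main(void) {
    _Static_assert(sizeof(Transfer) == 128, "transfer must stay 128 bytes");
    _Static_assert(sizeof(TransferBatch) == 1024 * 1024, "batch is 1 MiB");
    return 0;
}
```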
Speaking of, what are you using?
I see you're using Zig.
Tell me about the choice of Zig for Tiger Beetle.
Yeah, yeah.
So that, yeah, like I said,
so that was the big win,
you know, just fixing the interface.
Kind of like io_uring all over again,
you know, the interface is the problem,
how we talk to the database.
And yeah, just to add a little bit more there,
well, we can come back to this.
I think OLAP, you know,
if you compare OLAP and OLTP,
you get a similar answer.
But on the language: we prototyped TigerBeetle in five hours,
just a quick performance sketch of network protocol,
fsync, checksumming, what could the whole thing do?
And we saw, well, this could do like 400,000 a second,
just fixing this impedance mismatch.
And this also solves the contention problem,
interestingly enough,
because it gets rid of row locks
within the database.
But then
coming to language now, we thought
again, what's the
right thing? Should we invest
in systems languages of the last
30 years, like C or C++?
And obviously
writing a distributed database
like Tiger Beetle, it's a big investment.
So it's going to take time.
We're going to take time to, you know, to build this.
And we're three years in and coming to our first production release.
Which is pretty quick, but still, three years is long.
So we also realized, well, we've got three years of grace.
We don't have to pick something in 2020
that is as widely adopted as Rust was then.
We could actually just pick what we thought
would make the best choice for a database.
And this comes to hardware advances.
Our feeling then was that
context switches are more expensive.
But the other thing was that
CPUs
are really fast. Disks
were becoming really fast with NVMe.
Networks also. You can stripe
the disks if you ever have a
disk bandwidth problem. You can just stripe.
Networks, you know, the pipes are so big. Network latency isn't going to change, but bandwidth is just
increasing. But memory bandwidth, that's the thing, you know. Because if you have
a CPU socket, how do you stripe memory on it, just to increase your memory bandwidth? You can't
really, you know. It's capped.
We thought, well, this is really where, this is going to be our thing.
Numbers every programmer should know.
You've got the different colored columns, like blue, red,
green. I don't know which color
the memory column was. I think it was green,
but we said,
in the past, we used to think of disk seeks
and spinning disk as the problem.
Let's just move into another column. Let's
treat memory bandwidth as the problem.
And then
coming out of that, the decision
was Zig. It could
have been Rust,
but we wanted a single-threaded
control plane, because
we thought, in our context, we don't
actually want multi-threading anymore,
fearless concurrency. We just wanted a simple single-threaded control plane. It's easier.
So the borrow checker could still have helped us with logical data races on a single thread with concurrency, but the memory efficiency sort of won the day, because Zig is obviously great. You can choose your allocator
and you can be very specific around alignment
in the type system.
We wanted to do direct IO
because again, you can save memory copies
to the page cache
because you don't use the page cache.
So it was all about memory,
like zero copy, zero deserialization.
These are again, those cherries,
but they add up.
And so we picked
Zig because it's just so precise.
We didn't see anything
to us that seemed
more efficient for working with memory
like just being really sharp
around memory layout.
But the other thing is Zig fixed all the
problems with C. So C would have been
the other choice because we couldn't choose Rust
because we had to handle allocation failure.
That was like a thing.
We adopted NASA's Power of Ten rules
for safety critical code.
One of the rules is static memory allocation,
which we don't see in the C sense,
you know, in your stack.
We see that as, okay,
you can use dynamic allocation at startup,
but there's an initialization
where you allocate everything you need.
And then at runtime,
there's no more malloc or free.
So everything is statically known,
like in the logical sense of the word.
So all your memory is statically allocated.
You have limits on everything.
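(A toy C sketch of that discipline, under my own assumptions rather than TigerBeetle's actual code: every limit is fixed up front, all memory is acquired once at startup, allocation failure is handled once at that point, and the steady-state loop never calls malloc or free.)

```c
/* Toy sketch of "static allocation" in the logical sense: fixed limits,
 * one allocation pass at startup, no malloc or free at steady state. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define MAX_CONNECTIONS 64            /* every limit is decided up front */
#define MAX_TRANSFERS_PER_BATCH 8192
#define TRANSFER_SIZE 128

typedef struct {
    unsigned char *transfer_buffer;   /* sized once at initialization */
    size_t transfer_buffer_size;
} Runtime;

/* All allocation happens here; failure is handled explicitly, once. */
static int runtime_init(Runtime *rt) {
    rt->transfer_buffer_size =
        (size_t)MAX_CONNECTIONS * MAX_TRANSFERS_PER_BATCH * TRANSFER_SIZE;
    rt->transfer_buffer = malloc(rt->transfer_buffer_size);
    if (rt->transfer_buffer == NULL) return -1;
    memset(rt->transfer_buffer, 0, rt->transfer_buffer_size);
    return 0;
}

int main(void) {
    Runtime rt;
    if (runtime_init(&rt) != 0) {
        fprintf(stderr, "not enough memory for configured limits\n");
        return 1;
    }
    /* ... event loop: reuse rt.transfer_buffer forever, never allocate ... */
    free(rt.transfer_buffer);         /* only released at shutdown */
    return 0;
}
```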
But Zig again was better for this
because we could handle memory allocation failure at startup,
and at runtime, you know, we could use the standard library
and know that there's going to be no hidden allocation.
So we actually didn't want abstractions.
You know, abstractions always carry a cost.
They've got a probability of being needed. So the zero-cost thing, I was nervous about that.
We wanted fewer abstractions, a minimum of excellent abstractions is how we put it. And
Zig was very explicit about all this. So those are just some of the reasons, but it just seemed, you know, such a
small language, and comptime is so
powerful. Again,
like, for safety, macros can be really
dangerous.
We didn't want macros, we wanted
comptime, and we wanted
to do shared memory with the kernel,
and we use
a lot of intrusive
data structures in Tiger Beetle because it is simpler and safer in a logical sense.
And Zig is just very natural for doing that.
And the code is very readable.
The community around that switch happened to be JavaScript developers.
And we said, well, hey, Zig reads like TypeScript.
They wouldn't be able to write it, but they could
understand it. So that was quite a
big win.
Very cool. Another thing I hear
you and the TigerBeetle folks
talk about a lot is protocol-aware
recovery.
Maybe just tell me a little bit of background behind that paper,
what the implications are and what that means
for you building a database.
Yeah, thanks, Alex.
This was, again, just being lucky with timing.
Because we thought, well, we can solve the interface.
You've got low latency batching.
You actually get better latency because your system has more throughput,
so there's less queueing.
It's kind of counterintuitive,
but you have a new database interface
and you can just do 10,000 per query.
We've got this really efficient,
memory-efficient language.
We've got IE ring.
But we realized,
like if you're going to say to a payment switch,
okay, we've got a new OLTP database for you.
It's going to be as safe
as something that's 30 years tried and tested.
They're still going to be nervous.
They're going to say, well, see us in 30 years.
It's great, you know, but really it's table stakes.
But we realized it's not enough
to be as safe as Postgres or MySQL.
But then my hobby was always following the FAST conference in Santa Clara. And that was all the storage fault research from around 2008, over the last decade, coming from Wisconsin-Madison.
So I don't know if you've read OSTEP, Operating Systems: Three Easy Pieces.
No, I haven't read that one.
Write it down. I'll link that in the show notes.
Okay, cool.
So Operating Systems: Three Easy Pieces.
For some people, it's like K&R C.
It's like up there.
And it's by Remzi and Andrea
Arpaci-Dusseau. And it's like
my favorite computer science book.
I actually wrote to Remzi. I said, you know, it used to be
K&R and now it's yours.
And I meant it because it's got so many great research papers
at the end of every chapter.
They've got these great examples all the way through
how to understand an operating system in terms of memory,
concurrency, and storage, I believe.
Yeah, maybe also the process model, I'm not sure. But those same authors have sort of
been the vanguard of all the storage fault research coming out of Wisconsin-Madison. So
looking into file systems, all the file system bugs: do file systems give us durability? The answer
is no, they don't. ZFS does, that's sort of the answer. And again, for databases. And there,
it was very interesting because you will know, 2018, there was FsyncGate. So Craig Ringer,
he ran into real data loss actually because of Postgres. There was a latent sector error on the disk, like a temporary IO error,
but he actually lost data
because of a routine fault
that could have been correctly handled by the database.
And the kernel was also to blame,
again, the interface was surprising,
but still, you know, with direct IO
this could have been avoided.
And so that was 2018.
And then it affected almost everybody.
MySQL, I think SQLite, Postgres.
And everybody patched it by crashing the database.
And at startup, they come back and read the write-ahead log and recover from the log.
That's how they solved FsyncGate.
And then in 2020, literally, I think a month before the design of TigerBeetle,
UW-Madison came out with a paper, Can Applications Recover from fsync Failures?
And they looked into FsyncGate and asked the question, did the fix work? And it didn't.
And the reason is because the databases would come back after the crash, recover from the
write-ahead log, but they didn't use direct IO.
So they didn't realize they were actually reading
from the kernel page cache in memory
rather than recovering from what's durably on disk.
So the database is now externalizing decisions to users
and saying, okay, I committed your transaction.
And actually that transaction data was only in the page cache,
never durable.
So still the same problem.
And Postgres, you know, in 2018, they immediately started on direct IO.
It's still not in, you know.
Some of it is in, and I'm not sure how far now, but as of the last few months,
it's still, like, sort of preview mode, you know, getting into production.
So this is, you know, how do you fix FsyncGate?
You really need direct IO.
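(A minimal C sketch of those two lessons, with an assumed file name and block size for illustration: open the data file with O_DIRECT so recovery reads come from the disk rather than the kernel page cache, and treat a failed fsync as fatal, crashing and recovering from the log, instead of retrying and carrying on.)

```c
/* Sketch only: assumed file name, 4 KiB block size, Linux-specific O_DIRECT. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* O_DIRECT: reads and writes bypass the page cache, so what you read
     * back after a crash is what is actually on disk. */
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires block-aligned buffers, lengths, and offsets. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 0, 4096);

    if (write(fd, buf, 4096) != 4096) { perror("write"); abort(); }

    if (fsync(fd) != 0) {
        /* Do not retry: the dirty page may already be gone. Crash, then
         * recover from what is durably on disk. */
        perror("fsync");
        abort();
    }

    close(fd);
    free(buf);
    return 0;
}
```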
But then the same people from Madison, that same year, in 2018, they had this paper, like you mentioned, Protocol-Aware Recovery for Consensus-Based Storage. And this was asking,
so FsyncGate was about a single node database,
like MySQL, Postgres, one machine,
and you have just one sector fail,
and you see the database escalates that into data loss.
And protocol-aware recovery took that storage fault model,
which basically just says disks
do fail. ZFS
has been leading the charge there.
And protocol-aware recovery
took this world of storage research
which was like one discipline, one
domain, and then you had the world
of distributed systems with a whole other
department.
And in the distributed systems world,
you had Paxos and Raft, and they would have
formal proof. And the formal proof
would say, you can't use the
clock. You have to have
logical timestamps.
You can't depend on physical time.
And you have to...
You can't trust the network.
The network has a fault model.
Packets are dropped.
Even if you use TCP, you can't get total order out of TCP.
Otherwise, we wouldn't need consensus.
But TCP connections get dropped, and then all bets are off.
So then comes Paxos and Raft.
And they had formal proofs.
And the model was like, yes, network fails.
Clock can't be trusted.
You know, you can't, you've got to solve this problem without using clocks.
And then processes crash.
And that was it.
And then they said for storage, yes, we're going to depend on storage to write the votes.
You know, each node is going to remember the vote that it makes.
It's going to record that vote to stable storage that doesn't backtrack.
You know, once you make a promise, you've got to keep that promise. Otherwise,
you're going to split-brain. So that's how all these systems worked in the distributed world.
And UW-Madison said, well, it's interesting, you know, that you say storage is stable,
that it's perfect and pristine because it's not actually
the reality. Because users,
again, we've already seen this with
Postgres, with FsyncGate, you get
real data loss. So they ask
the question, well, let's take
a replicated database,
any of the well-known
ones, let's put a
single sector fault
on one machine. So you've got a cluster of five.
We'll just corrupt one sector in the log of one of the machines. And then they found that
that would actually result in global cluster data loss. And that really surprised everybody
because people thought, well, redundancy implies fault tolerance. And
UW-Madison came out with a paper before protocol-aware recovery, and they said, well,
redundancy does not imply fault tolerance. And then protocol-aware recovery was, well,
how do you actually build a distributed database if you assume there is also a storage fault?
If you're going to have a formal proof, is there also a storage fault model? So we took protocol-aware recovery, and there are two big papers behind TigerBeetle.
And there are a few, but there's Viewstamped Replication as our consensus protocol, because that was the only one that actually worked completely in memory.
So I thought, well, this is not only the first consensus protocol, a year before Paxos. To me,
the 2012 paper was easier to understand than Raft, but it also seemed to have good sensibility
that you shouldn't trust the disk. It's just easier to get this right if you don't have to
trust the disk to be perfect. So we started with that. But the second paper that was key was protocol-aware recovery because that
showed us how we could
actually do storage,
stable storage for our consensus
protocol. We obviously don't
want to be in memory. We want to get
it to disk at some point.
How do you do this?
The nice
thing with this, what is really cool,
it's so obvious. People say, well, why not just
run ZFS RAID underneath each node in your cluster? Use local redundancy; ZFS RAID would solve
your storage fault model. The problem is, you know, if you have a cluster of three and you do 3x RAID
on each, you've got a storage overhead of 9x, which is expensive. So again, at scale, you've 10x'd your cost.
And what protocol-aware recovery said is, why do you need local redundancy if you already have replication?
You have global replication. Why are we still writing LSM trees as one component and consensus protocols, you know, in different
departments of, you know, the comp sci building? Let's integrate these worlds. Let's say, you know, if you
find a local storage fault, you are allowed to use your consensus protocol to recover.
But actually most systems don't do this. So TigerBeetle, you know, if it
finds a local storage fault, it'll actually use the consensus protocol to heal itself,
and everything is integrated, so that you get this fault tolerance for just 3x overhead.
Gotcha. And so just so I understand, that's a lot. Yeah, no, this is great. So protocol-aware is
just saying, like, you can't have your storage over here and your sort of consensus system over here, isolated from each other. They really
need to be integrated, because if one fails, like, you need to understand how to recover from that,
you know, in a coherent way.
Exactly. Because what would happen is, you know, the local storage
engines, they would see the corrupt disk sector in their log. They would say, oh, I do have
checksums, yes. And I see the checksum doesn't match. And they would then say, therefore,
the power must have failed while I was writing to my log. And they truncate the log back in time,
which is fine if that was the case. But if it's actually bit rot in the middle of your log and you truncate, you're actually
truncating committed transactions on that local node.
And in the context of consensus, you're truncating your votes.
So now you've just gone into split brain.
So that is the problem.
So this protocol is two things.
You can't be correct as a distributed system
if your storage is not integrated with consensus.
You cannot be correct.
It's not safe if you assume storage faults
and you're not running on ZFS RAID.
So you could solve it with ZFS RAID, but again, efficiency.
The second contribution of the paper was
that you can have higher availability if your local storage can heal itself using global redundancy.
So if you have global redundancy, why not use it to recover from faults?
It's basically distributed RAID.
And that's the protocol-aware recovery paper.
It's like doing distributed RAID, which is fun.
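(A toy C sketch of that recovery idea. The storage and replication helpers here are stand-in stubs invented for illustration, not TigerBeetle's actual design: on a checksum mismatch, the node asks the rest of the cluster for the entry instead of truncating its log.)

```c
/* Toy sketch: checksum the local log entry; on a mismatch, repair from the
 * replicas instead of truncating. Storage and network are stub functions. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t op;            /* log position */
    uint64_t checksum;      /* stored checksum of the body */
    uint8_t  body[64];
} LogEntry;

static uint64_t checksum_of(const uint8_t *body, size_t len) {
    uint64_t sum = 0;       /* toy checksum, stands in for a real hash */
    for (size_t i = 0; i < len; i++) sum = sum * 31 + body[i];
    return sum;
}

/* Stub: pretend the local copy of entry 7 has a corrupt sector. */
static bool read_local_entry(uint64_t op, LogEntry *out) {
    out->op = op;
    memset(out->body, (int)(op & 0xFF), sizeof(out->body));
    out->checksum = checksum_of(out->body, sizeof(out->body));
    if (op == 7) out->body[3] ^= 0xFF;
    return true;
}

/* Stub: in a real system this is a repair request through consensus. */
static bool request_entry_from_replicas(uint64_t op, LogEntry *out) {
    out->op = op;
    memset(out->body, (int)(op & 0xFF), sizeof(out->body));
    out->checksum = checksum_of(out->body, sizeof(out->body));
    return true;
}

static bool recover_entry(uint64_t op, LogEntry *out) {
    if (read_local_entry(op, out) &&
        checksum_of(out->body, sizeof(out->body)) == out->checksum) {
        return true;        /* local copy is intact */
    }
    /* Local storage fault: heal from global redundancy, never truncate. */
    printf("op %llu corrupt locally, repairing from replicas\n",
           (unsigned long long)op);
    return request_entry_from_replicas(op, out);
}

int main(void) {
    LogEntry e;
    for (uint64_t op = 0; op < 10; op++) {
        if (!recover_entry(op, &e)) return 1;
    }
    printf("log recovered without truncation\n");
    return 0;
}
```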
Okay, so that's a bunch of background on, I'd say,
research or tech improvements or things like that. And now just like getting into
what are the idiosyncratic ways about how you build Tiger Beetle and the Tiger style and all
that sort of stuff. And I think sort of flowing directly from that protocol aware recovery,
like, Hey, you're building your own consensus and storage that need to, you know, be tightly integrated and know about each other. And like you're saying,
people want to see a lot of history. They want to see 30 years of, um, testing and reliability on
that. So tell me about the VOPR, and what the VOPR is doing, and how that's sort
of helping you. Okay, cool. So again, it comes back to that question, you know, it's not enough to be as safe as something
that's 30 years tried and tested. We actually realized we'd still have people nervous.
But the good news was, UW-Madison had done this. You had this great research just waiting to be implemented, you know, from the past, since 2008, these amazing
papers. They used to win best paper every year at FAST, and in some years I think they didn't present,
to give other people a chance. Well, that's our joke, not theirs. But it's
amazing research. And ZFS had done this already. And we were like, well, let's just go and
do this because we've got the opportunity.
The timing is perfect.
We're just lucky.
And so we could actually be so much safer, because we had that 30 years of hindsight.
Now we could solve things that, for Postgres, were really difficult.
They're still trying to retrofit direct IO because it was never designed for asynchronous direct IO. But we designed from the very beginning for these things,
io_uring, direct IO. So it was easy for TigerBeetle. It was just the easiest way to do it.
But then again, now you're building your own consensus, building your own storage engine.
And we had to do that because the existing off-the-shelf solutions were not safe. They didn't have a storage fault model.
And we wanted to just do it the right way, but do it quicker.
So just one more example, you know, where these systems don't follow protocol-aware recovery:
what you will find is they have two write-ahead logs.
So RocksDB or LevelDB or whatever the storage engine is, that has its write-ahead
log. And the consensus protocol, Raft, has its write-ahead log. So you're just halving your throughput
through the write-ahead log. And most systems are like this, you know, off-the-shelf Raft
plus storage engine. But if you integrate the two, you only have one write-ahead log. So
it's nicer. When you're building these systems, you feel better
because you know there's only one write-ahead log
and it's efficient.
You haven't halved your throughput.
You haven't doubled your fsyncs.
But how do you test this?
So there again, it's just timing.
So FoundationDB had done a lot of fuzzing
and also simulation testing, kind of like Jepsen, where you do storage fault injection. If you've got a very nice replicated state machine architecture, you have logical ticks and you're just careful,
like message passing on the network,
you're not leaking IPv4 everywhere,
you're not doing multi-threading in your control plane code,
you're just architecting cleanly,
then you can actually shim
the non-deterministic elements
of thread scheduling or timing or IO.
You can shim those,
but again, because you have storage fault models or network
fault models, like you could do message passing and in your interface have a very narrow interface
and say, we'll send a message to a node and receive a message. Messages can be dropped,
reordered, duplicated, whatever. And you actually make that very explicit. Then you have a very
narrow network interface. But the same with storage. You just say, well, the disk is going to forget or write to the wrong place.
It's fine.
Now you've got two very simple interfaces, then you can shim those, and then you start
to run your whole distributed cluster in a virtual simulation where your network and
storage is deterministic, and you can do fault injection, but
in a way that is repeatable.
If you find a bug, you
can replay it again and again and again
from a random seed. You generate a whole
chaotic universe.
It's like chaos engineering, but it's
deterministic.
It just comes from computer games, you know,
like those randomly generated levels. It all comes from a number, and it explodes out.
But if you do that, you've got this reproducibility. So this is kind of
answering the question: we've got to be much safer, but also, these systems take
10 years or 30 years to build. So how do we build it in three years?
And again, here's the answer. Because normally, if you're testing a distributed system and you have a
bug, it can take you... I've had it before with systems where we knew it was there, and it took us a year
to finally find it and fix it. And, you know, it's there, in a distributed system you just can't
reproduce it. You hunt it for a year, and you can't afford that.
So we thought, well, if we have a deterministic simulation,
you know, testing environment,
we don't have that problem anymore.
And then finally, yeah, the second part of the answer is that if you have a simulation, you can speed up time.
You know, if you have abstracted time,
you can then speed it up.
And that's the big difference
between how we test and how Jepson would test
because Jepson can't speed up time.
But with Jepson,
if you want to find a two-year bug,
you have to run it for two years.
With Tiger Beetle,
we can tick time in a while-true loop.
So we worked it out.
Actually, we end up running
about 700 times faster.
So that's the acceleration factor.
In one second you've done
700 seconds. In three
seconds you've done 39 minutes.
In a day
you've done two years.
And that's on one core.
And then what we do is we run 10 of
these, like you've said, it's the VOPR simulator.
We run 10 of these VOPRs just 24/7.
So every day we do 20 years of real-time testing,
because we can actually play with time.
And the name just comes from WarGames.
That's the ultimate inspiration.
Let's simulate.
In WarGames, there was the WOPR, the War Operation Plan Response.
So we had Viewstamped Replication, and we said, well, let's call it the VOPR,
the Viewstamped Operation Replicator.
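(A toy C sketch of the deterministic-simulation idea, with my own assumptions rather than the real VOPR internals: a single seed drives all fault injection, time is just a loop counter, and a failing run replays exactly by passing the same seed again.)

```c
/* Toy sketch: all chaos is derived from one seed; time is a loop counter. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t rng_state;

static uint64_t rng_next(void) {         /* xorshift64: tiny deterministic PRNG */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(int argc, char **argv) {
    uint64_t seed = (argc > 1) ? strtoull(argv[1], NULL, 10) : 42;
    rng_state = seed ? seed : 1;          /* xorshift state must be nonzero */

    for (uint64_t tick = 0; tick < 10000000; tick++) {
        if (rng_next() % 1000 == 0) {
            /* injected network fault: drop or reorder a message this tick */
        }
        if (rng_next() % 100000 == 0) {
            /* injected storage fault: corrupt a sector on one replica */
        }
        /* step_cluster(tick);      advance every node deterministically   */
        /* check_invariants(tick);  e.g. strict serializability, no split  */
        /*                          brain (hypothetical hooks)             */
    }
    printf("seed %llu: run complete and exactly replayable\n",
           (unsigned long long)seed);
    return 0;
}
```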
But the funny thing, Alex, is, like,
if we didn't implement protocol-aware recovery just right, and there were
cases where we were slightly not
as efficient as we could have been,
the VOPR would find those cases
and say, you know, your
cluster has become unavailable or deadlocked
because of the storage fault sooner than
it should have, or it would find correctness
bugs. And it was
interesting how a simulator
found the research. That was quite amazing,
you know. And only if we implemented protocol-aware recovery could we actually handle it correctly
and with, you know, optimal availability. It's amazing how it all feeds back on each other.
And I remember, like a few months ago, the VOPR had found a bug, and you all, the team, just like
live streamed sort of reproducing it, figuring out what's going on, and it's just so interesting to see that all
work. I guess, like, how often is the VOPR finding a bug? Is that a regular thing, or is that, like,
pretty rare for that to happen at this point? What does that look like?
Yeah, at this point, so it runs, we've got it hooked up, you know, them running 24/7. And if they find something, they automatically classify it as either a violation of strict
serializability, so correctness.
It can know that already.
So it checks strict serializability like Jepsen would, except it can check it as things are
running using a recorded history.
It can also check a lot of things
because it controls the world.
It can actually check a given Tiger Beetle node.
It can go and look at the page cache
that we have in user space,
and it can compare that to disk,
and it can check and verify
that the cache is coherent.
So it can do all these really cool checks.
And it doesn't really find bugs anymore
unless we play with the protocol.
So what's very nice now is we've got a way
to climb mountains safely.
So we can do a lot more climbing.
We can really push the protocol and do very nice, simple things that I think people normally wouldn't
do if they didn't have these safety techniques, because you just can't climb as creatively
as you can if you're safe like this.
So it kind of depends.
If we're doing new protocol work, then we do find bugs. Yeah, and we're still doing a lot of work on the VOPR.
There's still a ton of stuff we can do.
But the first time we switched it on,
I think we built Tiger Beetle for a year.
We always planned it for this testing
so that it was architected for it.
Then it took three weeks to create the VOPR,
the first one.
But when we switched it on, like the sensation of actually accelerating time,
like we fixed 30 bugs in three weeks,
and they were all tough distributed systems bugs,
so we could fix five a day.
Because, you know, that live stream you watched,
that was a three-hour stream, so that was a tough bug.
It still took three hours, but it would have taken a
few months otherwise. But just that feeling that you get, you know. And again, like, thanks to
FoundationDB. Yeah, very cool. Well, I mean, I think the VOPR is so interesting, and there are
so many other, like, unique practices. I love the Tiger Style doc you have, I'll link that in
the show notes, and how you think about assertions and the static memory allocation. Like, lots of cool stuff there. We're getting close on time and I do
want to talk, like, some business stuff just a little bit, because I think it's interesting.
Like, one issue you mentioned is just, like, the trust issue when you're building a database:
how long does it take? You know, it takes a lot of building and then showing that trust and
reliability to get people to move over to it. So we talked about that with the VOPR.
I also want to talk to you just about open source.
So Tiger Beetle is Apache 2.
We've seen a lot of databases or tools lately switch to BSL, the business source license or things like that. How do you think about open source and the BSL?
What do you think about that?
Yeah, thanks. So again, like, it's all about building trust, you know, build trust. So
I studied accounting, and my professors, I'm really thankful, because the one thing they would drill
into us, they would ask the question: what do auditors, you know, the Big Four, what do they sell?
And the answer was trust. And I think in software,
so much comes down to trust. At the end of the day, it's about trust. So you can, I think this
is what people miss is the business side. I think as engineers, we retreat and we say, well, we don't
know much about business. So we're going to choose the license that has business in the title,
because obviously a monopoly is the way to build a great business.
But I think you can build amazing businesses that don't have to be monopolies.
It's not zero-sum.
If you understand that business at the end of the day is about trust, how do you build trust?
And I don't think you build trust by saying to your future customers, hey, you should depend on this critical dependency,
but you know what? It's going to be a monopoly. It's going to be one business that can support it.
I don't think you build trust like that. And I think we should be more, as engineers, I would like us to be... I come from a business background. I think
the way I think about open source licenses is I always ask the question, what's good for business?
Is it free?
Is it permissive?
Because if you can't have a lot of entrepreneurs building businesses on this, it's not a great, in my book, business license.
So that's why I love, I've always loved MIT and I've loved Apache because you can build businesses on these
licenses and it's not zero-sum you know again it's all about trust so you know your customers want to
trust you and they do trust you but they also want to trust that if something happens to your
corporate entity well there's another one you know I mean that that is better for business um if you look at if you look at things
you know as a system you know as an ecosystem um and if you want to build a great business you
can't think only you know of your little castle you have to think of the fields beyond it and
you really want to get your cavalry into the field you don't want to be the fencer digging moats
no one wins a war like that you know you actually want to have the best cavalry in the field you don't want to be defensive digging moats no one wins a war like that you
know you actually want to have the best cavalry in the field just go and innovate make a technical
contribution builds trust this idea that we're going to lob four-year-old pencils over the wall
i i don't know i'm happy you know i've got friends that you know that have you know, I've got friends that, you know, that have, you know, thought they needed to do that.
And that's great.
It's a different philosophy, but I don't think, it's not great for business.
You know, I think you'd get far more business if you're just open source.
And we've also, I mean, we have this legacy, you know, we had people, the previous generation fought hard for open source.
And our generation, I see, you know, people are saying, we say, well,
we're engineers, what do we know about this? But actually, I think, I wish people would just say,
you know what, let's just build trust, and yeah, let's just be good.
Yeah. I mean, that trust is such an interesting one. And is there a way that you can sort of more
permanently lock in some of that trust or things like that? Because
like one thing that's difficult is we've seen companies that talked about open source, talked
a good open source game for a while, and built up that trust, it seemed like. And
then, you know, once it got to day two and things like that, now it's like, hey, now we're BSL going
forward. So like, I mean, it's probably hard to donate your database to like the Linux Foundation,
you know, like that wouldn't be great.
But like, are there ways you can sort of like lock in that trust to where it's not just, you know, subject to the whims of the company and whoever happens to be in leadership, you know, in five, 10 years as well?
Yeah, I think that's also good.
So, like, TigerBeetle came out of the payment switch, which was Apache 2.
So we had to be Apache 2 by contract, so that's
nice, you know, and so we are Apache 2. I think the other thing, often what you find
is, when people bait and switch, it's new management, you know, and maybe they don't understand open
source. Because it's always like, how do people play chess? Are they trying to just
capture, you know, material on the board, like just capture a pawn, or are they actually trying to win the game? And that's always my question. Are they
thinking second order? Because everyone is thinking, you know, BSL, first order,
oh, we'll sleep well at night. Actually, if you look at it, like, Redpanda rewrote Kafka, I don't know
how many years in, seven years in. They didn't even want Kafka's source.
They just said, well, things have changed.
We're going to rewrite it, you know?
So BSL wouldn't have given, not that Kafka was BSL,
it actually doesn't give you the protection, the defense you think,
because by the time someone is going to compete with you,
things have changed.
They'll rewrite.
And again, you know, competition happens at the interface,
not at the implementation. So you can license, this is my feeling, that you can license the
implementation; people always compete on the interface. It's going to be, you know, a Kafka
drop-in replacement. So I'm a huge fan, obviously, of Redpanda, but I think that story is
interesting. Yeah, so I think if you want to build trust,
you also don't want to lock people in because it's kind of hard to sell if you're saying
to someone, I'm going to lock you in.
And there's a better way to do it.
You don't need to.
If you have trust, that's actually where the value is, which I think for engineers resonates
with all of us, because we all want to build trust.
And it's just that we had, you know, a little bit of first-order thinking, but actually
second-order thinking is where it's at.
Yeah. You're talking about selling, and I know you're
coming up on, like, the production release of TigerBeetle. Like, what does selling look like for you?
Is it support contracts? Is it a hosted TigerBeetle? Is it something different? What would that look like? Yeah, so it's trust and time, you
know. And the startups don't have time. Open source is too expensive, it's not free for them, you know.
They need someone to manage it for them because they don't have time to set up managed environments.
A lot of work goes into that. And at scale, it's trust. So you can give someone free open source.
They want to know, who do I call?
So it's all about trust.
You don't need to lock them in.
They want to pay you to be there for them
and have a relationship, a real relationship.
So that's sort of how we think about it, on those two.
And that was my own journey. I didn't know any of this, you know, coming into TigerBeetle.
Like, it still gave me goosebumps, you know, what you can do with all these changes around testing
and safety and performance. But I was still figuring out the open source model, and this
is what I learned, you know: people just want trust, and save them time.
Yep, absolutely. Well, Joran, we're coming up on time. I just love this conversation, which
doesn't surprise me. I think you have, like, one of the highest-signal Twitter accounts. TigerBeetle
has a great newsletter, great blog, all sorts of stuff. Like, I just learned
amazing stuff. And I really think it feels like you're living in
the future, truly, like, just some of the insights you have.
So I really appreciate you coming on.
Like where can people find you if they want to, you know,
learn more about you or about Tiger Beetle?
Yeah, so we, like the whole team, we're very accessible.
Just come and jump in, you know, on GitHub.
Love to see you there on Twitter, Slack,
anywhere you find us.
I'm happy to set up calls to chat.
Yeah, and Alex, thanks to you because, yeah, this has been such a pleasure.
Thanks for having me.
Absolutely.
Thanks, Joran, for coming on.
See you.