The Changelog: Software Development, Open Source - The 1000x faster financial database (Interview)
Episode Date: April 2, 2025. In July of 2020, Joran Dirk Greef stumbled into a fundamental limitation in the general-purpose database design for transaction processing. This sent him on a path that ended with TigerBeetle, a redesigned distributed database for financial transactions that yielded three orders of magnitude faster OLTP performance over the usual (general-purpose) suspects. On this episode, Joran joins Jerod to explain how TigerBeetle got so fast, to defend its resilience and durability claims as a new market entrant, and to stake his claim at the intersection of open source and business. Oh, plus the age old question: Why Zig?
Transcript
Welcome everyone, I'm Jerod and you are listening to The Changelog, where each and
every week we sit down with the hackers, the leaders, and the innovators of the software
world to pick their brains, to learn from their failures, to be inspired by their accomplishments,
and to have a lot of fun along the way.
In July of 2020, Joran Dirk Greef stumbled into a fundamental limitation in the general purpose
database design for transaction processing. This sent him on a path that ended with TigerBeetle,
a redesigned, distributed database for financial transactions that yielded three orders of
magnitude faster OLTP performance over the usual general purpose suspects.
On this episode, Joran joins me to explain how TigerBeetle got so fast, to defend its
resilience and durability claims as a new market entrant, and to stake his claim at
the intersection of open source and business. Plus the age old question, why Zig?
But first, a quick mention of our partners at fly.io,
the public cloud built for developers who ship.
You ship, don't you?
Then you owe it to yourself to check out fly.io.
Okay, Joran from TigerBeetle on The Changelog.
Let's do it.
Well friends, I'm here with Scott Dietzen,
CEO of Augment Code.
Augment is the first AI coding assistant
that is built for professional software engineers
and large code bases.
That means context-aware; not novice,
but senior-level engineering abilities.
Scott, flex for me: who are you working with?
Who's getting real value from using Augment Code?
So we've had the opportunity to go into hundreds of customers over the course of the past year
and show them how much more AI could do for them.
Companies like Lemonade, companies like Kodem, companies like Lineage and Webflow,
all of
these companies have complex code bases.
If I take Kodem, for example, they help their customers modernize their e-commerce infrastructure.
They're showing up and having to digest code they've never seen before in order to go through
and make these essential changes to it.
We cut their migration time in half because they're able to much more rapidly ramp, find
the areas of the code base,
the customer code base that they need to perfect and update in order to take
advantage of their new features.
And that work gets done dramatically more quickly and predictably as a result.
Okay. That sounds like not novice, right?
Sounds like senior level engineering abilities.
Sounds like serious coding ability required from this type of AI to be
that effective. 100%. You know, these large code bases, when you've got tens of millions of lines
in a code base, you're not going to pass that along as context to a model, right? That
would be so horrifically inefficient. You need to be able to mine the correct subsets of that code base in
order to deliver AI insight to help tackle the problems at hand.
How much better can we make software?
How much wealth can we release
and productivity can we improve
if we can deliver on the promise
of all these feature gaps and tech debt?
AIs love to add code into existing software.
Our dream is an AI that wants to delete code,
make the software more reliable rather than bigger. I think we can improve software quality, liberate ourselves from tech debt and
security gaps and software being hacked and software being fragile and brittle. But there's
a huge opportunity to make software dramatically better. But it's going to take an AI that
understands your software, not one that's a novice. Well, friends, Augment taps into your team's
collective knowledge, your code base,
your documentation, dependencies, the full context.
You don't have to prompt it with context.
It just knows.
Ask it the unknown unknowns and be surprised.
It is the most context aware developer AI
that you can even tap into today.
So you won't just write code faster, you'll build smarter.
It is truly an ask me anything for your code.
It's your deep thinking buddy.
It is your stay in flow antidote.
And the first step is to go to augmentcode.com.
That's A-U-G-M-E-N-T-C-O-D-E.com.
Create your account today, start your free 30 day trial.
No credit card required.
Once again, Augmentcode.com.
I'm joined today by Joran, creator of TigerBeetle.
Welcome to The Changelog.
Hey Jerod, it's so great to be here.
Thanks for having me.
Excited to talk to you.
I've been excited ever since you were mentioned by Glauber Costa on our Turso episode.
He spoke very highly of you, very highly of Tiger Beetle.
And I thought, we got to get this guy on the show.
Yeah, so great to hear that.
Glauber's sort of part of the mix
in some of our design decisions for TigerBeetle.
So I've always held him in high regard
and yeah, so love to dive into that later.
But yeah, thanks again.
Absolutely, well, we'd love to hear what you've been up to.
I read on your TigerBeetle blog that you said,
in July of 2020, you stumbled into a fundamental limitation in the general purpose database design
for transaction processing.
Can you tell us that story?
Sure.
So it was the strangest performance optimization challenge that I've had to do.
So I was doing performance engineering at the time, consulting on a central bank switch
with the Gates Foundation.
They had created an open source nonprofit central bank exchange.
And I was brought in to see how can we make it go faster?
So how can we do lots of transactions between banks?
How can Alice pay Bob? And those are little payments. So
this is the classical database problem, a transaction between one or more people. And
if you look at the Postgres docs for Postgres transactions, the canonical example there is Alice pays Bob. Databases need transactions because they need
to record people's transactions. And so the central bank exchange was like a good example
of why you need a database because it's literally money moving from one person to another. And so I came into this project and I had done
a lot of performance optimization work and specifically in Node.js. I'd been involved
in Node.js since Ryan Dahl announced it. I'd kind of already been writing JavaScript on
the server before Node.js using Rhino on the JVM. And I was convinced that JavaScript was going to be server-side.
So I was already writing it and then Ryan came, you know?
And then I was in Node.js for 10 years and doing a lot of performance.
Yeah. So this switch was written in, you know, in actually Node.js
to make it really accessible, open source project.
And so it was like, how do you optimize Node.js?
And I had experience with this.
But the surprise of this whole thing is that,
how do you make these transactions go faster
through the switch?
Inside it's essentially just a general purpose database,
20, 30 years old; MySQL was the one. And it's
very basic. They're just doing like, if you look at those
postgres docs on transactions, they're doing classic debit
credit in a SQL transaction, nothing more. And that's it, you
know, and then I thought, but like, how do you make it go
faster? So my first experiment was, well, let's give
MySQL NVMe.
And it didn't go faster.
The system could only do 76 debit credit
financial transactions a second.
And it didn't go faster.
I thought, okay, that's odd.
Because normally it makes the database go faster, Jerod.
Yeah, throw a little hardware at it, you know, it'll go faster.
Yeah.
So I thought, well, maybe I did it wrong.
You know, maybe it's not NVMe.
Maybe the database is memory bound.
You know, let's give it more cache.
Okay.
So we gave it a lot of RAM and nothing changed.
Like it stayed 76.
And then we thought, okay, there must be like some CPU intensive algorithms. So we profiled flame graphs,
everything. And there was nothing like the CPU is like 1%.
You know, even even the database, there was everywhere you
looked in the system, there was just no bottleneck. And yet the
system could only do 76 TPS. Network also, so what we call the four primary colors,
network, storage, memory, compute,
we gave them all hardware and there was no change.
And this was so puzzling for me
because always with performance work,
there's always some bottleneck somewhere
and you fix that and then you fix the others.
And here it was the strangest problem because you had to optimize a
system where the hardware was doing nothing; it was idle. And people who had worked on other payment systems sort of took
me aside and helped me along, and they said, you know, it's the row locks. And what
they meant was, when you do Alice
pays Bob through a SQL transaction to represent a real world transaction, you go to the database
from the app and you say, select me, Alice and Bob's accounts. I want to get their balances.
And then in the app, you check, you know, does Alice have enough money to pay Bob?
Yes.
Then you do another query back and you say, okay, increment the debits or the credit columns
and also insert a record of the transfer.
So you've only done two SQL queries within one SQL transaction.
Right.
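(For illustration, a minimal Python sketch of the two-round-trip pattern Joran describes, using an in-memory SQLite database standing in for MySQL or Postgres. The schema is hypothetical, and SQLite locks the whole database rather than individual rows, so this only shows the shape of the queries:)

```python
import sqlite3

# Autocommit mode, so we can issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE transfers (debit TEXT, credit TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

amount = 10
conn.execute("BEGIN")  # locks are held from here until COMMIT
# Round trip 1: read the balances back into the application.
balances = dict(conn.execute(
    "SELECT id, balance FROM accounts WHERE id IN ('alice', 'bob')"))
if balances["alice"] >= amount:  # business rule checked in the app
    # Round trip 2: apply the debit and credit, and record the transfer.
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'alice'", (amount,))
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'bob'", (amount,))
    conn.execute("INSERT INTO transfers VALUES ('alice', 'bob', ?)", (amount,))
conn.execute("COMMIT")  # only now are the locks released
```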
Yeah. And this is what I saw.
In reality, the switch was doing 20 SQL queries within a SQL transaction, but they were all
necessary and essential.
And I've since then chatted to lots of other fintech people.
And usually it's like, there's about 10 SQL queries per debit credit.
If you want to do it really well,
you could do it just like two, as I described.
And so actually, this was the problem.
The SQL transaction was holding row locks on the account rows
across the network latency.
And even if that's, say, one millisecond round trip,
and you're doing two round trips, then
you've got locks for a millisecond.
And that means if your transactions are all sequential, your database tops out at 1000
TPS, 1000 debit credits a second.
And obviously, you know, the next question we're going to say is, well, let's go horizontal.
It doesn't change anything, because you still have these row locks.
The other problem that we saw is that there's a power law of contention, like the Pareto
principle, 80% of real world transactions are through 20% of hot accounts.
For example, on the stock exchange,
anyone who's bought shares has probably bought Nvidia
or sold Nvidia, you know?
And so like 90% of trades are Nvidia or the super stocks,
Microsoft and some of the others.
And a bunch of other ones just sit there untraded.
Exactly, then there's the long tail.
But if you're the brokerage behind the stock market,
if you're powering like 30% of NASDAQ's volume,
then most of your SQL transactions
are hitting like only 10 rows, your 10 superstocks.
I mean, how many are there? And so
there are seven. That's it. Yeah. And so that's like the brokerage, but the central bank was
much the same. They only have like typically eight or so big participant banks around the
table. And so all the money movements are literally going through
eight SQL rows.
But even for like, you know, if you're,
say you're building like Earth's biggest bookstore,
you've got the same problem.
You know, you've got the best seller list of books.
And so all your Black Friday transactions are hitting that.
So then we started to see like, this is, you know,
many businesses, many different kinds of industries.
But let's go back to the beginning, back to the question.
And so it was row locks, and we realized it's a fundamental limit.
And it's got nothing to do with the SQL database, you know, whether you pick Postgres or MySQL
or something horizontal or cloud, it actually doesn't matter.
The problem is the interface of row locks across the network.
And that was when we realized like,
actually we can't go faster.
We need a transactions database,
not a general purpose database.
So that was the impetus.
Okay, so here in 2020, you are trying to basically squeeze
more transactions per second out of MySQL
for kind of a specific type of transaction, right?
Like you are literally meaning financial transactions,
not SQL transactions, which can be
a bunch of different things that kind of roll out
into a single unit of work.
We talk about a SQL transaction
that could be a lots of different things
and then it gets rolled back or it gets committed.
But you're actually saying like,
we're tracking and doing debits and credits
like financial transactions.
And we can't do these particularly simple type
of SQL transaction any faster.
We can scale horizontally,
but we have the network row locks.
I'm sure there's a way in MySQL to say, well, let's just fly closer to the sun and turn
off row locks or something.
But maybe that's, you're also a central bank.
Yeah.
It's not a great idea.
In Postgres, it's almost the default read committed isolation.
You fly close to the sun just by default, you know?
You actually have to, there's a lot of like tweaking
you need to do just to get it safer.
Yeah, no, I mean, at the end of the day,
you decided this requires a different kind of database.
Like we should have a transactional database.
Is that how you think of it?
Where it's like the core unit of work is a credit
and a debit combined together.
Is that right?
That's right. Yeah. And the reason... I think, you know, I had a little bit of luck
in seeing this because I've, I've been coding for 30 years since I was 11. So now I've given
away my age.
You're dating yourself.
Yeah. And, uh, been coding for 30 years and it's been my passion, you know, and self-taught.
So I didn't actually study computer science because I was already like reading papers
and I thought, well, I'm more like an autodidact.
That's going along, you know, already.
I actually, I've always loved business too.
And people told me, they said, you know, a great subject to study
if you want to understand the schema of business across all sectors, you know, and anywhere
in the world, it's double-entry accounting. You know, there's a quote
from Warren Buffett: accounting is the language of business. And so that's why, you know,
like many, well, yeah,
it's just that's why I went and studied financial accounting. I
majored in that at university, because I had a love for
business and wanted to understand the schema, you
know, general entries, debit credits, because with debit
credit, you, you are doing transactions, you're doing
multi row transactions between, you know, between multiple parties. A debit credit transaction
could actually have many debits on one side and many credits on the other. You're expressing something
essential in life across, and it's not only businesses, you could use this not only for
money, but for inventory stock counts or just counting things, counting API usage or counting kilowatt hours.
It's basically accounting. You're just counting.
So any domain where you need to transact with counts and its quantities,
it could be like valuable or not valuable.
But it's just things moving around. And that's transactions. So yeah, that's sort of what TigerBeetle
was meant for. The canonical, you know,
general purpose example would be debit credits, and so often
we see this example given. But we thought, wouldn't
it be great if you just got a database that did this out the
box, you know, that lets you in one query to the database, execute 8,000
debit credit transactions, instead of having to do 80,000
SQL queries to do the same, you know, so that's one query and
you do 8,000 instead of 80,000 queries. And so that's just
TigerBeetle: you get these debit credits, you know,
so we thought, let's build it in.
It's kind of like financial ACID,
not only row column consistency,
but multi row column consistency.
So like transactional consistency,
double entry consistency, yeah.
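(A hedged sketch of the batching idea just described: many fixed-size debit-credit transfers packed into one request buffer, one round trip. The 32-byte record layout here is made up for illustration; it is not TigerBeetle's actual wire format:)

```python
import struct

# Hypothetical record: transfer id, debit account, credit account, amount.
TRANSFER = struct.Struct("<QQQQ")  # four little-endian u64s, 32 bytes

def pack_batch(transfers):
    """Pack many transfers into one contiguous buffer for one round trip."""
    buf = bytearray(TRANSFER.size * len(transfers))
    for i, (tid, debit, credit, amount) in enumerate(transfers):
        TRANSFER.pack_into(buf, i * TRANSFER.size, tid, debit, credit, amount)
    return bytes(buf)

batch = [(i, 1, 2, 10) for i in range(8_000)]  # 8,000 debit-credits
payload = pack_batch(batch)
print(len(payload))  # 256,000 bytes: well under a megabyte, one round trip
```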
Interesting, now I have not been in financial technology as much as you have.
I've definitely dabbled and I've been a, I was a contract software developer for many
years and so I worked on contract and I got into lots of different industries.
And one thing I found is lots of different industries have their little niche databases
that people on the inside know, people on the outside don't, or sometimes they're home
grown.
I imagine on the inside, you know,
these large financial institutions,
there's probably many, maybe homegrown,
maybe there's a proprietary thing that's like,
this is the database we use because of the exact
specific use case that you are speaking to.
Now, am I right about that?
Are there tiger beetles living in large financial
institutions and you decide we want one for the world or why did you decide to build this
thing besides maybe go find one or?
Yeah, so I think you're mostly right. And the reason is because
yes, everybody has to build this database, and that is what you usually see. And I also
had no experience in, you know, in payments specifically, as I worked on the central bank switch.
I came in from the technical performance angle; I had the accounting experience, so I
could see, yes, they were doing debit credits. And, you know, it was the
payments people that also took me aside. And they said, you know, every single fintech, and this
is not only fintech, it applies to any business that needs to record transactions, any money, you
know, but they said, you know, they're all reinventing these debit credit databases.
And typically, what they do is they do it
over a general purpose database. I think what is also interesting is that it's again lucky
timing, because SQL is like 50 years old, and it's been able to power OLTP transaction processing, business transactions, for,
you know, for 50 years.
And so kind of like this with this has always been a latent problem, but we were starting
to hit it now on the system.
And that's because things, I think, are becoming more, you know, the volumes are increasing.
But yeah, back to your question. So I think, yeah, all these fintechs were reinventing these databases, but they were building a database and didn't know it. So they were building a database
in the application around. So that's why, you know, there was this central bank switch.
And you can see the whole thing as a debit credit database. It's like tens of thousands of lines of code around MySQL.
But if they just had an OLTP database
with a debit credit interface, the switch would be so, so simple.
And that was, you know, so that's how this work started.
There was no desire to build a company, build a product, sell something.
I was also passionate about the mission, because the switch was open source, nonprofit. I was told, you know, this
was the mission: the more performant you make the system, the more you
will reduce fees. The developing countries that run this will be able to lift a
few million people above the critical poverty line. And it is literally actually, I don't
know, on the order of about a billion people around the world that don't really have access
to banking like we take for granted in the West, because they just don't have, like in those areas,
they have to walk miles just to move money.
But everybody has a cell phone.
So these systems will be built that you could power people
to move money instantly on cell phones.
But the performance is so important
because as you make things faster, you make them cheaper.
And now for someone who's got,
they're sending like $10 back home. If you're taking,
you know, 5% fees, or 1% fees, that's life changing, because
that 4% means so much to them, they're going to maybe have an
extra meal. And it's going to compound so that this, you know,
this difference in fees, like, that is why they are doing this.
So for me, it was really like, we do want to make it faster.
I'm not just consulting.
There are like a thousand MySQL knobs I could have tuned.
We did double performance.
We could have made it a little bit more.
But it was kind of like, we actually really want to make it really fast.
Well, friends, I'm here with a brand new friend of mine, Cal Galbraith, co-founder and CEO of depo.dev.
Your builds don't have to be slow.
You know that, right?
Build faster, waste less time,
accelerate Docker image builds,
GitHub action builds, and so much more.
So Kyle, we're in the hallway
at our favorite tech conference, and we're talking.
How do you describe Depot to me?
And Depot is a build acceleration platform.
The reason we went and built it is because
we got so fed up and annoyed with slow builds
for Docker image builds, GitHub Action Runners, and so we're relentlessly focused on accelerating
builds.
Today we can make a container image build up to 40 times faster.
We can make a GitHub Action Runner up to 10 times faster.
We just rolled out depot cache.
We essentially bring all of the cache architecture that backs both GitHub Actions and our container
image build product.
And we open it up to other build tools like Bazel and Turborepo,
sccache, Gradle, things like that.
So now we're starting to accelerate more generic types of builds and make those
three to five times faster as well.
And so in simple terms, the way you can think about Depot is it's a build
acceleration platform to save you hundreds of hours of build time per week, no matter whether that's build time that happens
locally, that's build time that happens in a CI environment.
We fundamentally believe that the future we want to build is a future where builds are
effectively near instant, no matter what the build is.
We want to get there by effectively rethinking what a build is and turn
this paradigm on its head and say, hey, a build can actually be fast and consistently fast all the
time. If we build out the compute and the services around that build to actually make it fast and
prioritize performance as a top level entity rather than an afterthought. Yes. Okay, friends.
Save your time, get faster builds with Depot, Docker builds,
faster GitHub Action Runners, and distributed remote caching for Bazel, Go, Gradle, Turborepo, and more.
Depot is on a mission to give you back your dev time and help you get faster build times with a one-line code change.
Learn more at depot.dev. Get started with a seven day free trial.
No credit card required.
Again, depot.dev.
Well, to just spoil the end of the story,
it's not the end, but it's further down.
You did end up designing something
that's a thousand times faster.
So I think you said three orders of magnitude.
That's a big win.
But let's talk about getting there.
You decided like, okay, what we want is a database
that's designed from the ground up
for this kind of transaction.
Then what do you do?
You didn't wanna make a business out of it.
Here, you have a business now, but did you decide I'm going to code up in my free time
and open source database for the world,
or I need to go raise money?
Like, where'd you go from there?
Yeah, so I didn't have really any of those thoughts.
I think the first thought I had was,
I figured we could fix the interface of the database: instead of having row locks in
multiple queries,
we pack a lot of debit credits first class, one database query, back again, and you're done.
You actually do get that 1000X performance, and you only have to improve on 76 logical
financial transactions per second.
I think people sometimes get confused.
They think we're talking row inserts per second, we're actually talking, you know, logical transactions, debit
credit, with the contention row lock problem. If a database is only able to do 100 to 1000
TPS, depending on the latency of the network and the contention, those are the two variables.
You can plug them into Amdahl's law.
Typically if there's a round trip time, 10 milliseconds looking at 100 TPS, 50% contention
looking at 200 TPS, that's your max general purpose limit.
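(Back-of-envelope in Python for the limit described here: if row locks on a hot account are held across the network round trip, throughput is bounded by roughly 1 / (contention fraction x round-trip time). This is a rough reading of the numbers in this passage, not a precise model:)

```python
# Transfers touching a hot account serialize while locks are held across
# the network, so throughput is bounded by the lock hold time.
def max_tps(round_trip_s: float, contention: float) -> float:
    return 1.0 / (contention * round_trip_s)

print(max_tps(0.001, 1.0))  # 1 ms locks, fully sequential -> 1,000 TPS
print(max_tps(0.010, 1.0))  # 10 ms round trip             ->   100 TPS
print(max_tps(0.010, 0.5))  # 10 ms, 50% contention        ->   200 TPS
```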
And so if you want to go a thousand times faster, but you are now packing 10,000 transactions
in one database round trip, it's only one meg of information. You're
amortizing so many costs. It is not actually hard to do 100,000 a second. We actually did
a million a second with primary indexes only. TigerBeetle ships with 20 secondary indexes
turned on. So it does between 100 to 300,000 a second. So it actually is like, you know, I hate benchmark wars.
That's why we never really do it.
We just try to explain to people first principles, why it makes sense.
Um, but that, that was sort of the first step.
Like that was my first feeling.
Like, and we, we only thought that the thousand X because that was what the
we, you keep saying we, but you haven't described your partner.
Okay.
Yeah.
Yeah.
So actually it was me, and then Don Changfoot was working with me,
like guiding me around the system.
I was doing the work, and then Don was helping me, and he was like my, you know,
my bridge and manager.
There were also other people at Coil, my managers there and surrounding
team, really bright people. You know, my manager was the co-chair of the W3C Web Payments
Group. He was the co-chair, you know, but there was Visa and MasterCard and all of them also on
that. But so the people who really understood payments. And yeah,
so I say we because I always like to include the team, you know. So today there's a
team, but I did just create it, you know, and it came out of this experience. So the
first feeling was just, we need to go a thousand times faster, because that is going to drive down the costs.
And because the scale is actually a thousand X more,
countries that deploy these kinds of systems,
you tend to get very small value transactions and massive volume.
In India, for example,
they've done four orders of magnitude scale in 10 years only.
So in other words, 10 years ago, if you picked MySQL today, you need to... You're telling MySQL,
go 10,000 times faster, please. There's no cloud database designed... Most cloud databases were
designed in 2012 or 2015. None of them were designed for like four orders of magnitude increases.
So there's no OLTP database on the earth that can power India or like Brazil's Pix.
These systems are already past three orders.
So we were like, okay, we need at least 1000X because that it's not...
This is something I think people also...
It's helpful to understand.
It's not as impressive as I'm impressed by it,
is what you're saying.
Like there's people who've done 10,000,
you're doing a thousand.
I'm impressed, but India is not impressed, for instance.
Yes, well, you could actually take TigerBeetle today,
run it on your laptop, and you could,
your laptop could power India's transaction volume
going through UPI.
You really could, and it would be pretty easy.
So like it resets the experience, which I think makes sense because we've got 30 years
of hardware and software research advances in how you could build a database today. A lot
has changed. So if you just work with the grain, in those days it was spinning disk,
that was the bottleneck. Today,
it's memory bandwidth.
So all the parameters of the design have inverted.
But again, it's actually not that hard because there's just so much seismic differential
waiting to be, you know, just, I was just excited that we could now do something.
Yeah. Because all the major ones,
the general purpose ones,
MySQL, Postgres, SQLite, et cetera,
they're 30 years old, right?
I mean, they're nineties,
sometimes in certain cases, eighties.
And so they were built
in a different world of constraints.
And they've shown amazing malleability and resilience
and the fact that they're still good enough
for lots
of things today is an amazing feat of engineering by all those teams.
But to start fresh and to start with a, is it a limited domain?
I mean, you're not also bolting on general purpose on top of Tiger Beetle.
You're saying use TigerBeetle for these transactions and then also use something else
for the rest of your workloads.
And that's quite right, yes.
So I think, and I agree with you, Jerod, it's so nice that you said that, because I was wanting to say too, it's a testament to their designers that these,
you know, Postgres powers the world of general purpose transactions, you know,
or general purpose, you know, if you need
to build an app, whatever you, but I think the interesting thing here, like to your point
about fixed domain, for many years, what we call OLGP, like general purpose database,
that was used for OLAP up until 93, 96.
And then OLAP actually became a term,
online analytical processing.
And then because there was this whole paradigm shift
that look, Postgres is great, it's row major.
That's how inside the elephant anatomy,
that's how it stores the information it's designed.
It's sort of designed for a 50-50 read-write split.
And so it has like a B-tree that's maybe a little bit better for OLTP. But the OLAP people
came along and said, no, no, we need to go column major. It's like a paradigm shift.
So the anatomy of DuckDB is totally different to the elephant. And today, we wouldn't say that Postgres is an OLAP database. It's a general purpose database. It's not, you know, it's not Snowflake or BigQuery. And that's fine. And vice versa, you know, OLAP is not always OLGP. But I think today, for the same reason, because of increasing scale,
and because you can specialize, there's like a paradigm shift. OLTP is not just rows and columns,
row major or column major. It's multi-row major. You're doing things across rows with contention.
So the concurrency control of the database inside looks totally different. The anatomy of a TigerBeetle, I would say, if I may.
Sure.
You know, it's different.
So it's a lot similar to Postgres, but it's also,
you can think of like the group commit that MySQL or Postgres has
and just dial it up 10x, make it much bigger group commit, much bigger batches. It's also, again, like you
said, fixed schema. So you don't need all the overhead of serialization. There's no
interprocess communication, just so many things. It was just incredibly fun to design TigerBeetle
because it was such a simple problem, just debit credit at scale. So in other words, I think this
is it. TigerBeetle is not an OLAP database. It's also not a general purpose database. It's just an OLTP
database and that's all. Then Postgres is fantastic. You never replace Postgres.
Just like if you're using OLAP, you don't. But what you would do is OLGP is like your control plane in your stack.
So you put all your entity information.
People call it master data or reference data.
So it's the information, you know, your users table, your user names and addresses.
If you're building Earth's biggest bookstore, user names and addresses,
names of your catalog, your book titles.
Those are not really OLTP problems.
Because that's just, you update your top 10 every now and then, or you know what I mean.
OLTP is like when people move a book into the shopping cart, because that's adjusting
inventory that's held potentially for a shopping cart.
After a certain amount of time, that
debit credit times out and rolls back. Then if people do check out their shopping cart,
those goods and services are moving through logistics supply, all of that to warehouses
and delivery. That's all debit credit. Quantity is moving and it's moving from one warehouse to another, to a driver, to the
home, back again, okay, back again, all debit credit. And then you've also got the checkout
transaction with the money. That's also debit credit. And so like that would be OLTP. And
that's sort of the Black Friday problem. The Black Friday problem isn't, you know, how
do we, you know, store the book catalog or update that because it doesn't change often. So that's a great general purpose problem, just like users, you know,
you don't, they don't change the names often. So your, your database, that's great for variable
length information and a lot, you know, that information is actually very different to
transactional information. Transactional information is very boring essentially.
Just multi-row debit credit.
Right.
And so this multi-row major that you describe.
I made it up, yeah.
But I would always say multi-row,
but let's call it multi-row major.
Yeah, multi-row major versus row major or column major.
That makes sense to me because every single transaction
has, you know, you're gonna assume
that over here there's an addition, over there there's a subtraction. There is this double
entry thing where it's like there's gonna be more than one row in pretty much anything that matters
in TigerBeetle, right? And so that's an interesting way to like think about it and fundamentally
different way like you said versus thinking about it column or based on rows. How does that fundamental primitive manifest itself in your decision making?
I assume there's storage concerns, maybe like memory allocation.
Maybe there's protocol.
I don't know where all that works its way out as you design this system.
How does that affect everything that TigerBeetle is?
Great.
I'm so excited to dive in.
Yeah. So let's do it.
Yeah, yeah. So let's apply this. Like, let's take the concurrency control, for example.
So let's say we've got 8,000 debit credits. So one debit credit would be like,
take something from Alice and give it to Bob.
Then take something from Charlie and give it to Alice, take something
from Bob and give it to Charlie. And you've got 8,000 of these; some of them might be
contingent on the other and you can, you can actually express a few things around this
like, but let's just leave it like that. So you've got 8,000 debit credits in one query.
So the first thing that comes off the wire into the database, the database is going
to write that to the log and it's going to then call F sync. What's great there is you've called
F sync once for 8,000. So it's F sync once for one query, but that is amortized across 8,000
logical transactions and F sync usually has a fixed cost. So it has a
variable cost, but there's also always a fixed cost and it's like half a millisecond or a millisecond
or it depends all on the hardware. But there's always a fixed cost. But now you've amortized
that massively over 8,000.
Typically, the group commit for MySQL might be around 15 things. It's much smaller.
It will amortize fsync, but not by so many orders.
That's the first thing.
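(A quick Python sketch of the amortization arithmetic, assuming a hypothetical 1 ms fixed fsync cost; the actual cost is hardware dependent, as Joran notes:)

```python
FSYNC_FIXED_S = 0.001  # assume ~1 ms fixed fsync cost (hardware dependent)

def fsync_overhead_per_txn(batch_size: int) -> float:
    """Fixed fsync cost divided across every transaction in the batch."""
    return FSYNC_FIXED_S / batch_size

print(fsync_overhead_per_txn(15))     # ~67 microseconds/txn (typical group commit)
print(fsync_overhead_per_txn(8_000))  # ~0.125 microseconds/txn
```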
Now we've got the D in ACID, atomicity, consistency, isolation, durability.
Before the database processes it, it makes the request durable on disk; it calls fsync.
The next thing it does is it'll take these 8,000 and apply it to the tables.
It'll actually update the state on disk.
If it was something like a general purpose database, what it'll do is it'll take the
first debit credit.
It'll read Alice's row and Bob's row, you know, for their accounts.
Then it will update the rows and then it will write them back.
Then it'll go on to the next one, read the two rows, update them,
write them back, so on and so on.
So you're looking at about 16,000 accounts that you
read in and write out. And that typically takes what are called latches, little internal
row locks also inside the database. So 16,000 little micro cuts, you know, also contending there.
And then, but you see, here's the catch is that the domain is usually, again, everybody's buying NVIDIA.
If like 80% of your 8,000 are all NVIDIA,
you're reading NVIDIA, writing NVIDIA, reading NVIDIA,
latching NVIDIA, latching it.
And you're doing it 16,000 times.
And so what Tiger Beetle does instead,
and this is again with anatomy changes,
now we first look through all 8,000 and we prefetch all the data dependencies
in that request. So all the accounts, for example, and so we load NVIDIA once. Then
we load the other six hot super stocks. And then there's a long tail that we load. But
actually there's not many accounts and they're usually hot in L1 cache. So they don't even go to disk because we've got a specialized cache just
for these accounts. Everything in Tiger Beetle is CPU cache line aligned. And we think of
that these days as like 128 bytes. Cache lines are getting bigger; with the M1 it started there.
And everything is cache line aligned.
We don't straddle cache lines, to avoid false sharing.
Everything is zero copy deserialization, fixed size,
very strict alignment, powers of two,
and we don't split machine words.
It doesn't always make a difference on the hardware, but it can.
So these are all the little edges, you know; the cache is optimized.
But okay, let's go back.
So now we load just NVIDIA, the super stocks and a long tail, but actually they're all in
the L1, you know, or in the L2.
And then we've got the data dependencies cached, then we push all 8,000 through, and then we
write them back.
And so it's just, I think you've got it now.
It's just drastically simpler.
It's kind of like you just do less
and that's how you go faster
because you're doing so much less.
You don't have SQL string parsing.
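(A toy Python sketch of the prefetch-then-apply idea just described: load each hot account once, apply the whole batch in memory, then write each account back once. The storage shim and record shapes are hypothetical, not TigerBeetle's internals:)

```python
# Hypothetical storage shim standing in for the disk.
class Storage:
    def __init__(self, rows):
        self.rows = rows
    def read(self, account_id):
        return dict(self.rows[account_id])  # simulate loading from disk
    def write(self, account_id, account):
        self.rows[account_id] = account

def apply_batch(storage, transfers):
    # Pass 1: prefetch all data dependencies; each hot account loads once.
    ids = {t["debit"] for t in transfers} | {t["credit"] for t in transfers}
    accounts = {aid: storage.read(aid) for aid in ids}
    # Pass 2: apply all debit-credits against the cached accounts.
    for t in transfers:
        if accounts[t["debit"]]["balance"] >= t["amount"]:
            accounts[t["debit"]]["balance"] -= t["amount"]
            accounts[t["credit"]]["balance"] += t["amount"]
    # Pass 3: write each touched account back exactly once.
    for aid, account in accounts.items():
        storage.write(aid, account)

storage = Storage({"nvda": {"balance": 0}, "alice": {"balance": 100}})
apply_batch(storage, [{"debit": "alice", "credit": "nvda", "amount": 10}] * 3)
print(storage.rows)  # alice 70, nvda 30: three transfers, one read/write per account
```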
Right.
It almost feels like you're cheating, but you're not.
You're just doing exactly what needs to be done
and nothing more because you're not general purpose, right?
Yeah, exactly.
I often like to say, you know,
we didn't do anything special.
It's kind of embarrassing.
It's so simple.
The thousand X trick... it's, yes, we use io_uring.
We do direct IO, you know, DMA to disk,
zero copy, all the stuff.
Actually, it doesn't make the thousand-X performance difference.
I mean, it does make a difference.
That's why we do it. It makes a 1%, 5%. It all adds up, but that wouldn't get you a thousand
X. Just like stored procedures in a general purpose also wouldn't get you a thousand X.
Stored procedures will get you, you know, you get those 10 SQL queries down to one.
But now you're still doing one for one and you still got, so you only went 10x faster or 10x cheaper. If you really want to go a thousand X, you actually
have to just have first-class debit credit in the network interface, change the concurrency control.
TigerBeetle has its own LSM storage engine, LSM tree.
We designed it from first principles, again, just for OLTP.
So it's actually an LSM forest.
We have an LSM Tree for every object that the database stores,
so transfers and accounts, accounts and transfers between them,
debit credit transfers between accounts.
Those are two trees.
And then all the secondary indexes around each object is like 10 trees and 10 trees.
And so for every size key and every size value, it's in a separate tree.
And again, there's just things you can do now. Like RocksDB or LevelDB, which is what you
find in a lot of
general purpose databases: they use length prefixes for the key, so it's a four byte length prefix,
or it's variable length, but now you've got the CPU branching and costs, etc.
And then there's again a length prefix for the value. But if you know that your secondary indexes are only 8 to 16 bytes, and you then
add the cost of length prefixes, 4 plus 4 is 8, 8 bytes of overhead just to store 16
bytes of actual user data, you're burning a lot of memory bandwidth, a lot of disk bandwidth,
and you're going slower than a database that doesn't have length prefixes at all.
And so in TigerBeetle, each LSM tree stores pure user data.
Literally, we put the key and the value;
there is no length prefix, because each tree
knows at comptime, you know,
what the size of the key is.
So yeah, we can go on and on. But yeah, and then the
consensus protocol, we did similar optimizations there too. Yeah.
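(The length-prefix overhead arithmetic from this passage, as a tiny Python check:)

```python
# 8 bytes of length prefixes (4 for the key, 4 for the value) on a
# 16-byte index entry means 1.5x the memory and disk bandwidth.
def bandwidth_multiplier(user_bytes: int, prefix_bytes: int) -> float:
    return (user_bytes + prefix_bytes) / user_bytes

print(bandwidth_multiplier(16, 8))  # 1.5x: length-prefixed entries
print(bandwidth_multiplier(16, 0))  # 1.0x: pure user data, sizes known at comptime
```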
So hyper tuned for this specific type of workload, which also happens to be one of the most important workloads as well. So let's
imagine that I'm an e-commerce shop, and I'm not gonna roll out Shopify.
I'm gonna do it myself because it makes sense. And so far I'm a Postgres guy
personally. I respect MySQL, but I just use Postgres as my example because
that's what I've been using for 15 years.
So let's say that I've got my Postgres database.
It's been running everything just fine.
It's got my users in there.
It also has all my transactions
and I'm hitting up against scale issues.
This is a good problem for me
because it means I'm doing more sales, right?
And so I'm selling a lot of books
and I'm hitting scale issues
and someone says you should really look into Tiger Beetle
for your transactions specifically.
And I think to myself, Postgres hasn't failed me
for 30 years.
TigerBeetle didn't exist until 2020 at the earliest.
You can tell me when at 1.0
or when you guys actually shipped a product
because conception to now we're five years in.
Am I gonna trust my most precious, my sales,
you know, am I gonna trust that to something that's this new?
I'm sure you face this a lot as you go out
and try to get people to try out TigerBeetle.
What's your response to that concern?
Cause it's a valid concern that, you know,
not super, not super, what's it called?
Battle hardened yet, or maybe it is.
Tell me about that.
Yeah, great.
I love the question.
That was my second thought as TigerBeetle was created.
First thought is we're gonna need-
No one's gonna trust us, why are they gonna trust us?
Yeah, and the second thought was this question,
how can you possibly be as safe as 30 year old software,
software created around Windows 95?
How could you possibly be as safe?
And I think the answer to that is actually a question.
In the same way that how could you be a thousand times faster than something that is 30 years
old, the question is what has changed in the world in the last 30 years?
So much has changed from a performance perspective. And
then when we look to safety, and especially mission critical
safety, so much has changed. So 30 years ago, consensus didn't
really exist in industry. Brian Oki had pioneered consensus a
year before Paxos: Viewstamped Replication, in '88. That was his thesis at MIT with Barbara
Liskov. So consensus did exist. How can you replicate data across data centers for durability,
to actually survive loss of a machine or disk? But that wasn't really in industry 30 years ago.
And so you don't really have first-class replication.
Yes, you do have it these days,
but it wasn't designed in from the start.
And I think there's more examples around testing.
So deterministic simulation testing,
what Foundation DB did.
And we actually got it from the people at Dropbox, not FoundationDB. James Cowling and them at Dropbox
were also doing DST. The idea is like, you design your whole
software system that it executes deterministically. So you don't
use, if you use a hash table, the hash table given the same
inputs always gets the same layout. You don't use randomness
in the logical parts that users would see.
So you can think of it basically like fuzzing
your software or property-based testing. Given the same inputs, you get the same outputs
and you design all your software like this, then you can test it. But if the test fails,
when you replay it, you'll get the same result.
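(A toy Python illustration of the deterministic-simulation property: if the only nondeterminism comes from a seeded PRNG, a failing run can be replayed bit-for-bit from its seed. This is a sketch of the idea, not TigerBeetle's simulator:)

```python
import random

def simulate(seed: int, ticks: int = 1000):
    rng = random.Random(seed)  # the only source of nondeterminism
    events = []
    for tick in range(ticks):
        r = rng.random()
        if r < 0.01:
            events.append(f"tick {tick}: drop packet")     # injected network fault
        elif r < 0.012:
            events.append(f"tick {tick}: corrupt sector")  # injected storage fault
    return events

# Same seed, same universe: a rare failure found once can always be replayed.
assert simulate(42) == simulate(42)
```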
And so, distributed systems were so hard to build 30 years ago that not many people did.
And then when they started really building them, they actually started with eventual
consistency, because people were still figuring out how to do strict serializability. So there
was a lot of like,
fashion around eventual consistency, which has gone away, I think. But back then, to build a distributed system, you
just kind of assumed, like, it'll just be eventual. Yeah,
but it's a harder to build distributed systems was quite
hard. And because it was hard, because it was hard because it was hard
to test them because you know the failure of one system over there causes the failure
you know of another system here and you can never when you find a rare bug you can never
replay it and you need so many machines and these bugs you know before TigerBee I worked
on systems that were distributed like full duplex file sync, hence my interest in Dropbox. But those systems were incredibly hard to test. There would be bugs. You know,
they're there, and they take ages to find and fix. And then you realize it's a buffer
overflow in libuv. That was like some of my first C code that I ever wrote, you know,
years before TigerBeetle. It was fixing a buffer overflow in libuv; it was the Windows event notification, some interaction with multi-byte
UTF-8 and different normalizations of NFC, NFD. That was this distributed systems bug,
and it also needed long file paths. So it took a year. We knew it was there.
Was that your Node.js days or that was prior to Node.js days?
That was Node.js days, yeah, so around 2016. And we were using like Jepsen-style techniques
where you've got fault injection, like chaos engineering. That's what we were doing. So you
could get the bug, but to even, you could get this amazing bug, but you knew it was there.
But you could never find it, reproduce it, or fix it.
And it literally took a year.
So coming to TigerBeetle, I knew from Dropbox there were newer ways to build distributed systems, just like you don't need to use eventual consistency
anymore, there's proper consensus.
No one should give that up
lightly.
Um, you don't need to, 'cause you can get great performance, you know, so much performance.
Fundamentally, you fix the design, then you can add consensus. It's not expensive when
the design is right. I mean, consensus is literally just, you know, you append to
the log and fsync, and in parallel, you send it across the network to a backup machine
and fsync that. That's replication, you know, 99% of the time. And then consensus
does the automated failover of the backups and the primary. So consensus doesn't really have a cost; that's a common myth.
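(A minimal asyncio sketch of the 99% replication path just described: the primary's local append-and-fsync proceeds in parallel with shipping the entry to a backup. The names, log path, and structure are illustrative, not TigerBeetle's actual code:)

```python
import asyncio, os

async def fsync_local(log_path: str, entry: bytes) -> None:
    # Append to the local log and fsync (blocking calls shown for brevity).
    with open(log_path, "ab") as log:
        log.write(entry)
        log.flush()
        os.fsync(log.fileno())

async def replicate_to_backup(entry: bytes) -> None:
    # Stand-in for the network send; the backup appends to its own log
    # and fsyncs before acknowledging.
    await asyncio.sleep(0)  # placeholder for the network round trip

async def commit(entry: bytes) -> None:
    # Both durability steps run in parallel; the entry is committed once
    # the local fsync and the backup's acknowledgement have both landed.
    await asyncio.gather(fsync_local("primary.log", entry),
                         replicate_to_backup(entry))

asyncio.run(commit(b"debit alice / credit bob / 10"))
```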
But all these things have changed. But coming back to it: testing has
changed, how you test distributed systems, because you can actually model a whole cluster of machines in a logical
simulation that's deterministic. Now you test it just like
Jepsen, but the whole universe is under deterministic
control. You can even speed up time, so you can like fast forward
the movie to see if there's a happy ending. If it does, great.
Watch another movie.
Each movie is like a seed, you know, like how you can generate a Worms level, you know,
the Worms game, or Scorched Earth; those games would be randomly generated all from
a seed.
And so this is just classic fuzzing, property-based testing with
seeds, but you're taking these distributed systems, and the whole database was
born to run in a flat simulator
that is just deterministic. And if you can do that,
you can build systems where, you know, you're kind of doing Jepsen, but you're also speeding up time.
You can reproduce, right? And then, once you've had that experience, you have to ask:
do the 30 year old systems have this level of testing? Yes, they have
30 years of testing. But with DST, you can speed up time. So
we've actually worked it out in TigerBeetle: one tick of our
event loop is typically 10 milliseconds; we collapse that
into a while-true loop, and we get a factor of 700x speed up, when
you take into account the CPU overheads of
just executing our protocols. So we flatten everything and we execute it and you get a
700x time acceleration. So we have a fleet of a thousand CPU cores dedicated to TigerBeetle.
We got them from Hetzner. Thank you very much, Hetzner. They're in
Finland, so nice and cool. They're burning clean energy. A thousand CPU core fleet. They run DST
24-7. And that adds up, I mean, it is roughly 700X. It's a thousand cores because we pay for it,
you know, dedicated and they run.
We do a lot of work to like optimize how much we're using those cores, but it does add up
to on the order of like 2000 years a day of test time.
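(Checking the arithmetic: 1,000 cores, each simulating time at roughly 700x:)

```python
cores, speedup = 1_000, 700
sim_seconds_per_day = cores * speedup * 86_400       # wall seconds in a day
years_per_day = sim_seconds_per_day / (365 * 86_400)
print(years_per_day)  # ~1,918: on the order of 2,000 years of test time per day
```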
But you see, now again, we're simulating like disk corruption, and we're simulating
things like, we write to the disk and the disk says, yes, I fsynced,
but the firmware didn't.
Or we write to the disk and the disk
writes it to the wrong place.
And so Tiger Beetle has like an explicit storage fault model.
So we do assume that disks will break,
but they don't only fail stop,
they actually are like kind of, we call near Byzantine.
So disks do, very rarely, but for 1% of disks in a year or two, a disk will have some kind of corruption
or latent sector error.
So you get latent sector errors; a little bit less common is like silent bit rot; a
little bit less is misdirected IO, where it actually just writes to the wrong place.
And this can be the hardware or the disk firmware, even the file system. So two years ago, XFS
actually had a misdirected write bug. And if your database was running on that particular version
of XFS and you triggered that, your database would write to the wrong location on disk.
And now the question is, well, like who tests for this?
And you almost can't unless you're using
these new techniques.
Yeah.
I don't know.
I mean, yeah.
It's cool.
I guess the question was, you know,
it's not enough actually to be as safe
as what was safe 30 years ago.
We've got new techniques
and there's a few more of them in TigerBeetle.
Yeah.
No, I think that's super cool.
It reminds me of light bulbs for some reason.
You know, LED light bulbs,
they say they're supposed to last 25 years or something.
And I'm always like, you don't know that
because you haven't been using them for 25 years.
In fact, the house that I currently live in,
we're going on 10 years now
and they sold us on all LED light bulbs.
And I remember as the installer was putting them in,
he's like, you're never gonna have to replace any of these.
And you know what, I've replaced a whole bunch of them.
So whoever did their testing,
can't do what you guys can do,
which is they can't just fast forward time and prove
that this thing's gonna last for 25 years
because they haven't been building these for 25 years.
But what's super cool with this,
what's it called, deterministic simulation testing?
Simulation testing, yes.
Yeah, what's cool about it is you guys can actually,
just through CPU power and design,
you can actually simulate all this time.
And so you can claim,
even though you've only been around for five years,
I'm giving you all five,
even though I'm sure it's technically less than that,
that you've actually tested for hundreds of years, right?
Like you could just say that because you've done that work
through this three-dimensional simulation
that I'm imagining that you put the system through
at all times.
I think that's pretty cool.
Yes, at least we're trying to get there.
So we would also say it's only as good as our coverage
and our combinations of coverage and our state space exploration, but we invest so much into that
as well. And so it does because we also know how many bugs we find and how rare they are.
If you think of it like a hacker: they require like eight exploits to be chained,
and each exploit is like tough. And then you just know that like traditional software,
there must be so many like millions of bugs, but yet we don't find them in the real world
because they're so, so rare. But with the DST, you do actually, and we find them pretty quickly.
So, yeah, so we wouldn't claim, but it is very strong.
It gives you some confidence that you wouldn't have otherwise.
And I think that's very valuable.
But like you said, or maybe you didn't say it, but I was thinking it while you said it,
there's no accounting for the real world.
So like Mike Tyson's famous statement,
everybody has a plan till they get punched in the mouth
or something like that.
It's like, you can simulate all you want
and it's amazing what y'all are doing.
But then there's the actual reality
and there's always gonna be something.
And so is Tiger Beetle out there?
Is it in production in reality yet?
Are people using it?
What's the state of that side of it?
Very much so.
So we had a customer reach out to us.
They needed to, I mean, we've got a few customers,
but just an example of one.
End of last year, they reached out, they said,
look, some regulations are changing.
They need massive throughput
because it can put their business ahead.
They shared their business strategy with us. They needed to migrate within 45 days
a workload of like 100 million transactions a month, 100 million logical transactions a month.
They needed to migrate within 45 days from the old system. We migrated them and they saturated
their central bank limit, and they were happy, you know,
and we pulled it off. And the system just works. Yeah, that was great.
You know, and there's like national projects, three different countries,
you know, whether it's for the whole national transportation system, or the central bank digital currency, or another central bank exchange.
TigerBeetle is going into the current production version of that Gates Foundation switch now, as we speak.
I think, just to go back to the DST, it's a lot
like formal methods. The difference is that formal methods check the design of a protocol so that you know that the protocol could possibly work. It's formally proven that it could work.
But for me, always the challenge was, well, how do I know that I've just, because I'm not a,
I always feel like I'm getting slower every day. How do I know that I coded this correctly? I know the protocol is correct.
I know, you know, Viewstamped Replication, Paxos, Raft, they're formally proven, but the implementation
is like thousands of lines of code. And so how do you check that? And so DST is actually checking
the actual code. And you know, the simulator is checking for split brain; it's checking linearizability, strict serializability,
of which linearizability is a part.
And it's even checking things like fsync-gate.
So cache coherency: TigerBeetle's user space page cache
is coherent with the disk at all times,
even if there's Fsync IO faults.
So Postgres, they are fixing this,
they're adding direct IO.
If people wanna find out it's called fsync-gate 2018,
but most databases still can't survive fsync-gate.
There were actually patches where databases were patched,
like MySQL, et cetera, to panic.
The problem is when they start up,
they still read the log from the kernel page cache,
which is no longer coherent. So actually they have to use direct IO. So Postgres has been on a long journey, laudably,
to add async IO, direct IO. It's in, I'm not sure yet. It might already be in as the default,
but those are kinds of the things you need to survive. And that isn't even an explicit
storage fault model, but TigerBeetle's simulator is actually checking it.
Your simulator can reach in and check so many invariants.
But then also, I think back to what
we were saying about claims and coverage
is you want to have a very buggy view of the world.
So you take your four primary colors: network, storage,
memory, compute.
You assign explicit fault models.
So compute, we say, look, that would be in the domain of Byzantine fault tolerance.
So we don't solve that.
And that's explicit.
Memory would be ECC.
That's our fault model.
And then what Tiger Beetle focuses on is the network fault model. So packets are lost, reordered, replayed, misdirected, classic Jepsen, partitions, all of that. That's what makes consensus so hard: just solving that fault model is almost impossible. Then Tiger Beetle adds an explicit storage fault model. So you write to the disk: wrong place, or forgotten. You read from the disk: you're actually reading from the wrong place. So you'll get a sector that has a valid checksum for the database, but it's the wrong sector. Now you need to daisy-chain your checksums.
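Here is a small sketch of that daisy-chaining idea, in Python rather than TigerBeetle's Zig, with an invented block format: each block's checksum is seeded with its parent's checksum, so a sector that is valid in isolation is still rejected when it shows up in the wrong position.

```python
import zlib

# Invented block format: (payload, checksum), where each checksum is a CRC
# of the payload seeded with the *previous* block's checksum. Not
# TigerBeetle's actual on-disk layout, just the chaining idea.

def chained_sum(parent: int, payload: bytes) -> int:
    return zlib.crc32(payload, parent)  # parent checksum seeds this CRC

def write_chain(payloads: list[bytes]) -> list[tuple[bytes, int]]:
    blocks, parent = [], 0
    for payload in payloads:
        parent = chained_sum(parent, payload)
        blocks.append((payload, parent))
    return blocks

def verify_chain(blocks: list[tuple[bytes, int]]) -> bool:
    parent = 0
    for payload, stored in blocks:
        if chained_sum(parent, payload) != stored:
            return False
        parent = stored
    return True

chain = write_chain([b"debit A 100", b"credit B 100", b"debit C 7"])
assert verify_chain(chain)

# A misdirected read: position 2 comes back with block 0's contents.
# That sector is internally consistent, its checksum matches its payload...
misread = [chain[0], chain[1], chain[0]]
# ...but the daisy chain rejects it, because position 2's checksum must be
# seeded by block 1's checksum, not by zero.
assert not verify_chain(misread)
print("misdirected read detected")
```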
And this is sort of the buggiest view across these four. Okay, the first two, we've been explicit that we don't solve those, because they have different levels of probability. The probability of a CPU fault is astronomically more rare than a memory fault, which in turn is... sorry, let me get the order right. The rarest thing is the CPU, then it's the memory, then it's the storage, and then it's the network.
So most people are just solving network; Tiger Beetle also solves storage. But storage, at scale, it happens more and more. You know, around 1% of disks fail in a two-year window. At enough scale, you're going to hit that enough times that it seriously matters, versus at small scale, where you're like, ah, we don't ever have to worry about that.
Yeah, that probability thing is one of the reasons why I think we as humans often fail to reasonably consider scenarios: we think about worst-case scenarios, but we don't pair that with the probability of those scenarios. And so it's like, what's the worst that could happen? And for some reason, this is like off topic, but in our hearts we give that a hundred percent chance, right? Like, if that happens, well, let's assume it does happen, so a hundred percent chance on that. Then it's terrible. But we don't pair that with probability.
So that's kind of what you're saying with the CPU stuff. It's like the odds of a CPU failure: catastrophic, of course, but exceedingly rare, versus the likelihood of a drive failure, for instance.
Exactly, drive or network.
Right, network is probably the most unreliable
at this point I would assume
since our drives don't spin anymore.
That's it.
Yeah, the network is incredibly unreliable, unfortunately.
Well friends, I love Notion, because Notion lets me do everything I want in a single application that lets me invite others to get involved in those organizational workflows, processes, collaboration, whatever you want to call it, right? The cool thing that I love most about Notion is that you can make it your own, meaning you can make operating systems, workflows, you know, processes, standard operating procedures,
The way you do things, the way you work,
the way statuses work for you.
And I don't mean that you gotta build this thing yourself
from the ground up.
No, everything is, for the most part,
push button, templateable.
You can start from somewhere and end up somewhere else
that fits your model.
For me, I could be in the middle of doing something,
thinking how I could add one more
property or one more status to the flow I'm doing things and make the change in real time
while doing the work to enable the future work I'm doing to be better, to be easier,
to fit me.
That's why Notion is cool.
And if you're not using Notion, well, now is the time to do it, because there is no shortage of ways AI has helped many, many people. I love Notion AI. Notion for me is big: a lot of stuff is in my workspace. I've got a personal one, I've got one for the Changelog, and all these things fit into different places in my personal life, the way I personally use Notion. But Notion AI lets me search across everything in one single AI interface, and it's the coolest.
But being able to combine your notes, docs, projects,
all the things you want to do into a single space
that's simply beautiful, easy to use, well designed
on all the platforms, whether it's web, desktop application, iOS application,
Android application, you name it,
and Notion is there for you.
And Notion is used by over half of Fortune 500 companies.
Now I don't know about you,
but I'm not anywhere near a Fortune 500 company,
but they're also used by many, many teams.
And we're one of them. We're one of those teams.
These teams that use Notion send less email.
They cancel more meetings because, hey, no meeting needed.
They save time searching for their work using Notion AI,
and they reduce spending on various tools because they consolidate it.
And this helps everyone to stay on the same page
and have the business stay more focused
So try Notion today for free when you go to notion.com/changelog. That's all lowercase letters: notion.com/changelog, to try the powerful, easy-to-use Notion AI today. And when you use our link, of course, you're helping us. So do that, use our link. That lets Notion know, hey, Changelog is impressing people, they're sharing what we do well. And you know what? I wouldn't tell you if I didn't. I love Notion. It's awesome. You should try it out. Again: notion.com/changelog.
Well, we're talking about the simulation stuff.
I do want to take a quick aside and
talk about something I found that was really cool on your website.
Sim Tiger Beetle, sim.tigerbeetle.com.
This is a simulator for distributed database scenarios.
It's like a game.
Well, for the video folks, we'll put it up for you guys to look at.
This is really cool.
You can play these different games,
like the Mexican standoff, the prime time,
and the radioactive hard drive.
What's going on with this?
This is like out of left field when I saw it.
This is not, I wouldn't expect this from you, Joran.
What is this thing?
I thought you didn't expect it, Jared.
Oh, yeah.
I didn't expect it.
I saw it.
I'm like, okay, I started playing it.
And it's like, it's polished.
The graphics are awesome.
There's sound, there's music.
And it's like a game.
All about distributed system failures.
Tell me more.
Maybe we, you know, our team,
we just enjoy Tiger Beetle too much.
You know, it kept me company in lockdown and COVID. March 2020, I started on the switch. A week later it was my birthday, and a week after that the world went into lockdown. And I was locked down, solving this problem. July 2020, I sketched Tiger Beetle, created the prototype, many versions, and never stopped. I don't know, maybe it's because it's out of that experience that we, as a team, have so much fun, or we really enjoy it. But Sim Tiger Beetle is kind of an expression of that, because of the feeling of our DST simulator. The name of it is the VOPR, the Viewstamped Operation Replicator. It's an homage, an ode, to WarGames, you know, that's got the WOPR, that classic simulator in WarGames. And so the Tiger Beetle simulator is called the VOPR. But the VOPR was such an experience, you know, switching it on for the first time.
I was listening to Kings of Leon's "Crawl" as I switched it on. We had been building Tiger Beetle for a year, with the whole design all careful, the fault models. We had the interfaces designed. I knew we would do DST, and I planned it like that. A year in, it took about a week to build the first version of the simulator, and then we switched it on. And the bugs just came falling out of the trees; you'd fix like five a day. And each of these was like one of those one-year bugs from before. Now you're fixing five a day. That experience was just special as a programmer. I wish it for everyone.
And I think people are getting into this style. But, you know, this runs in our terminal. So I did a demo to a friend for the very first time, Dominik Tornow, fantastic, you know, of Resonate. I really look up to him in distributed systems. He's become a mentor and friend, and if I have a hard problem in distributed systems, I also ask Dominik. I had just met him, and I showed him the simulator running in the terminal. I'd never shown anyone before. He was like, wow. You know, he'd done formal methods. He's like, this is formal methods, like, on the code. And I showed him the probabilities of the fault models for each simulation run, and he was blown away. He said, no, you've got to tell people about this. And then I thought, well, how do you show this to people?
And, you know, I'm a dad. And I thought, how do I encourage my daughter? Not just encourage her, but, you know, how do you encourage the next generation to get excited about computer science?
Yeah.
Because, to me, this was the most magical part of programming in my own journey. So I thought, well, let's make a game.
Let's take our simulator and put a graphical front end on top to hook into all these events. And then we can create different levels for people, showing them how consensus works. If there are no network failures, everything's perfect. We'll simulate the network latencies and the disks; everything's simulated, the clocks, all of that. But there are no faults, and so it's perfect. And now you can actually just teach: this is just normal replication through the consensus protocol. And then the next level is like, okay, now we're going to introduce probabilities of partitions and network faults, but the disk is perfect. And then the next level is, okay, the disk is radioactive.
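If you squint, those levels are just fault-probability presets handed to the same deterministic simulation loop. A hypothetical sketch (the names echo the games on sim.tigerbeetle.com; the numbers are made up):

```python
import random

# Hypothetical level presets: same simulator, different fault probabilities.
LEVELS = {
    "prime time":             {"packet_loss": 0.00, "partition": 0.00, "disk_fault": 0.00},
    "mexican standoff":       {"packet_loss": 0.05, "partition": 0.02, "disk_fault": 0.00},
    "radioactive hard drive": {"packet_loss": 0.05, "partition": 0.02, "disk_fault": 0.10},
}

rng = random.Random(7)  # seeded, so a level always replays identically
for level, probs in LEVELS.items():
    # Count how many of each fault a 1,000-tick run would inject.
    injected = {fault: sum(rng.random() < p for _ in range(1_000))
                for fault, p in probs.items()}
    print(f"{level}: {injected}")
```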
Yeah.
Well, I played it for five or 10 minutes and I had a blast.
It's like, here's a hammer.
You can start hammering stuff and see how it reroutes
and really, really cool.
Yeah, and you know, each of those beetles that you hit, they're running, each of them is running real Tiger Beetle code compiled to Wasm.
Yes, it's the real code in your browser for the cluster.
When you take a hammer and hit the beetle, you're actually physically crashing the machine, and it's restarting, and you're actually getting to touch a simulated world, but against the real code, as a human.
Yeah, and-
That's amazing.
I feel like as a Tiger Beetle engineer, you could just be playing that game, and when someone asks what you're up to, you're just like, I'm working. I'm simulating some crashes here, come on.
Well, I have done that. I do do that, Jared. Sometimes, you know, instead of doom scrolling, you just sim scroll.
Yeah.
My daughter says to me, you know,
Papa can we play the Tiger Beetle game?
And then, but I'm glad you call it a game
because we meant it like in the walking sim genre.
So it's a game that you can't win
because no matter what you do, things recover.
Right.
And but if-
You can't knock the whole system out.
It's gonna go back to good.
It's gonna find its way back.
Yes, you actually can, if you're really lucky. When I do it in live presentations it's never failed me, but theoretically you could crash the cluster. Because, you see, those human tools allow you to inject more faults than the fault tolerance, and then the system is designed to shut down safely. So you might run into that, but you'd have to be very quick.
Yeah.
Well, I was too slow in my five minutes of playing around. Also, I didn't know it was possible. So now that I know it's possible, maybe I'll commit myself to shutting that system down.
Pretty cool.
Oh, but there is a game, Jared, at the end.
I don't know if you've played the credits in Radioactive.
I have not.
No, is this an Easter egg?
Yeah.
After Radioactive, at the end, the credits roll, and that becomes like a little platform jumper game, and you can jump and spin the beetle. And you do get a score. That one is quite hard.
That sounds amazing. I'm going to do that after we hang up here.
Yeah. You know, if I can just quickly add: that game was created part-time. I met Fabio Arnold in Milan, at the very first Zig meetup in Europe. He was there, and he created it part-time, just a few hours every week, with Joey Marx, an illustrator from Rio de Janeiro, Brazil. Just the two of them, and they created it on a very, very low budget. But they had such passion, and we then carried on and put more into it, so it did become more polished.
Yeah.
That's just their skill, you know, tribute to them.
Cool.
Yeah, I think for those who go out and give it a play,
you will notice immediately, and I did,
just like how much love is put into something like that.
Like you mean it when you say we just love Tiger Beetle
and this whole system because, you know,
there's custom sounds and there's music.
This is like, this is a labor of love for sure.
And a really cool way to show off
what you all have built in a way that is difficult
just with our brains, you know, as you talk about it,
for me to map it onto my brain and make sense of it,
but when you put it out there in that visual way,
it's very compelling.
So shout out to them for their labor
and for you to keep polishing that thing up.
I did wanna touch on the open source side
and kind of the business end.
You mentioned some customers, you know,
you raised some money now,
so this is like a serious business,
but it also is an open source database.
Can you talk about the give and take there,
the push and pull, the decision making process,
because we talked to a lot of people in your circumstance
and some of them have made other decisions,
some of them have made the exact same decision you've made
and we're all trying to figure it out.
How can we make this thing work?
So tell us your side of the open source slash VC story.
Yeah, thanks so much for asking that Jared.
That was my third feeling.
So my first feeling was,
I'm hitting all your feelings here.
Yeah, yeah, yeah.
I mean, really, July 2020, as the project was started, I remember clearly that there were three moments. The first moment was like, wow, this prototype is fast. The design works. And like, wow, maybe this could change things for the open source switch. That was it. And maybe it could change things beyond, but I didn't know. The second moment I remember was the DST, switching it on to Kings of Leon's "Crawl". The third moment was this wondering of... well, what actually happened was: we designed it to be so much safer that, yes, it gained adoption within the Gates Foundation project. We won trust, and they are integrating it today. You know, it'll power countries at the national level, who knows in how many years, but as it gets deployed. So we did solve the trust problem of being so much safer, because we designed it like that as well.
But then there were people within that project saying, this is all well, but you know, open source is too cheap. Where's the company to support it? Where are the people that are going to be available to service it and really work on it? And yet you need open source: this system is Apache 2.0 open source, and all the software it uses cannot be software that isn't open source, because otherwise it just would be a non-starter. So Tiger Beetle, which was created by contracting back then, had to be Apache 2.0. That was obvious to me: open source, because otherwise you don't fulfill the mission, which is what inspired the performance and the safety in the first place, to make this really safe, because it is people's money.
So then the question was, people were saying, where's the company to support this? Open source is too cheap. But I still didn't have a clear vision of a business model at that stage. And then it became clear, as startups said to me, open source is too expensive. So on the one hand you have countries saying it's too cheap, and on the other hand you have startups saying it's too expensive. And I'm like, this is Goldilocks. We're just trying to make some great open source porridge, and it's either too cheap or too expensive. But nobody's saying it's free as in puppies, you know.
And then I realized, okay, that's it: business model is orthogonal to open source. Business is about trust. People trust you, at the national and enterprise level, and it's always about trust. That is what you sell as a business: trust, your brand, your reputation. I use the word brand. Startups talk a lot about go-to-market; I think it's more interesting to talk about brand. Do we all appreciate the value of brand? Because brand is trust. And I must thank my auditing professors. They used to ask us: what do accountants sell? Trust. That's the only thing you sell, trust that the numbers are correct.
Right.
And yeah, so business is about trust. And it's also about value. Startups, they want a push-button experience: someone who will run Tiger Beetle for me, because that can make it cheaper for me than if I had an engineer do three months of work around the SRE of a database.
Right.
So you can actually have a business and sell something that's going to make something cheaper for startups. And similarly for enterprise: you can have a business and sell something that provides the value they need. They might have SRE teams, but they need the infrastructure to support massive scale, like petabytes. How do you connect Tiger Beetle to object storage, to S3? Like OLTP data lake; not only a data lake, but let's just connect the OLTP directly to S3.
And so this comes to your question about the tug of war, and licensing, and all of this. I think the big mistake that we can make, and I used to make this until it became clear for me, is missing that an open source license affects the implementation, not the interface. But when it comes to competition in the business world, that doesn't happen at the implementation. Typically, it happens at the interface. So if you think of some of the fiercest competition, when things were really on a knife's edge for the web, it was over the interface, not the browser implementations, that the war was fought. Mozilla fought that war, and we needed other browsers to fight it, because the interface was being embraced, extended, and extinguished. Triple E, you know. And then you think of Android and Java, and again, it wasn't about the Android implementation, it was the interface, and that was the Oracle-Google lawsuit. And then again, you think of Confluent: Kafka is Apache 2.0, open source.
Then Redpanda came along, and I'm a huge fan of Redpanda, because of very similar design decisions around being direct with the hardware, static allocation, all of this, very, very similar. We came from a similar time period, and in that period things were changing in how you build things.
But Redpanda came along, and they saw the open source implementation of Kafka and said, well, thank you, but we don't want it. But that interface, that is great. That's where our business will be also, that interface, thank you very much. And so they built a massively high-performance implementation.
And then WarpStream came along, and they said, well, you know, Redpanda, you are Business Source License, not open source, source available. Confluent, Kafka, you are Apache 2.0 open source. But that's all implementation. I'm going to do my own implementation on object storage, thank you very much, and take the interface. Great. Okay, now we're all competing. And so I think source available is kind of a myth. I always feel that something inside of me dies when I see a great company relicense, or when I see a young startup follow that lead. Because to me, source available says: we think it's going to stop competition. It doesn't.
You may as well be on the beach building little moats and sandcastles, but innovation, technology, is like a wave. It'll just find a way around you; it'll WarpStream around you. You can't legislate competition away, and we shouldn't be trying to build companies where we think the success of the company is us creating a monopoly. The world's too big; you don't need monopolies to do really great. And that doesn't build trust, you know, to say to your customers, you can only buy from me.
So people think it stops competition, and they think it helps them sell. And it actually defeats both of those, because you get complacent, and you actually fail to build trust. You burn trust when you relicense. And if you start source available, you're going to be doing due diligence with an enterprise, and they'll say, sorry, you're not open source. It's confusion, license confusion. And maybe some people get it, but there's this little one percent headwind. And it's actually a category error, because you're spending so much effort chatting to people about implementation licensing, and the rest of the world is competing on interface.
And you know, Tiger Beetle's interface is quite simple, very simple. So whatever license we apply, it doesn't matter if debit-credit is where we compete. There are companies that offer debit-credit as an API, and it's very similar to Tiger Beetle. But we compete on trust. We didn't just take a general-purpose database and slap on debit-credit. We went deep, we really cared, and we built the whole thing. And people pay us. Before we were even a company, we had an energy company reach out, and we landed, you know, quite a good contract very quickly.
And I think it came down to trust.
So open source builds trust.
Open source is great for business.
It's also orthogonal.
And yeah, the other thing is, there are a lot of things in Tiger Beetle that came from my own experiments, you know, my passion projects. They're all in Tiger Beetle. Many parts of the design of Tiger Beetle come from these various experiments that I did. And so I was never going to put all of that into a project if it's not open source, because it's just too valuable. I want to always be able to play with it, no matter what happens to it.
And I think we all feel like that,
like our critical infrastructure,
it just has to be open source.
And so, yeah, I think that's kind of how I think of it.
No, I think that's a great perspective
and a specific explanation that I haven't heard previously.
So I definitely appreciate it.
I'm mulling on it.
I think I agree with most of what you said
and the implementation versus interface dichotomy
is one that I hadn't considered that explicitly
and I need to think through it more.
So I appreciate your thoughts on the matter.
Question is, what happens, you know, Apache 2.0, you put your heart and soul into this thing. What do you do? How do you respond if and when AWS comes by and says, Tiger Beetle, Tiger Beetle by AWS, you know, for sale now?
Like, does that scare you?
Does that threaten you?
Like, what do you think about that scenario?
Because that's what a lot of people are concerned about
most specifically.
Yeah, so I think one can try and stop it. You can build the castles in the sand. Or you can just say, look, the wave is inevitable. It's coming, so let's prepare for it. So what we do is we get the surfboard ready, and we're on the beach, waiting for it to come. Actually, we're already paddling out, and there's the swell. The wave will take its time, you know, to catch up, but we're already surfing. So when they come, it's like: we could keep the cavalry in the castle, or we could get them out into the field and have great cavalry, a great user experience. Let's actually compete, let's add value and serve the community honorably, at a profit.
Let AWS catch us in that.
And great, if they decide that they couldn't build a debit-credit database as well as Tiger Beetle, so they offer Tiger Beetle as their flagship OLTP database, well, that builds trust, and then a rising tide lifts all boats. And now we're surfing the wave with AWS.
I love it, I love it.
I think that's a great way to think about it,
because like you said, the wave is inevitable,
so you might as well prepare for it.
You might as well ride it, you know, ride that wave.
And I think we've also seen this play out a few times: source available doesn't stop AWS. If it's valuable enough, they'll write the implementation. If you relicense, they will immediately fork your community. Now you've got two problems: they're still competing, and now they lead the interface. And that's when it's fatal for a company, when they relicense and you see the classic AWS fork. That's actually the thing you don't want, when your open source clients are being bought up and being led by someone else now.
Yeah, but I love AWS. I love their work. I've learned so much from their distributed systems. I've got friends that work there. So really, to me, that isn't the threat. The threat is, you know, what's the problem with the world? I am. We are, as a team. The threat is really that we stop investing in performance and safety, that we stop being trustworthy, stop building trust.
So maybe we should say too, Jared, that your product is more than the open source. There's a principle here, and I think it works both ways. If we connect Tiger Beetle to a proprietary API, a proprietary interface, our principle as a company is that's viral: if someone licenses their interface as proprietary, our connector will be proprietary. And for example, S3: if we connect to S3 for massive scale, then we charge for that, fairly. We make honest profits to serve the community honorably. At a profit; there must be a profit, because there's entropy in the world.
Sure.
Yeah, but if something's open source, then our connector will be open source too. And there's lots of value there, just like people will pay Amazon for Aurora, because there's great value in all the management around Postgres. Again, the porridge is too hot or too cold.
So we'd be curious to get your thoughts once you've thought about it, and I can sharpen the argument. Because I really think more founders need to stand up. We've all been given a lot from the previous generation of open source, and I think it would be great if we all say, okay, we're going to pay it forward as well, like make a technical contribution. And there's no reason not to; I think it's actually better for business.
Yeah.
No, definitely appreciate everything you said right there.
And I will be listening to it back as we produce this
and consuming it more.
I love the surf the wave analogy.
That's where you really sold me.
So, so far I'm amenable.
I'm amenable to your argument, but you know,
I'm very easily convinced on the air here.
Is there anything else Tiger Beetle or otherwise
that you wanted to touch on that I haven't brought up yet?
It's been a great conversation.
Yeah, no, I've loved it too.
I mean, I guess we should say,
I should say we wrote it in Zig,
this new system's language.
Oh yes, I didn't bring up the language wars yet.
We have to get our clip, you know,
cause the flame wars must rage on.
You wrote it in Zig, I assume from day one, a day-one decision, and you are probably happy with it, since you just brought it up now.
So Zig for the win, it sounds like.
Is that your overall message?
You love it?
Yeah.
It was also just a big wave.
You could see the swell and you're like, I'm going for that wave.
You're hopping on that wave.
Yeah.
That's a good choice so far.
Yeah.
When Node.js came out, I jumped on, because it made sense. And then I was really happy that I did. And when I saw Zig come up, I thought, well, it's not often that you have these moments. Rust was another one. But with Tiger Beetle, the timing meant that we could have caught the Rust wave, but there were so many thousands on it already, we would have been a drop in the ocean. Also, we wanted to do static memory allocation, and Zig is really ideal for that. It's perfect for the Tiger Beetle design. It would have been much harder to do our intrusive memory patterns and direct hardware access, io_uring, zero copy.
Two of our team... well, one of our team is actually the co-founder of the Rust language with Graydon Hoare. He was the project lead at Mozilla, Brian Anderson, brson. His desk was next to Brendan Eich's, and he's writing our Rust client in Zig. Well, in Rust, sorry, he's writing it in Rust.
I was like, wait a second.
He writes Zig normally, but he's writing our Rust client. And then matklad is the creator of Rust Analyzer and IntelliJ Rust.
He also joined the company. He's basically like a co-founder. So there were a few of us. And my senior partner from Coil, not many people know, but he came with, and matklad, Federico, Rafael, they sort of are the core team. And matklad joined, you know, he had written a blog post called Hard Mode Rust, trying to do static allocation, very similar patterns. And then Jamie Brandon introduced us, and matklad was like, I've been trying to do this in Rust. And we're like, but Tiger Style, you know, the way of coding in Tiger Beetle, Tiger Style, we've got our engineering methodology written up.
Oh really?
Can you link me up to that?
We probably don't have time to go into detail.
Well, I'd love to read it.
Tiger style, you call this?
Tiger style, yeah.
And that's sort of all the safety techniques. So, assuming that we do still have bugs, we also have like 6,000 assertions that check the operation of Tiger Beetle in production. If any one of them trips, it's like 6,000 trip wires, then there's probably a CPU fault or a memory fault or a bug, and then we immediately shut down the whole database. You want to operate safely or not at all. That way you get total quality control, and the system becomes safer and safer. You don't have latent bugs.
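As a sketch of that trip-wire posture, in Python rather than Zig, with invented invariants: the checks stay armed in production, and any violation halts the process instead of letting it limp along with corrupt state.

```python
import sys

# A toy flavor of "operate safely or not at all". Invented example;
# TigerBeetle's real assertions live in Zig.

def ensure(condition: bool, message: str) -> None:
    # Unlike Python's `assert`, this cannot be compiled out with -O:
    # production keeps the trip wires armed.
    if not condition:
        print(f"invariant violated: {message}; shutting down", file=sys.stderr)
        sys.exit(1)  # halt rather than continue with corrupt state

def apply_transfer(balances: dict[str, int],
                   debit: str, credit: str, amount: int) -> None:
    total_before = sum(balances.values())
    ensure(amount > 0, "transfer amount must be positive")
    ensure(balances[debit] >= amount, "debit must not exceed the balance")

    balances[debit] -= amount
    balances[credit] += amount

    # Double-entry invariant: money is moved, never created or destroyed.
    ensure(sum(balances.values()) == total_before,
           "balances must sum to the same total")

balances = {"A": 100, "B": 0}
apply_transfer(balances, "A", "B", 40)
print(balances)  # {'A': 60, 'B': 40}
```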
Yeah, so Zig really suited all, not only the performance,
but also some of the safety decisions.
Right, it wasn't merely the trend.
There's also technical reasons that you chose it.
Yeah, I think we picked Zig before Bun. The only other major Zig project at the time was River by Isaac Freund, a Wayland compositor; amazing programmer, Isaac. He contracted on Tiger Beetle for quite a while. And then it was Tiger Beetle, and then Bun. Also, Mach Engine was around the same time as Tiger Beetle. But we really just picked it because I was doing a lot of C code, and Zig was just a perfect replacement for C and for all these new memory patterns. Yeah, that was it.
Very cool. Ahead of the wave on that one. You were an early adopter of Zig, and probably one of Zig's, I don't know, largest code bases, but maybe the most production-grade and out-there, like, successful project to date. Would you say that's fair?
I would say yeah. Bun is also pretty massive.
For sure.
And yeah, there's also Ghostty by Mitchell Hashimoto, which is like incredible code. His performance work there is very similar, trying to get as close to pure memory bandwidth as you can. He's making a terminal, you know; how can you make that as close to memory-bandwidth performance as possible? Which is also what we think about with Tiger Beetle, and same with Jarred and Bun. And I think that goes back to what we were saying, the love for that sim that you see. We're actually trying to show what it feels like if you're coding in open source, because we've really crafted everything. You may as well enjoy it. And this sort of came from antirez; his craft on Redis impacted me a lot.
Well, speaking of all these things, we just put a clip out today, as we record, of our conversation with antirez, which was just a couple of weeks back. And did you know he's hard at work trying to get Redis to be open source again inside of Redis Inc.? He's advocating, he's returned to Redis, and he thinks he can get the company culture moved to a place where they'll switch off of that proprietary new license, probably to AGPL. I think you'll find that good news, considering your stance on open source.
I'm excited about it.
Hopefully that happens.
That's great.
And again, I should be clear, I love open source also
because I love how it enables businesses.
So I actually think it's great for business.
I don't just like open source because I like it,
but I actually think it is better for trust, for sales.
It makes everything easier.
But yeah, that's fantastic news.
I can't wait to listen.
Yeah.
Cool, cool, cool.
Well, Joran, I appreciate you coming on the show.
I'm fascinated.
I don't have any use cases for Tiger Beetle in my life,
but I respect it.
And I'm sure our audience will enjoy this conversation
as well.
So appreciate your time, appreciate your work,
and your perspective on open source and business,
which is refreshing, especially in a sea of people
who are kind of moving away from your perspective.
But maybe back again, we'll see.
I mean, Elastic Search is back,
maybe Redis is coming back, we'll see what happens.
But I appreciate you coming on the show.
Yeah, there's always a new wave to come.
Yeah, and I appreciate you too, Jared.
Thanks so much for this. It's been really special.
Tiger Beetle sounds pretty amazing and I'm always impressed by what you can accomplish when you laser focus a solution on a narrow
problem space. That being said, general purpose solutions are amazing too because they're useful in so many different problem spaces.
It really does depend, at the end of the day,
what you're trying to do when you decide to build
or buy a solution and what to buy if you go that route.
What did you think of this conversation with Joran?
Let us know in the comments.
Links to all the things are in your show notes.
That includes Tiger Beetle's architecture, the simulator, so cool, and Tiger Style,
which I'm interested in checking out.
Also, a direct link to continue the conversation
in our Zulip community.
Hop in there, hang with us.
No imposters, totally free.
Why not, right?
Let's thank our sponsors one more time.
Fly.io, you know we love Fly,
and a shout out to Augment Code at augmentcode. Fly.io, you know we love Fly. And a shout out to AugmentCode at
AugmentCode.com.
To Depot.
Build faster, waste less time.
Depot.dev.
And of course, to Notion, Adam's
beloved collaboration tool,
Notion.com slash changelog.
Please do use our links and discount
codes when you kick the tires on our
sponsors wares.
That lets them know we're helping spread the word
and that helps us put food on the table,
which we like to do, as I'm sure you know.
All right, that's all for this episode,
but we'll talk to you again on Changelog & Friends,
on Friday.
Bye, y'all.