CoRecursive: Coding Stories - Tech Talk: Data and Scale with Pat Helland - The long view on distributed databases
Episode Date: March 31, 2019

Tech Talks are in-depth technical discussions. Pat Helland has a wealth of knowledge on building distributed data stores. He has been working on distributed data stores since 1978, when he worked on the Tandem fault-tolerant database. Since then he has been involved in many distributed database projects. Here is the key thing: he is also a master at explaining the key ideas of distributed systems using simple language and practical everyday examples. Can you get married on the phone? How are messaging systems and idempotence like regional offices communicating via fax machine? These are the types of metaphor that Pat uses. Today, Pat sits down with me and teaches me about dealing with data in a distributed, fault-tolerant, infinitely scaling world.

Webpage: http://corecursive.com/028-data-pat-heland/

Links: Pat's articles on ACM Queue
Mind Your State for Your State of Mind
Consistency Eventually
Standing on Distributed Shoulders of Giants
The Singular Success of SQL
Life Beyond Distributed Transactions
Transcript
Hello, this is Adam Gordon Bell.
Join me as I learn about building software.
This is CoRecursive.
You have to think about the crazy idea for sometimes years
before you have it organized well enough to make a story out of it.
Because if you don't have a story, you don't have a paper.
And if you don't have a purpose and a point to make, you don't have a paper.
So it just takes a lot of noodling in order to make it happen.
That was Pat Helland. He's been working with databases since 1978. Today, we talk about where
we should store data. So should we store it in a relational database on a single server?
Should we store it in some sort of distributed database? What are the trade-offs that various stores represent?
Questions like, could we make an infinitely scalable service?
How does immutable data make distributing things easier?
If you liked this episode, give us a review on iTunes or Twitter
or mention it on Reddit or Hacker News or wherever.
Hey, I'm Adam Gordon Bell and this is CoRecursive. Today I'm speaking with Pat Helland.
Pat, thanks for joining me.
Well, thank you for having me. I'm looking forward to our chat.
I am as well. Pat, you work at Salesforce and I've written down here that you've basically been working on data storage and writing bits to disk since 1978.
Correct.
I had hair at that time. I don't know, but I've been working on databases and storage and
transactions and distributed systems and built a multiprocessor and working on application platforms and stuff since 1978.
It's been a long and checkered career of building systems and then having thoughts about it and sometimes having the opportunity to write that down and share.
Yeah, you have a lot of great articles, which I think will cover some of that ground on the ACM.
It's actually one of my favorite things to do, to carve off half a day or a day to work on the next paper.
And I'm always boring my friends with the titles and ideas for papers that are not yet written.
Yeah, you have great titles.
Thank you.
I'm trying to think. I forget now. I can't think of one.
Well, just right now, today on ACM Queue online, was Identity by Any Other Name, which is talking about how when you get into a distributed
system, things are knit together by identifiers. And you may have different names for the
identifiers; they're pointers, they're IDs, they're whatever. But it's pretty much identifiers that
are tying everything together once you get into a distributed system.
Yeah. Standing on the shoulders of distributed giants is another one. I just like your titles.
You tend to be playful with them.
Yeah. It's standing on the distributed shoulders of giants, which plays off of the physics
and physicists because it's, you know, picking up five different physicists and giggling
about how the things for which they're famous kind of preceded our distributed
systems ideas.
It's a pedagogical mechanism to explain how the heck distributed systems work.
And I think that one's just kind of fun and silly. And I had a good time with it. So here's why I brought you here, data stores
and distributed systems. And when did everything become so complex? Like I really love relational
databases and for a long time, I felt like that's kind of all I needed. But now I live in a world
where there's like Kafka and key value stores and so on.
And I'm trying to decide when which is the right tool.
So you're here to be my teacher, hopefully.
Well, let me give you a perspective.
Okay, I fell in love with building databases and transaction systems in 78
because the thing I absolutely loved is that the API was begin transaction,
do some stuff, end transaction.
And underneath it, we needed
hundreds of thousands of lines of code and now growing into millions of lines of code to just
deal with all the goop that happens environmentally so that the user doesn't have to think about that.
And that was just a phenomenal, phenomenal advancement back in the day when we were
running on one computer. And it came under more pressure as we're trying to run over thousands of computers
because you just can't pull everything into one place at one time.
So what happens in those thousands of lines of code between begin and end transaction?
Well, you have to organize the data so you can locate it on disk efficiently.
For relational systems, you have to create the inverted indices so that you can look things up by different values that
are kept in there. And you have to make sure it happens atomically. You have to deal with the
fact that if something croaks, how are you going to put it back together if something dies?
Because that's a very important part: you want to lower the burden for the developer that's
using the system by taking those complexities on within the system. You and I talked about it just a bit ago. One of the short papers I wrote is called The Singular Success of SQL. That's the thing I based my column on. I write papers for ACM; there's a column, and I titled the column Escaping the Singularity. It's
not your grandmother's database anymore. The point I wanted to make there is that to make
relational algebra work, to do relational systems, you have to make all the data there,
all the data accessible and all the data still, because you need to be able to do these set
oriented operations across
all this stuff and not get tangled up in other people changing other things and not get tangled
up in the data not being here now. And so it kind of works when you bring everything together to one
point in space and together at one point of time. And that's known as a singularity when you collapse
space and time. Now, transactions make
it look like you've collapsed time, even though you might have other stuff running against the
database at the same time. And that's kind of part of the fun of those. But the model,
the abstraction is one point in space and time. Now, our systems are so darn big, so darn complex,
that doesn't work so much anymore. It only works as one subset of the solutions that it takes to
compose a broader system. So at what size does that break down? I don't know. I mean, there's
no black and white. You know, it works pretty well for a handful of computers that are close together
or one computer certainly. But when you start getting into the scale of big web solutions,
it becomes a challenge. And that becomes one of the design points we have to deal with. And then you get into people who have quite
legitimately, you know, it's really an interesting thing, the key value store that's scalable,
that you shard, it's a fascinating data point. And I wrote a paper in 2007 called Life Beyond
Distributed Transactions: An Apostate's Opinion. Because I was a believer in
distributed transactions. I believe they solved all the problems of the free world. And then I
started seeing things getting brittle around the edges and getting challenging and availability
being a challenge and one system went down and that kind of stuff. And so people start doing
other things. And in this new world, what we see the people doing is sharding their data so that
it can run across multiple machines.
And then there are patterns by which you can have your workflow do steps, one transaction at a time.
Step one here, step two there, step three there, and allow the system to scale.
And so that's what that paper from 2007 is about.
So sharding is basically, so you're saying, oh, my data is too big now for one relational database,
so I could divide it into many relational databases, partitioning it over some key?
Maybe. You can. The question is, okay, how do I identify that piece, right? And so if the system
is going to pick that piece up, because of the system that's holding it, the server that's
holding it is getting too fat, right? It's got too many pieces. It can't run more, right? So now it's going to pick some stuff up and move it from this server to that
server. Okay, that's cool. But what that means is I have to have a bite-sized piece, and I called it
an entity when I was doing this in this paper. It's just a term. It doesn't have to be the only
term. And the idea is that you can have transactions within that entity, but I can't have transactions
across entities, because if I can't
do transactions across systems and I move one of these things from this system to that, then it
becomes a challenge because now I've got to do transactions across systems if I had a semantic
in my app of working across multiple entities to do the transaction. So how do you design for that?
You start out and you say, this is all I'm ever
going to use for a transaction. So if it's working from one entity to another, and there's another
one in the same server, I'm not going to do transactions across them because one might get
moved to another. And fundamentally, you have to pick an identifier for these things, which is the
boundary. It's the boundary within which you do transactions. And so that's a trick people use.
Let me make sure I understand that. You're saying that we have an entity
and we're limiting our transaction scope to a single entity.
Correct.
And so actually, do you have an example maybe that would put some meat on this?
So say I'm working on one particular order. Okay. I'm processing the work
for a workflow to ship some stuff for an order. Okay. And the order identity and the order information and the information
about who's the shipper and the information about the inventory that I've got from here and there,
it all fits into one entity and I can put it on one box. Matter of fact, I can put hundreds of
thousands of them on one server, but as time goes on, there are going to be more of them than fit on a single computer. And so if I can't do
transactions across computers, because that's a challenge, then how do I make sure that I stick
within it? Well, I only work on one order at a time. And then I commit that work. And then I
work on the next order at a time. And so now the question is, how do I hook up the order processing to the
shipper? Well, I probably have messaging that goes from this box to the computers or computers that
hold the shipping information for the shipper. And so then I'll send something off to there.
And that composition allows me to then expand and scale the system without things breaking.
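To make the shape of that concrete, here is a minimal runnable sketch of the pattern, assuming in-memory stand-ins for the two shards and the message queue; the names process_order and handle_ship_request and the data layout are invented for illustration, and in a real system the order update and the enqueue would commit in the same transaction.

```python
# A minimal sketch of the entity-plus-messaging pattern Pat describes: each
# transaction touches exactly one entity, and work that spans entities goes
# through a message instead. The in-memory "shards" and "queue" here are
# illustrative stand-ins, not any real product's API.

order_shard = {}      # entities living on the "order" server
shipper_shard = {}    # entities living on the "shipper" server
message_queue = []    # delivery between the two, assumed reliable

def process_order(order_id, items):
    # Transaction 1: update only the Order entity on its own shard.
    order_shard[order_id] = {"items": items, "status": "awaiting_shipment"}
    # Instead of also updating the shipper in the same transaction,
    # record the intent as a message (committed along with the order
    # update in a real system).
    message_queue.append({"type": "ship_request",
                          "order_id": order_id, "items": items})

def handle_ship_request(message):
    # Transaction 2: a separate, later transaction against the Shipper
    # entity, possibly on a different server. No transaction spans both.
    work = shipper_shard.setdefault("work", [])
    work.append(message)

process_order("order-17", ["pen"] * 3)
handle_ship_request(message_queue.pop(0))
print(order_shard, shipper_shard)
```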
Yeah. And also the place where I get concerned is that it's also complicated. So previously,
if I had my relational database, I could update the order table and the inventory table at the
same time. Right. And now I can no longer do that. Is that right?
In many cases for many large distributed systems. Yes. And so it adds a challenge in building the
application because
you can no longer transparently view all these tables and just do what you want to do. You're
correct. But it also builds a system that can handle scale and continue to function.
And so I have my order, say I ordered three pens, and then there's an inventory system that tracks
the pens. And so I send a message to the inventory system, say,
I would like three pens. Something like that. Yeah. You have to design what the protocol is. Yes.
And you get into this world of how many pens do I order? How many pens did I ship?
Have I promised more pens than I've actually got? There's all the overbooking challenges.
Applications are a fascinating and fun system. And the question is, what do you build
it on top of? But there is no doubt that building a scalable distributed system has more challenges
than a centralized one. But let me give you a twist on that too. I may be having my company
dealing with my suppliers and my customers, but I have to do that kind of work across trust
boundaries anyway. I have to do that kind of planning for how do I do my piece and then send a message to
the partner to do their piece.
And so that kind of work is what we've been dealing with for decades as we build systems
that connect other companies.
You know, when I was a kid, you connected companies by sending faxes to someone who
added a terminal who then typed it into the local computer and the computer only dealt with your company's stuff, right? And then you would
print things out and then you would send them via fax or snail mail to the other company.
And so that meant there were people in the loop to try to deal with some of these challenges of
ordering and reordering. Do I have enough inventory and did I promise more than I've got?
And so where we're dealing with what's going on now is much of that stuff we're trying to figure
out patterns for codifying it so that we can actually make it all connected from computer
to computer, even across trust boundaries. This was all on the basis of you coming up
with a theoretical, infinitely scalable system. And if I understand correctly,
your solution involves fax machines.
No, not now.
That's what it was when I was a kid,
way before this time.
But that's funny.
That's good.
I used to refer to that as swivel chair integration
when you'd have to look at two screens
and type between them.
So that's the way it was way back in the day.
Now we're trying to put systems
and hook them together again.
It's an interesting metaphor because in this system, somebody orders some pens,
and I'm about to ship out the order. So then I want to tell the warehouse that I need two of
those pens. But so in the transactional world, like I could take things from the inventory
and like send out the order at the same time. But now I have a world of potential things that can go wrong,
right? So if I ask for two of X, and then I ask for two of Y, and then two of Z,
and then there's not enough Z. I have to unroll this somehow, right?
Correct.
So how do I do that?
Carefully. No, at some level, things become not about changing the pen value from five to three,
okay? It becomes about having
a thing that's there, which is I wish to order two pens. And so that's an operation. And that's
what we build when we do these loosely coupled systems. We build operations that say, this is
my intent. And then the question is, can I apply them and unapply them? And how can I reorder that?
And so how can I say, oh, gee,
I took away two pens, but now the shipment arrived with new pens. And the ordering of it doesn't
matter as I increase and decrease the inventory, as long as I don't bottom out and run out.
And even if I bottom out and run out, what am I going to do about that? Do you see what I'm
saying? Because it's not about me viewing that inventory as a number that I can just automatically
do arithmetic on. It's about me
writing down the intention to get this many things shipped. And the reordering of that is an important
concept. I talk a lot about interchangeability because the thing that I'm dealing with,
there's a bunch of them that are the same. And so I can have differing orders of that.
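A small sketch of that idea, with an invented PenInventory class and operation format: changes are recorded as intent operations (receive, reserve) against one interchangeable stock-keeping unit, so they can arrive in differing orders, as long as the count never bottoms out.

```python
# A sketch of intent-based, reorderable operations on one stock-keeping unit,
# where every pen is interchangeable with every other. The operation format
# and names here are invented for illustration.

class PenInventory:
    def __init__(self, on_hand):
        self.on_hand = on_hand
        self.log = []                # accepted operations, in arrival order

    def apply(self, op):
        """op looks like {'id': 'op-18', 'kind': 'reserve', 'qty': 2}."""
        if op["kind"] == "receive":
            self.on_hand += op["qty"]
        elif op["kind"] == "reserve":
            if self.on_hand < op["qty"]:
                # Bottomed out: a business decision, not arithmetic.
                return "backorder"
            self.on_hand -= op["qty"]
        self.log.append(op)
        return "ok"

inv = PenInventory(on_hand=5)
# Because the pens are interchangeable, these two operations could arrive in
# either order and the inventory ends up the same, as long as nothing
# bottoms out along the way.
print(inv.apply({"id": "op-1", "kind": "reserve", "qty": 2}),
      inv.apply({"id": "op-2", "kind": "receive", "qty": 12}),
      inv.on_hand)   # ok ok 15
```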
You see that when you go to a hotel, the hotel gives you a reservation for a king-size non-smoking room, but they don't give you a reservation for room 301.
And you actually, the rooms are equivalent.
They're categorized as equivalent, even though, you know, one of them is next to the elevator and a bit noisier.
But they're still equivalent, so you're selling one of the pool. That allows them to deal with people checking out late, checking out early, assign you any one of that pool, and still have a successful completion
to their commitment. The same thing happens when you have pens in inventory. You typically have
a stockkeeping unit, which has got a certain number of pens, and each is the same as the other.
It is that ability to talk about computing operations, having interchangeable resources
that allows the ordering to be swizzled around and still be successful. And that's a huge thing
that we have to deal with as we're talking about bringing systems together. And I have a partially
written paper on interchangeability that I hope to get done soon. To me, I think the interchangeability
makes sense. But there's also a feeling, as an outsider who's never worked on databases, that I'm building my own kind of database system. So my inventory system,
it gets a message that this person needs two pens. And you kind of have to set those aside,
right? It's almost like locking on something and then waiting until the order is confirmed.
And then it certainly is a form of locking. There's no doubt about it. And there are actually business decisions that get involved in there. And I wrote about this in a paper called Building on Quicksand in 2009. One of the conditions, one of the things to think about is, do I want to over provision? That means I'm selling you pens. And for every pen, I'm going to actually have the pen and I can walk into the warehouse and touch it, right? So I promised it to you and it is there. And if things go well, and assuming I don't run
over the pens with the forklift, right, then I will be able to fulfill every committed order.
That's one choice. But that also means I'm spending a lot of money for inventory that
isn't being used. It happens that way. What I can do is overbook. I can say, I'm going to sell 10% more pens than I've got, the way airlines overbook seats all the time. And so filling the airplane means
airplane seats are cheaper and it works out. And that's a business choice they make,
which can be annoying when we get told we can't get on the plane, but it's a business choice
they're making. What I'm pointing out is this isn't a database. This is just not a database.
In databases, we have reads and writes, and we make it look like there's a scratch board that
you can write things on and do some relational algebra and get back a set of answers. And that's fine. That's a
phenomenal building block. But then on top of it, you have to have abstractions for what makes a
business tick. And those abstractions include long-running work. If you think about it,
database transactions don't have a notion of long-running. They make it look like it all
happened at one point in time. You say begin transaction, you see your stuff, you say, end transaction or abort, whatever you're going
to do. But it's not a long running thing with other things getting in the way. To put other
things getting in the way, which is necessary to do time-based work, you have to have this
interchangeability. And even more than interchangeability, you have to deal with
uncertainty and the probabilities of, gee, I bottomed out because I ran out of pens because I overpromised.
Isn't there uncertainty inside the database? Like in the part I don't have to deal with
between begin and end transaction, it has to cover something.
We map it all and make you either see a committed or aborted transaction.
Yeah.
Yeah, there's uncertainty. Things go wrong. But you don't see that as things go wrong when you're
building an app on top of it. It gets collapsed into a simple answer, yes or no.
Yeah. And now in a distributed system, these concerns are now mine.
Correct. And what I believe we have not done a good enough job in doing is figuring out how to
make that easier for the programmer. We've done some, but not all the way that we need to do to
talk about what does it mean to have interchangeable things? What are the business rules for overbooking versus over provisioning that class of stuff?
And so there's room for us to do those things over time as an industry.
Yeah. Maybe this connection isn't as there as much as I want it, but like in a database,
let's say I write something, I send some sort of right to the database and it commits it,
but it like crashes before it updates all the values. Is that
possible? Only if we have a bug. No, I mean, that's our job, to not have that show up
to you. I'm not saying it's never broken. Databases sometimes have bugs, but they usually don't. And
that's our job to never have that happen. But to do that, we narrow the abstraction of what you can
do to have records in a database and you're updating a set of them. And we know how to deal with making sure that all the things you updated are all or nothing.
Yeah. I was trying to draw this metaphor between the one system sending a message to the other
is something like writing something to like the transaction log. And once it's there,
it's committed, but maybe not everything's been updated to reflect the data that's in
the transaction log. Right. In databases, you get into the visibility of that effect, okay?
Where you're correct, it's in the log, but we have to make sure as an implementer of
a database that you can't see the value before the logged entry, okay?
So that's part of the job.
But you're correct if I'm building a loosely coupled system where the app is doing some
work and generating a message, and then the message is going to rattle over to the other
machine, and then it's going to do some stuff.
There's a window of visibility where you can look between the machines
and you can see you don't see the change on both of them, right?
Yes.
And to make you see the change on both of them,
then we have to hold something still.
We could hold the source still so that you can't see what the transaction emitted
while we're trying to get it to the second guy,
or we could try to hold the second guy still.
It just ends up the model isn't quite right when you're saying, oh, the app can see the two databases in between. And so we deal with that. Now, what that means is the messaging system has
to guarantee the delivery of the message over time. And it has to then make sure that the message is
applied exactly once and isn't applied twice because
you don't want to pull two orders of two pens out, right? So there's a whole set of things.
Idempotence is a fascinating area. Idempotence is a fancy word, a mathematical word
that we use in computing: idem means I said it before and potent means still good.
And so idem, I-D-E-M, is like when you see id. at the bottom of a footnote, which is,
yeah, yeah, yeah, that previous footnote's got the stuff that I'm talking about, right?
Oh, I never thought of that connection.
Yeah, that's what the word comes from, right?
And so idem, I said it before, and potent, it's still good, okay?
So it's okay to say it many, many times.
And I like to say sweeping the floor is idempotent.
It's naturally idempotent.
Reading a record, given that any value from a record over time is going to be within that
window of time is okay. It's idempotent. Read it a second time. You get a good record, right?
Yes.
Right. And I have a paper called Idempotence is Not a Medical Condition.
That was fun getting that title through. It's a super and powerful thing. And knowing whether
you've applied it before, so the paper that just hit the web today, Identity by Any Other Name, talks about the fact that in a banking
system with paper checks, you buy these printed paper checks and they have the routing number,
which says what bank you're going to. It has the account number, which says which account within
the bank. And then it has a check number. So that is a unique ID for that operation.
So you put the piece of paper and you fill out, I want to send a hundred bucks to Adam.
Great. So it's going to go send a hundred bucks to Adam and it goes into your account and then
it hits my account. Well, when it hits my account, the bank writes down all that stuff,
including the check number. It says, I've cleared this check for a hundred bucks.
Fine. So now if the bank, if that check comes around a second time, no, it's already cleared.
We don't have to send another a hundred bucks. And furthermore, banks usually put a time limit
of a year on the check. So the check has a date on it. And so if that check shows up two years
from now, well, you would potentially pay the a hundred bucks a second time because you're not
remembering things two years out, but the check is rejected because the date is in excess. It's more than a
year after the date of the check. So the combination of the date on the check and this rule means that
the bank doesn't have to remember every check forever, which is important for them to implement
idempotence of clearing the check. That makes sense. So in the microservices world, I have
one service that's sending messages to another,
let's say via Kafka or something, right? I'm just recounting to see if I'm understanding
this correctly. And say each message that it receives has some sort of ID. So in order to
make sure that it doesn't replay messages, we need to track what IDs it's processed.
Yeah, but that's really hard in a microservice world. Okay, when you say microservice,
I think dozens of equivalent services on different computers and a load balancer in front of it. And when I send a second message, most of the time it will get to the same actual microservice server as the first message. But there are many reasons why it doesn't always. The server can fail, things are busy, it gets rerouted. So it goes to a different one of the dozens of servers that now has to figure out what to do with it.
So if you want to make something which is stateful be idempotent, and you're writing down whether
you did or did not do it, you have to write it down in a way where each of the sibling
microservices can see it in order to enforce that. And so that can be done. You pick this thing up and you run
to a key value store based upon a key, which has to do with me sending a message to Adam,
the Adam service. And then it goes and it looks up and it says, oh yeah, I've already seen that.
And that could be done. But you have to think carefully about what happens because
microservices means they go away sometimes. And the value of microservices in a scalable cloud environment is huge. Figuring out how to deal with state includes how do you make sure that idempotent
work that's already happened doesn't happen a second time, unless it doesn't matter,
which is possible.
I guess what I'm getting at is, if I understand what you're saying, you're saying the expiry
of the check is important.
So if I have a service, whatever, it's stateless, but it writes to some store and it receives messages and it may receive one more than once. So it needs to keep track
somehow of whether it's applied one or not. However, the expiry date limits how many it
has to keep track of. That's one thing that can occur. In the bank clearing the check,
they notice if they send the hundred bucks a second time. Okay. So there's a problem if you
actually do the work a second time. For many things you do, many things you do with
microservices also, it really doesn't matter if you do the work a second time. It's not a big deal.
So the real question is, if you want to promise idempotent behavior to the caller,
you have to take apart what did the operation do? Did it matter if I did it more than once?
And if so, how did I make sure that the part that matters is protected from doing it more than once? That then gets more
challenging. What's an example where it's innately replayable? Oh, heck, go get me the product
catalog information for product XYZ out of the cache for my retail site.
Reads, specifically.
Right.
Reads are a fine way to do it, right?
And so they're just naturally idempotent.
And there are layers of abstraction in the system.
The read and the processing of the read request is probably monitored and logged just for tracking the correctness of the service.
And that monitoring
and logging doesn't matter because if you process the read twice and it's logged twice, that's fine,
but it didn't do any harm to do that. So you're always trying to take apart when does it do harm
and when doesn't it do harm. Now, if you're also calling something to add something to the shopping
cart and that service then goes and opens a value in a key value store and reads
it and adds something to the shopping cart and writes it back, well, that's fine if you do that
twice with the same item, as long as the act of doing it has some unique ID, which allows you to
say, oh, geez, I already put this in the shopping cart. No problem. That makes sense. Yeah. So
people, they design these things for how do you make it be okay to do the requests idempotently.
If you don't make it okay to do them idempotently, you're going to have a problem.
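Here is a minimal sketch of that design, assuming an in-memory stand-in for the store shared by all the sibling instances; the request shape, the one-year expiry, and the helper names are invented to echo the check-clearing example, not taken from any real service.

```python
# A sketch of making a stateful request handler idempotent: every request
# carries a unique ID (like a check number), and the IDs already processed
# are recorded somewhere every sibling instance can see, with an expiry
# (like the one-year rule on checks) bounding how much must be remembered.

import time

ONE_YEAR = 365 * 24 * 3600
shared_store = {"seen": {}, "carts": {}}   # pretend this is visible to all instances

def handle_add_to_cart(request, now=None):
    now = now or time.time()
    seen = shared_store["seen"]

    # Throw away dedup records older than the expiry window.
    for rid, t in list(seen.items()):
        if now - t > ONE_YEAR:
            del seen[rid]

    if request["request_id"] in seen:
        return "duplicate, ignored"          # this "check" already cleared

    # In a real system this claim and the cart update would need to commit
    # together; here they are just two dictionary writes.
    seen[request["request_id"]] = now
    cart = shared_store["carts"].setdefault(request["user_id"], [])
    cart.append(request["item"])
    return "added"

req = {"request_id": "r-42", "user_id": "adam", "item": "pen"}
print(handle_add_to_cart(req), handle_add_to_cart(req))   # added duplicate, ignored
```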
Now, these are not challenges that you deal with in a classic two-tier system surrounding a database
because there you're much less thinking about a failure restart of the work.
And if you do, you're typically making the restart of the work idempotent because the
way you go and
update the database means that it's okay to do the second transaction and it bumps into the first
transaction and say, oh gee, I already did that. And at some point you had worked on distributed
transaction systems, which propose a different solution to this problem, right? For me, okay,
the concern that I found when I first did distributed transactions, it was back at Tandem.
And I'm super proud of the time I had at Tandem in the 1980s. And there you had two to 16 processors and any processor could fail and it kept on going and the system was super resilient. And so if you
were doing a distributed transaction across multiple clusters, multiple of these clusters
across a wide area network, it was okay because those guys didn't
really go away for extended periods of time. Then when you started applying the distributed
transactions to single servers spread across a distance typically, but the server goes down
and the resolution of the transaction is stuck, right? It's just stuck behind that.
Then that means the locks in the database on the other nodes are stuck behind it. And so the entire
app kind of gets to be more brittle. There are many solutions that I'm interested in where you're
keeping the state of the transaction in a cluster and it's more vibrant. So I don't love two-phase
commit, but I actually think it's okay if the participants don't go away for long periods of
time. And there's a whole bunch of tricks to make that happen. And so I stand by my life beyond distributed transactions as an application pattern that
is super important for people to look at.
But I also believe there's still more innovations that are going on.
Spanner is very interesting, for example.
Lots of things are very interesting.
And I don't think we're done seeing growth in that area.
By the way, I have a model for people who are not super familiar with distributed transactions
and what it's like. The way I characterize two-phase commit is as follows. Imagine you're getting married on the telephone, okay? And your spouse and the minister are somewhere else. And so you're on the phone and you say, I do. And then you hear a dial tone. The real burning question is, are you allowed to date? And that's the fundamental algorithm behind two-phase commit.
And so you don't want to get locked up. And another thing that I've said is that people
are pretty comfortable with the CAP conjecture, which is the Consistency Availability Partition
Tolerance, CAP. And the conjecture, and the theorem, is pick two out of the three, because you can't have
three out of the three. You can have consistency and availability as long as things don't break, but things break. So that one isn't useful.
You can have consistency and partition tolerance, but you're going to lock up if things go bad,
that kind of thing. And so one of the quotes I had from the 80s is that two-phase commit is
the anti-availability protocol. It picks perfect consistency over availability. That is its job. That is called success. And I'm not saying that's necessarily bad. I'm saying it fits into a landscape that's complex and challenging.
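A toy, single-process illustration of that blocking behavior: the participant below has voted yes and may not commit, abort, or release anything on its own until the coordinator's decision arrives. It is a sketch of the failure mode, not a real two-phase commit implementation.

```python
# Toy sketch: a prepared two-phase-commit participant waiting on the
# coordinator. If the answer never comes (the dial tone), the participant
# is in doubt and its locks stay held. Everything here is illustrative.

import queue
import threading

decision_channel = queue.Queue()        # how the coordinator's verdict would arrive
held_locks = {"account_row_42": "locked"}

def participant():
    print("participant: prepared, voted yes, holding locks")
    try:
        # Block until the coordinator answers. A timeout only turns a hang
        # into an in-doubt transaction; the participant still can't commit
        # or abort on its own without risking inconsistency.
        decision = decision_channel.get(timeout=2.0)
        print("participant: coordinator said", decision)
    except queue.Empty:
        print("participant: dial tone -- in doubt, locks still held:", held_locks)

t = threading.Thread(target=participant)
t.start()
t.join()   # in this run the coordinator never calls back
```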
That helps explain why distributed transactions aren't used that much, because you're giving up availability, right? So if something gets blocked anywhere, it blocks everything.
And as the number of participating nodes goes up,
the probability that one or more of them are unavailable goes up. So if my data is distributed among eight servers
and my transaction crosses them, then anyone can stop the whole thing.
Correct.
And that's not good when each of these things goes away for a while.
Now, we are getting better about making them not go away for a while, but that's not been
what we've dealt with in the last 10, 20 years on a whole.
There's exceptions, but on a whole.
I think it explains how, so then as the amount of data I have in my distributed data store
grows, the more distributed transaction approaches is likely to stop.
Correct.
Now, what's kind of helping with this, which is really interesting,
is that in cloud computing, you're seeing a separation from compute and storage.
So I have a compute box, which is like a database or something,
and it's writing to a store, but the store is actually not on that physical server.
It's off on three different physical servers that are spread around the data center.
Now, if this compute box goes away,
you can pop a new one up
and within minutes or something like that,
it can be running and now finishing off
what was partially done
because all of the knowledge, the log, the database,
all that stuff is out on three other random servers
and you can make sure that at least one of them
you can get to.
So we just make each instance like a cluster.
Well, again, separate storage from compute. make sure that at least one of them you can get to. So we just make each instance like a cluster.
Well, again, separate storage from compute.
So the storage is just hundreds, thousands,
tens of thousands of servers that are writing things out there.
And everything they write is on three different copies.
So I can do the math on the failure of the servers
in the data center,
and with three copies it ends up being more likely
to lose the whole data center than to lose all three copies
by losing servers, by a lot. Okay, the data center will get a meteorite that hits it or something, right?
Because when one of those servers goes away, you find an empty server or some other server,
and you say, go from two copies back to three copies, and now you got three.
So how quickly does that go down? And the math on that is something that we all do when we think about
how to build a distributed system. So the storage is spread around. Okay, so fine. So now I got a
database. The database is writing logs and it's reading from the storage. Great. Where are the
logs going? They're going spread around. And where's the database going? Spread around.
So if that server goes away, you just kind of scratch your head and look around and this is
automatic, but it's not a human, right? Scratch your head, look around and say, okay, let's pop up this empty server over
here and make it run. And then it reads the data and it puts it all back together. And so the time
to then come back to life is now not gated on getting a repairman to fix the one server in
the previous design. Now that means that two phase commit is getting less painful. It's still
challenging and it's got times when it's challenging,
and we could talk more about that for a great period of time.
But it's getting less challenging when the individual participants are more available.
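As a back-of-envelope illustration of the three-copies math Pat mentioned a moment ago, here is a tiny calculation; the failure rate and repair window are made-up numbers, and it treats replica failures as independent.

```python
# Rough sketch of why three copies plus prompt re-replication makes losing
# the data by server failure far less likely than losing the data center.
# The numbers below are assumptions for illustration only.

annual_server_failure_rate = 0.05          # assumed: 5% of servers fail per year
repair_window_hours = 1.0                  # assumed: time to re-replicate a lost copy

p_one_dies_in_window = annual_server_failure_rate * repair_window_hours / (365 * 24)
p_all_three_die = p_one_dies_in_window ** 3

print(f"P(one replica dies inside the repair window) ~ {p_one_dies_in_window:.1e}")
print(f"P(all three die before re-replication)       ~ {p_all_three_die:.1e}")
```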
One thing a system like that would allow is consistency of reads.
You mentioned before with the relational database, right?
Like I can read the data immediately after I update it.
I can even read it in the same transaction, right? And it should be. It must be. That's part of the rules.
Yeah. Yeah. But if I'm using a more modern, if I'm using a key value store that's like distributed,
I could write somewhere and possibly read from somewhere else and the changes haven't cascaded
out. Well, it depends on the store. Most of the key value stores offer what is called linearizability.
That means there's a linear sequence of changes. And another more approachable way to say that
is read your writes. If I write something and I go to read something, it's what I just wrote.
If I write something and somebody else goes to read something, it's what I just wrote.
The fancy word for that is linearizability, but it's really just read your writes.
It turns out that's what most systems do because most applications go completely haywire if
they don't have that characteristic.
But a fact of life in a distributed system, which is what you get in a data center, and
nowadays almost everything is distributed systems.
A fact of life is that that means you're writing it in three locations.
And that means that I can't just say, oh, I got two copies, that's good enough, because then there might be a stale copy floating around. Now, remember, every server that we have out
there is usually pretty fast, and it has a pretty bounded latency, except when things are kind of
tangled up, and the latency gets very large for a period of time. And that can be garbage
collection, it can be a whole bunch of other things that can go wrong. And it could just be
storms in the network, which doesn't happen too much in data centers nowadays, where you can't actually talk to the guy. He's there and he's happy, but you can't get the message through. But it looks to you like he's slow. So now when I'm trying to do a change, if I have to change three copies, then that means that I have to make sure that I wait until all three copies are there, or I officially give up on one of the copies. Now to officially give up on one of the copies,
we kind of have to shun that copy. We're not talking to him again until we like decide to
bring him back in. Then we bring him back in, there's going to be all this work to do.
Actually taking and shunning, which is just my word, it's not the official word, but to push it
out of that membership of three means you have to decide that server's bad, which takes sort of a
minute in most systems.
So now you're writing something that was taking you 10 milliseconds, and now it's taking you a minute. And so that's a trade-off associated with being able to read what you wrote. Because if you
don't do that, there might be old copies floating around, and sometimes you'll bump into the old
copies. So you had described this as predictable answers result in unpredictable latencies.
Yeah. And so in the paper Standing on Distributed Shoulders of Giants, there's a section called Heisenberg Was on the Right Track, which talks about one technique
to avoid it, but it's only useful when you structure the data in certain ways.
So does that mean we want, that we should be giving up on reading our writes in certain cases?
Well, it does in certain cases. It depends what's more important. Is it more important to have a
quick answer or is it more important to have a perfect answer? And the tagline I have for that
is, do you want it right or do you want it right now? If you have a shopping experience for users
that are buying junk and you're showing the product catalog or you're showing the shopping cart or you're showing stuff, you know, measurements show that if you wait too long to give them their answer, they go do the dishes and they don't buy anything. And it's fine if they see a slightly stale product catalog out of your caches because the cache is being updated irregularly. And first you see the
new one, then you see the old one, then you see the new one, and nobody gets too excited about it
because it kind of works and it's okay. So it depends upon the business. If the business wants
the latency bounding of making sure that the user is getting the answer quickly, even if it's
not the latest and greatest answer, that's fine. It's a business requirement. But you have to be thoughtful about it. So the business requirements dictate kind of
our storage strategy. I'm saying that they should, yeah. And that we don't pay enough attention to it.
So if I understand, like rewinding, if we have data that's too big to fit on a single machine,
we have to start distributing it. So if we use distributed transactions, there's some potential
availability issues. So then we need to decide, assuming we didn't decide on Spanner or something, whether we want it right or right now, whether we want it fast to write down or accurate to read.
Well, it's the same discussion of whether the app sees a ginormous database running across many, many servers and distributed transactions, or whether they actually do the kinds of things I talked
about with entities and workflows and messaging in the app, which will also be more resilient and
more rapid in its behavior, but doesn't get tangled up when things break. And the right versus right
now is less about distributed transactions and more about any individual store in a key value store. That was how I was applying it.
Yeah, yeah. No, I was thinking once you're in this land of a key value store that is distributed,
you need to decide whether you want it to wait for everything to be
written or whether you want it to just write what it can and then go.
Yes. But usually people want it to be read your writes because they don't know how to think
about what the app would do otherwise. And are you saying that your entity model overcomes this?
No, I'm saying that it's at a different granularity. We're conflating two things that
we shouldn't conflate, and I'm sorry. The entity model doesn't overcome it. The entity model
gives you a mechanism for scale that tolerates systems going out and the system can continue to work.
So one interesting tidbit I picked up from your paper was this concept of getting the answer
right now. One way that it could be overcome is if your data is actually immutable. How would
that help you? Well, the wonderful thing about immutable data, one of the wonderful things,
is that I don't have to worry about writing two versus three replicas and getting a stale
version of it because there is no stale version. The data associated with what I'm writing has
only one value across all space and time. The unique ID that I have is only going to be for
that stuff. So I don't have to get excited about, did I write two versus three copies
from the standpoint of stomping out an old version. I want to get
excited about how many copies I have so that if something fails, I don't lose the data. But that's
a different discussion than do I have correct update to it. And so I'm writing it out. I'm not
replacing a previous value because there was no previous value. And so now it's not, I can write
it anywhere. I can write it on any three machines. And if one of the first three is stuck, I'll just
pick the fourth or the fifth and go put
it back later on when things get unstuck.
It kind of sounds like a cheat, like a trick.
You're like, okay, updating things is hard, so we just won't update.
Yeah, but when you build systems out of stuff that doesn't update, the whole distributed
system thought process inverts in a positive way.
Now, you have an obligation.
How do I deal with not updating things?
And that
ends up giving you some additional challenges. You have to think about a database very much like
an accountant's log of changes. And there's some well-established techniques for this,
where you write the changes and then you continue to merge and coalesce them so you can actually
find the latest thing for that key. This is called a log-structured merge system. But in all cases,
you're only writing out brand new stuff with a brand new identifier. And the storage system is far easier because you
don't have these corner cases when systems get wedged. And when you look up and down our
computing stack, everything from how do we do the hardware inside of an SSD up to how we're
building a database is inverting to the point where we're only writing data that you never update.
You only logically update it by writing new things into the journal, describing what should
be different.
I think an example will help.
So let's say you mentioned accounting.
So let's, what about a bank account?
So making changes to a bank account, how would I represent that?
Well, the way we do at the bank, right?
You have debits and credits and you get the monthly statement.
It says this changed and that changed and this changed and that changed.
And we started from this and now we have these
up and then we ended that. And that's the bank account. And then you roll them up and then you
have a yearly summary and all this kind of stuff is based upon just adding new artifacts of
knowledge. In the middle of the month, a check clears, it's added to the list. It's added to
your bank account. At the end of the month, you do the analysis of what the monthly statement is and some
stuff's added to your bank account.
You just keep adding to your bank account.
Eventually, kind of when the year gets wrapped up, you peel it off into archive.
You can go back and see it, but it's not as easy to see it.
You have to kind of look a little harder.
And so you're constantly adding information and you're never deleting and you're never
changing anything.
You're adding knowledge.
So somewhere in the system, I still would like to know my balance. So something has to roll these up, I assume. Correct. But sometimes to read your balance online in the middle of the
month, you have to have an algorithm that comes in and looks at the balance as of the beginning
of the month and apply the debits and credits as you're trying to figure out right here and now,
what's the balance? You apply the debits and the credits and say, oh, okay, and now you
have the summary. But the thing that's at the beginning of the month, which is your bank balance,
it kind of motivates for you. It bounds how much you have to do that because it's rolled everything
up to the beginning of the month. But you're actually calculating in situ to give an answer
in the middle of the month and then you can get the balance right now. So any given lookup of the balance will start at the last snapshot and roll forward?
Yeah, and apply the changes. And organizing the system so that that's reasonably efficient
is a part of the game. But the facts are, in accounting, if you use an eraser, you go to jail.
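A small sketch of that read path, with invented field names: start from the last monthly snapshot and apply the debits and credits recorded since, never erasing anything.

```python
# Sketch of computing "the balance right now" from an append-only ledger:
# last monthly snapshot plus the entries added since. Field names are
# illustrative, not from any real banking system.

def balance_right_now(monthly_snapshot, entries_since_snapshot):
    balance = monthly_snapshot["closing_balance"]
    for entry in entries_since_snapshot:
        if entry["kind"] == "credit":
            balance += entry["amount"]
        elif entry["kind"] == "debit":
            balance -= entry["amount"]
    return balance

snapshot = {"month": "2019-02", "closing_balance": 1200}
entries = [
    {"kind": "debit",  "amount": 100, "memo": "check #1042"},
    {"kind": "credit", "amount": 250, "memo": "deposit"},
]
print(balance_right_now(snapshot, entries))   # 1350
```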
And so one of the titles of one of the sections in one of my papers is accountants don't use
erasers. And so much of this is described in the paper.
Immutability changes everything.
Thinking about distributed systems data as a collection of immutable artifacts absolutely inverts the way you think about the system.
And it's cool.
So how so?
I just have these things that somebody wrote somehow and I'm looking at them.
And so they're artifacts and they're there and they're going to be there until maybe they're deleted, but they're never updated. So the question is, how do I interpret
the bank balance out of that? And again, even when you're in the middle of the implementation
of an SSD chip and they're copying forward the pages that you're updating and applying the write
to a word that you're doing inside of a page, they're doing it by creating immutable artifacts
and copying them forward. And so that's just one example of what you're seeing up and down the stack
in the way we deal with data is increasingly about writing something down, never changing it,
and then doing something to semantically show the change.
So how do we semantically show the change? You mentioned log something.
Well, log structured merge system is a technique
that was first documented in 94 by some friends of mine. And it says, look, I'm going to have keys
and I'm going to have values. And the semantics I'm going to offer is I'm going to update the
values associated with the keys or add new keys or delete keys. And I'm just going to give you
a system that lets you modify the key values. Now that's cool. As I'm doing that though,
what I'm going to do is you make a change and I'm going to put it in a transaction log,
just like a database. And then before I tell you the transaction is committed to make those
changes, then I'll make sure that that database is written down, that log is written down in a
place where I can find it. Great. So now I have all this stuff in memory that is the new values.
Okay, fine. So I'm going to write it out into a file, which is structured by the keys and the
values.
And it's got the changes that happened, say, in the last five minutes.
And so I keep doing that.
But now I have a stack of changes representing five minute intervals.
And after a year, that's a real drag to deal with, right?
So what you do is whenever you write a new one out, you now have below that a tree of
things where at the next level down, it's maybe got 10 times as many files and they're organized by key.
So there's the first one-tenth of the keys and the second one-tenth of the keys and the third and so forth and so on.
And then I'm going to take that new one that came out representing five minutes and I'm going to sort-merge it with the ones below and go from 10 to 11 files because I've mixed the keys within that. And then I'm going to pick
one of those, and I'm going to go down to the next level, which has got 100 files, and the one-tenth
of the key range I have covers one-tenth of the key range when there's
100 of them. And I'm going to merge it with 10 files below it, because it had 100 files, right?
So it's roughly that way. And so now I'm going to merge it down. And as you're constantly creating
data, you're merging it by key down into what becomes a
tree that gets fatter and fatter and fatter by key value.
So now when I want to look something up by key, I have a reasonably small number of places
to read.
Sort of depends on how, you know, 10 to the fifth or 10 to the sixth is like five or six
places to read.
And so I go look for the key in those five or six places, and it's reasonably bounded.
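Here is a minimal sketch of that read path, assuming the levels are just in-memory dicts standing in for sorted, immutable files; real LSM systems add bloom filters, tombstones, and compaction policy on top of this shape.

```python
# A sketch of an LSM-style lookup: check the in-memory recent changes first,
# then each on-disk level from newest to oldest, stopping at the first hit.

def lsm_get(key, memtable, levels):
    """memtable: dict of the most recent writes.
    levels: sorted, immutable runs (here just dicts), newest first."""
    if key in memtable:
        return memtable[key]
    for sorted_run in levels:          # roughly five or six places to look
        if key in sorted_run:
            return sorted_run[key]     # newest version wins; older versions
                                       # further down are never reached
    return None

memtable = {"k7": "v7-new"}
levels = [{"k3": "v3"}, {"k7": "v7-old", "k9": "v9"}]
print(lsm_get("k7", memtable, levels))   # v7-new
```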
Are the old values, the old values are still found? They're just further down in the tree?
Correct. But there's a whole bunch of sub details in this. Like if I have a different value for the
same key, do I destroy the old one as I merge it in? And usually you do. So you only have the
latest version there. But that's an artifact of your specific implementation. But yes, you can discard the old values when you merge them together.
So at a higher level than an SSD.
By the way, this is the plumbing underneath HBase.
This is the plumbing underneath RocksDB and LevelDB.
There's a set of technologies that are pretty well established for the last handful of years
that are out using this technique, which is called a log-structured merge system. And again, these systems, these key value stores,
run on top of storage where they give the storage immutable files.
They come up with a completely long, unique number for the file,
and they write this file out, and they read it as long as they need to,
and then they go delete it.
They never update it.
If I understand, one of the advantages of the immutable data
is what you called having data on the outside of your system?
That's a different discussion.
I mean, I know where you're going, but that's a different discussion.
It's okay.
Let's move on from immutable.
It's at a higher level of semantic is why I said that.
And that was a paper I wrote in 2005.
It is related.
I'm retracting my sentence.
Let's talk about it.
I was becoming aware in 2003 and 2004 and five that there was data that was outside of databases
that we weren't talking about.
People were squirting files around
and there were documents on the web
and there was all this stuff and messages and files.
And we were kind of in the database community,
not looking at that so much.
And so my nomenclature, which did not stick,
nobody else uses it.
It's just from one paper,
is that data from a database
is inside the database. It's inside and it's relational and it's got all these really cool
properties. But when you take data and you send it out on a wire and you push it out or you write
it to a file system or you send it across the web or whatever you're going to do with it, that data
tends to emerge as being semi-structured. And at the time, the leading semi-structured thing was
XML. And I think XML is fine. I think JSON is fine. I have no problem with either of them. But it's very nice to
have this semi-structured glop. And that thing has got an ID and maybe a version, and you write it,
and you never change it. And so it is immutable, but at a higher level of discussion. It does fit
into the immutability discussion. I'm sorry. And so you're writing that stuff. And now the question is, what do you do with it?
One of the really interesting things is that when you write it out, the metadata you put
with it is just describing what you meant at the time.
So you were going to say something.
Yeah.
In my mind, it all fits together because when I think of this, it makes me think of like
when I hear immutable data, I think of like messages or like Kafka topics, right?
Yes.
And you mentioned RocksDB.
I'm going to pull this all together because that's often used like along with Kafka as
a way of like summarizing this immutable data, like a service consumes and kind of rolls
things up somehow into RocksDB.
Some fascinating products out there.
That's really cool stuff.
So I still have this feeling that it's like somehow the distributed systems become giant
databases where like these messages, these immutable messages are kind of like the transaction
log and then individual services, maybe making some sort of materialized view out of them by
like processing them. Well, there is a perspective, which is that I'm sitting where I'm in the
universe and I've had an event horizon of knowledge arriving to me. And that knowledge can be, you
know, archived in my brain in
a store, which says I got all this stuff. Now, these things are going to have a regular metadata,
irregular shape and irregular form. And so the question is, can I, could I, should I put a
projection on the shape and the form of what I've received into a place where I want to do queries
and analysis on that? And so I tend to think of every message that comes in as having its descriptive metadata that describes what's in that message. I tend to think of being
able, when you want to do a relational system where you want to be able to do queries against
it, I think of that as needing a prescriptive database, which is thou shalt fit into this shape.
So one of the questions becomes, how do I take these things which were from a different
origin and a different metadata, and their description of their shape is not what I want,
how do I shoehorn it in?
And I just finished writing a paper that's not even edited yet called Extract, Shoehorn,
and Load.
And it's all about that whole notion that everything that transforms is typically a lossy transformation
from the descriptive metadata into the prescriptive
shape of where I want to do the analysis.
The descriptive-prescriptive dichotomy. So it's like, with an old-school database I have some sort of insert X, but in an event system
I might just have something happened and the system reacts to it. So that's a...
Each event is going to have its descriptive metadata. Each event coming in, it's like,
you know, oh, here's my event ID. Here's the time at which it was issued. Here is
the alarm that went off. Here's what was going on. And here's what I mean by this stuff. And it's
immutable and you can't go back and revisit what the metadata is. It is what it was and it was what
it is. And so that's fine. Those things are little artifacts of knowledge that have their own
description of the metadata for them. The question is, how do you shoehorn that into a pool of stuff where there's a consistent enough interpretation
of that that you can do a query? Yeah, which sounds hard.
Well, yeah, but people are doing it. When I talk about this stuff, it's just kind of: how can we
give ourselves better tools to talk about it, and maybe make what we do better, or at least still understand it.
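To ground the extract-shoehorn-and-load idea, a minimal sketch under assumed names: a made-up alarm event with its own descriptive metadata gets mapped onto a made-up prescriptive table. The event fields, the sqlite schema, and the dropped "notes" field are all illustrative, not from the episode; the lossiness is the point.

```python
import sqlite3

# A hypothetical incoming event with its own descriptive metadata. The shape and
# field names are what the *sender* meant at the time; they are assumptions here.
event = {
    "event_id": "evt-9f2",
    "issued_at": "2019-03-01T08:15:00Z",
    "kind": "alarm",
    "detail": {"alarm": "temp-high", "sensor": "rack-12", "celsius": 41.7},
    "notes": "operator paged",          # extra context that won't fit the table
}

# The prescriptive side: "thou shalt fit into this shape."
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alarms (event_id TEXT, issued_at TEXT, sensor TEXT, celsius REAL)")

def shoehorn(evt: dict) -> tuple:
    """Extract, shoehorn, and load: map descriptive fields onto the fixed columns.
    Anything that doesn't fit (here, 'notes' and 'kind') is simply dropped,
    which is the lossy part of the transformation."""
    d = evt["detail"]
    return (evt["event_id"], evt["issued_at"], d["sensor"], d["celsius"])

conn.execute("INSERT INTO alarms VALUES (?, ?, ?, ?)", shoehorn(event))
print(conn.execute("SELECT * FROM alarms").fetchall())
```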
Yeah. So it seems like a lot of your writing maybe has to do with cataloging
some of these patterns that you see. Is that?
Junk I think of. Man, I'm staring at the ceiling all the time and annoying my wife.
It's like, where the hell are you? I'm thinking about some crazy idea. And then I have to try to
figure out how to organize it. And you have to think about the crazy idea for sometimes years
before you have it organized well enough to make a story out of it. Because if
you don't have a story, you don't have a paper. And if you don't have a purpose and a point to
make, you don't have a paper. And so it just takes a lot of noodling in order to make it happen.
So you're saying you're writing down ideas, not so much observing what's out there,
but trying to think about how things could be?
No, I'm just, well, could be and are. Okay. Just trying to explain the behavior of things,
if you understand. The paper that I just got put up on the web called Identity by Any Other Name
is the observation that we are using identifiers for all the junk to hook these systems together.
And then, you know, walking from example to example to example of how identifiers are the
backbone that makes it work. And it even got more crystallized in my mind when I was writing it
that immutability is a relationship between an identifier and stuff that doesn't change.
And that was not something I thought of saying that way when I wrote the paper
Immutability Changes Everything. But it is a relationship: I got this identifier and here's the stuff. And by the way, things are sometimes immutable at one perspective and changing at
others. If I pick a byte stream up
and I move it from one underlying store to another, is it the same? Is the King James Bible
immutable with different editions, with different fonts, with different pictures next to it,
but the same sequence of letters because it's the King James Bible? There's like 6,000 King
James Bibles you can order on Amazon. They're all different, but they're all the same. Okay,
so what's immutable? And so this one paper I wrote, called Side Effects, Front and Center, is trying to articulate the fact that when you want to ask
these questions about immutability or certain behaviors, you're doing it from the perspective of
an abstraction layer and not from a layer below. If I do an insert into a database and abort the
transaction, and that causes the underlying block to split in the database, the record is removed.
So now when I read the set of records, it's the same. Did I change anything? Well, the answer is yes and no. I changed things
from the level of the blocks. I did not change things from the level of the records, right?
Yeah. It depends on what level you're paying attention to or care about.
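A toy sketch of that yes-and-no, and not how any real database manages its pages: records live in small fixed-capacity blocks, an aborted insert still leaves a block split behind, and whether "anything changed" depends on which layer you ask.

```python
BLOCK_CAPACITY = 2  # deliberately tiny so an insert forces a split

class TinyStore:
    def __init__(self):
        self.blocks = [[]]  # list of blocks, each a small list of records

    def insert(self, record):
        last = self.blocks[-1]
        if len(last) >= BLOCK_CAPACITY:   # block is full: add (split off) a new one
            self.blocks.append([])
            last = self.blocks[-1]
        last.append(record)

    def delete(self, record):
        for block in self.blocks:
            if record in block:
                block.remove(record)      # the earlier split is NOT undone
                return

    def records(self):
        return sorted(r for block in self.blocks for r in block)

store = TinyStore()
store.insert("a"); store.insert("b")
before_records = store.records()
before_blocks = [list(b) for b in store.blocks]

# "Insert and abort": the insert forces a new block, the abort removes the record.
store.insert("c")
store.delete("c")

print(store.records() == before_records)                 # True: records unchanged
print([list(b) for b in store.blocks] == before_blocks)  # False: block layout changed
```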
Right. And that's true even for idempotence. I want to do the thing exactly once, but do I
really care if I did it twice and created two logging records, that the thing was processed twice? No. Do I care that when I hit this server the second time,
it caused my memory manager to allocate some stuff and do a malloc and things were messed up in the
heap? No.
Yeah. So you've been working with data for a long time. Where do you think somebody should start today who's building a new system?
I think you're going to say it depends on your business requirements.
Well, yeah.
I mean, one of two things is happening, right?
Either you're in school and you have a harebrained idea that you're trying to flesh out.
And that's all cool.
I'm all in favor of that.
Or you walk into work and, you know, your boss says, this is something we need.
And so now you have to try to figure out how you're going to practically do that. And in our world of just ever-increasing resources, both in terms of
software packages we can get and cloud computing and all these things, you want to figure out an
effective way to accomplish what the business needs, leveraging technology that's around.
So there is no one answer. Of course, for each of us as engineers, we need to know that we're
learning and growing and enjoying our team and enjoying our job while we continue to contribute to what our company needs us to do. That's kind of how it rolls. That was a non-answer, I think.
Well, I mean, I don't think there is a right answer. I think that's one thing I've learned from your discussions, that the business requirements should dictate how we store things, the constraints we put on it, how we interact with it. Yeah, I completely agree. And that's not just a one size fits all, because some of these things,
you have to make a choice. And the choice has this benefit and this, you know, drawback. And
the other choice has got different benefit and a different drawback. And that's fine. It's just
about a matter of educating ourselves and figuring out how to apply those things.
So if you had to choose between the model you outlined
in your theoretical infinitely scaling system, or something like Google Spanner, which would you choose?
I mean, I think Google's Spanner is fascinating, but I don't think it quite works
across trust boundaries. That's one issue. So if one of the participants is nefarious,
then how much does that get you tangled up? But it's way the heck easier for a lot of uses.
And so I have a ton of respect for it.
It's really interesting.
No one answer.
Well, this has been a lot of fun, Pat.
Thanks so much for the time.
I will put links to your ACM articles in the show notes.
Thank you.
And thank you very much for your time.
And I appreciate the interview.
It's very nice to meet you virtually.
And I hope to meet you physically, Adam.
Thanks.
All right.
That was the show.
I hope you enjoyed it.
I'm trying to make all these interviews evergreen and hoping that people in the future will
still find them useful.
If you are listening to this sometime after March 2019, hello from the past.
I hope you like the episode and let me know what you
thought. Shoot me an email or whatever. Mention me on Twitter or DM me or however people in the
future communicate. I have no idea. Until next time, take care.