CoRecursive: Coding Stories - Tech Talk: Data and Scale with Pat Helland - The long view on distributed databases

Episode Date: March 31, 2019

Tech Talks are in-depth technical discussions. Pat Helland has a wealth of knowledge on building distributed data stores. He has been working on them since 1978, when he worked on the Tandem fault-tolerant database, and he has been involved in many distributed database projects since. Here is the key thing: he is also a master at explaining the key ideas of distributed systems using simple language and practical everyday examples. Can you get married on the phone? How are messaging systems and idempotence like regional offices communicating via fax machine? These are the types of metaphors that Pat uses. Today, Pat sits down with me and teaches me about dealing with data in a distributed, fault-tolerant, infinitely scaling world. Webpage: http://corecursive.com/028-data-pat-heland/ Links: Pat's articles on ACM Queue: Mind Your State for Your State of Mind, Consistency Eventually, Standing on Distributed Shoulders of Giants, The Singular Success of SQL, Life Beyond Distributed Transactions

Transcript
Starting point is 00:00:00 Hello, this is Adam Gordon-Bell. Join me as I learn about building software. This is CoRecursive. You have to think about the crazy idea for sometimes years before you have it organized well enough to make a story out of it. Because if you don't have a story, you don't have a paper. And if you don't have a purpose and a point to make, you don't have a paper. So it just takes a lot of noodling in order to make it happen.
Starting point is 00:00:27 That was Pat Helland. He's been working with databases since 1978. Today, we talk about where we should store data. So should we store it in a relational database on a single server? Should we store it in some sort of distributed database? What are the trade-offs that various stores represent? Questions like, could we make an infinitely scalable service? How does immutable data make distributing things easier? If you liked this episode, give us a review on iTunes or Twitter or mention it on Reddit or Hacker News or wherever. Hey, I'm Adam Gordon-Bell and this is CoRecursive. Today I'm speaking with Pat Helland.
Starting point is 00:01:18 Pat, thanks for joining me. Well, thank you for having me. I'm looking forward to our chat. I am as well. Pat, you work at Salesforce and I've written down here that you've basically been working on data storage and writing bits to disk since 1978. Correct. I had hair at that time. I don't know, but I've been working on databases and storage and transactions and distributed systems and built a multiprocessor and working on application platforms and stuff since 1978. It's been a long and checkered career of building systems and then having thoughts about it and sometimes having the opportunity to write that down and share. Yeah, you have a lot of great articles, which I think will cover some of that ground on the ACM. It's actually one of my favorite things to do is to carve off of half a day or a day to work on the next paper.
Starting point is 00:02:06 And I'm always boring my friends with the titles and ideas for papers that are not yet written. Yeah, you have great titles. Thank you. I'm trying to think. I forget now. I can't think of one. Well, just right now today on ACM Queue online was Identity by Any Other Name, which is talking about how, when you get into a distributed system, things are knit together by identifiers. And you may have different names for the identifiers: they're pointers, they're IDs, they're whatever. But it's pretty much identifiers that are tying everything together once you get into a distributed system.
Starting point is 00:02:38 Yeah. Standing on the shoulders of distributed giants is another one. I just like your titles. You tend to be playful with them. Yeah. It's standing on the distributed shoulders of giants, which plays off of the physics and physicists because it's, you know, picking up five different physicists and giggling about how they're the things that for which they're famous kind of preceded our distributed systems ideas. It's a pedagogical mechanism to explain how the heck distributed systems work. And I think that one's just kind of fun and silly. And I had a good time with it. So here's why I brought you here, data stores
Starting point is 00:03:10 and distributed systems. And when did everything become so complex? Like I really love relational databases and for a long time, I felt like that's kind of all I needed. But now I live in a world and there's like Kafka and key value stores and so on. And I'm trying to decide when which is the right tool. So you're here to be my teacher, hopefully. Well, let me give you a perspective. Okay, I fell in love with building databases and transaction systems in 78 because the thing I absolutely loved is that the API was begin transaction,
Starting point is 00:03:41 do some stuff into transaction. And underneath it, we needed hundreds of thousands of lines of code and now growing into millions of lines of code to just deal with all the goop that happens environmentally so that the user doesn't have to think about that. And that was just a phenomenal, phenomenal advancement back in the day when we were running on one computer. And it became under more pressure as we're trying to run over thousands of computers because you just can't pull everything into one place at one time. So what happens in that thousands of lines of code between begin and end transaction?
Starting point is 00:04:19 Well, you have to organize the data so you can locate it on disk efficiently. For relational systems, you have to create the inverted indices so that you can look things up by the different values that are kept in there. And you have to make sure it happens atomically. You have to deal with the fact that if something croaks or dies, how are you going to put it back together? Because that's a very important part: you want to lower the burden for the developer that's using the system by taking those complexities on within the system. You and I talked about it just a bit ago: one of the short papers I wrote is called The Singular Success of SQL. That's what I based the column I write for ACM on; I titled the column Escaping the Singularity: It's Not Your Grandmother's Database Anymore. The point I wanted to make there is that to make relational algebra work, to do relational systems, you have to make all the data there,
Starting point is 00:05:18 all the data accessible and all the data still, because you need to be able to do these set oriented operations across all this stuff and not get tangled up in other people changing other things and not get tangled up in the data not being here now. And so it kind of works when you bring everything together to one point in space and together at one point of time. And that's known as a singularity when you collapse space and time. Now, transactions make it look like you've collapsed time, even though you might have other stuff running against the database at the same time. And that's kind of part of the fun of those. But the model,
Starting point is 00:05:53 the abstraction is one point in space and time. Now, our systems are so darn big, so darn complex, that doesn't work so much anymore. It only works as one subset of the solutions that it takes to compose a broader system. So at what size does that break down? I don't know. I mean, there's no black and white. You know, it works pretty well for a handful of computers that are close together or one computer certainly. But when you start getting into the scale of big web solutions, it becomes a challenge. And that becomes one of the design points we have to deal with. And then you get into people who have quite legitimately, you know, it's really an interesting thing, the key value store that's scalable, that you shard, it's a fascinating data point. And I wrote a paper 2007 called life beyond
Starting point is 00:06:41 distributed transactions and apostates opinion. Because I was a believer in distributed transactions. I believe they solved all the problems of the free world. And then I started seeing things getting brittle around the edges and getting challenging and availability being a challenge and one system went down and that kind of stuff. And so people start doing other things. And in this new world, what we see the people doing is sharding their data so that it can run across multiple machines. And then there are patterns by which you can have your workflow do steps, one transaction at a time. Step one here, step two there, step three there, and allow the system to scale.
Starting point is 00:07:16 And so that's what that paper from 2007 is about. So sharding is basically, so you're saying, oh, my data is too big now for one relational database, so I could divide it into many relational databases, partitioning it over some key? Maybe. You can. The question is, okay, how do I identify that piece, right? And so if the system is going to pick that piece up, because of the system that's holding it, the server that's holding it is getting too fat, right? It's got too many pieces. It can't run more, right? So now it's going to pick some stuff up and move it from this server to that server. Okay, that's cool. But what that means is I have to have a bite-sized piece, and I called it an entity when I was doing this in this paper. It's just a term. It doesn't have to be the only
Starting point is 00:07:57 term. And the idea is that you can have transactions within that entity, but I can't have transactions across identities because if I can't do transactions across identities because if I can't do transactions across systems and I move one of these things from this system to that, then it becomes a challenge because now I've got to do transactions across systems if I had a semantic in my app of working across multiple entities to do the transaction. So how do you design for that? You start out and you say, this is all I'm ever going to use for a transaction. So how do you design for that? You start out and you say, this is all I'm ever going to use for a transaction. So if it's working from one entity to another, and there's another one in the same server, I'm not going to do transactions across them because one might get
Starting point is 00:08:34 moved to another. And fundamentally, you have to pick an identifier for these things, which is the boundary. It's the boundary within which you do transactions. And so that's a trick people use. Let me make sure I understand that. You're saying that we have an entity and we're limiting our transaction scope to a single entity. Correct. And so actually, do you have an example maybe that would put some meat on this? So say I'm working on one particular order. Okay. I'm processing the work for a workflow to ship some stuff for an order. Okay. I'm processing the work for a workflow to ship some
Starting point is 00:09:06 stuff for an order. Okay. And the order identity and the order information and the information about who's the shipper and the information about the inventory that I've got from here and there, it all fits into one entity and I can put it on one box. Matter of fact, I can put hundreds of thousands of them on one server, but as time goes on, there's going to be more of them that fit on a single computer. And so if I can't do transactions across computers, because that's a challenge, then how do I make sure that I stick within it? Well, I only work on one order at a time. And then I commit that work. And then I work on the next order at a time. And so now the question is, how do I hook up the order processing to the shipper? Well, I probably have messaging that goes from this box to the computers or computers that
Starting point is 00:09:52 hold the shipping information for the shipper. And so then I'll send something off to there. And that composition allows me to then expand and scale the system without things breaking. Yeah. And also the place where I get concerned is it's also complicated. So in my previous, if I had my relational database, I could update the order table and the inventory table at the same time. Right. And now I can no longer do that. Is that right? In many cases for many large distributed systems. Yes. And so it adds a challenge in building the application because you can no longer transparently view all these tables and just do what you want to do. You're
Starting point is 00:10:29 correct. But it also builds a system that can handle scale and continue to function. And so I have my order, say I ordered three pens, and then there's an inventory system that tracks the pens. And so I send a message to the inventory system, say, I would like three pens. Something like that. Yeah. You have to design what the protocol is. Yes. And you get into this world of how many pens do I order? How many pens did I ship? Have I promised more pens than I've actually got? There's all the overbooking challenges. Applications are a fascinating and fun system. And the question is, what do you build it on top of? But there is no doubt that building a scalable distributed system has more challenges
Starting point is 00:11:12 than a centralized one. But let me give you a twist on that too. I may be having my company dealing with my suppliers and my customers, but I have to do that kind of work across trust boundaries anyway. I have to do that kind of planning for how do I do my piece and then send a message to the partner to do their piece. And so that kind of work is what we've been dealing with for decades as we build systems that connect other companies. You know, when I was a kid, you connected companies by sending faxes to someone who added a terminal who then typed it into the local computer and the computer only dealt with your company's stuff, right? And then you would
Starting point is 00:11:49 print things out and then you would send them via fax or snail mail to the other company. And so that meant there were people in the loop to try to deal with some of these challenges of ordering and reordering. Do I have enough inventory and did I promise more than I've got? And so where we're dealing with what's going on now is much of that stuff we're trying to figure out patterns for codifying it so that we can actually make it all connected from computer to computer, even across trust boundaries. This was all on the basis of you coming up with a theoretical, infinitely scalable system. And if I understand correctly, your solution involves fax machines.
Starting point is 00:12:27 No, not now. That's what it was when I was a kid, way before this time. But that's funny. That's good. I used to refer to that as swivel chair integration when you'd have to look at two screens and type between them.
Starting point is 00:12:39 So that's the way it was way back in the day. Now we're trying to put systems and hook them together again. It's an interesting metaphor because in this system, so somebody orders some pens and I'm about to ship out the day. Now we're trying to put systems and hook them together again. It's an interesting metaphor because in this system, so somebody orders some pens, and I'm about to ship out the order. So then I want to tell the warehouse that I need two of those pens. But so in the transactional world, like I could take things from the inventory and like send out the order at the same time. But now I have a world of potential things that can go wrong,
Starting point is 00:13:05 right? So if I ask for two of X, and then I ask for two of Y, and then two of Z, and then there's not enough Z. I have to unroll this somehow, right? Correct. So how do I do that? Carefully. No, at some level, things become not about changing the pen value from five to three, okay? It becomes about having a thing that's there, which value from five to three. Okay. It becomes about having a thing that's there, which is I wish to order two pens. And so that's an operation. And that's what we build when we do these loosely coupled systems. We build operations that say, this is
Starting point is 00:13:37 my intent. And then the question is, can I apply them and unapply them? And how can I reorder that? And so how can I say, oh, gee, I took away two pens, but now the shipment arrived with new pens. And the ordering of it doesn't matter as I increase and decrease the inventory, as long as I don't bottom out and run out. And even if I bottom out and run out, what am I going to do about that? Do you see what I'm saying? Because it's not about me viewing that inventory as a number that I can just automatically do arithmetic on. It's about me writing down the intention to get this many things shipped. And the reordering of that is an important
Starting point is 00:14:11 concept. I talk a lot about interchangeability because the thing that I'm dealing with, there's a bunch of them that are the same. And so I can have differing orders of that. You see that when you go to a hotel, the hotel gives you a reservation for a king-size non-smoking room, but they don't give you a reservation for room 301. And you actually, the rooms are equivalent. They're categorized as equivalent, even though, you know, one of them is next to the elevator and a bit noisier. But they're still equivalent, so you're selling one of the pool. That allows them to deal with people checking out late, checking out early, assign you any one of that pool, and still have a successful completion to their commitment. The same thing happens when you have pens in inventory. You typically have a stockkeeping unit, which has got a certain number of pens, and each is the same as the other.
Starting point is 00:14:59 It is that ability to talk about computing operations, having interchangeable resources that allows the ordering to be swizzled around and still be successful. And that's a huge thing that we have to deal with as we're talking about bringing systems together. And I have a partially written paper on interchangeability that I hope to get done soon. To me, I think the interchangeability makes sense. But there's also a feel of things as an outsider who's never worked on databases that I'm building my own kind of database system. So my inventory system, it gets a message that this person needs two pens. And you kind of have to set those aside, right? It's almost like locking on something and then waiting until the order is confirmed.
Starting point is 00:15:39 And then it certainly is a form of locking. There's no doubt about it. And there are actually business decisions that get involved in there. And I wrote about this in a paper called Building on Quicksand in 2009. One of the conditions, one of the things to think about is, do I want to over provision? That means I'm selling you pens. And for every pen, I'm going to actually have the pen and I can walk into the warehouse and touch it, right? So I promised it to you and it is there. And if things go well, and assuming I don't run over the pens with the forklift, right, then I will be able to fulfill every committed order. That's one choice. But that also means I'm spending a lot of money for inventory that isn't being used. It happens that way. What I can do is overbook. I can say, I'm going to sell 10% more pens up all the time. And so filling the airplane means airplane seats are cheaper and the money and it works out. And that's a business choice they make, which can be annoying when we get told we can't get on the plane, but it's a business choice they're making. What I'm pointing out is this isn't a database. This is just not a database. In databases, we have reads and writes, and we make it look like there's a scratch board that
Starting point is 00:17:03 you can write things on and do some relational algebra and get back a set of answers. And that's fine. That's a phenomenal building block. But then on top of it, you have to have abstractions for what makes a business tick. And those abstractions include long-running work. If you think about it, database transactions don't have a notion of long-running. They make it look like it all happened at one point in time. You say begin transaction, you see your stuff, you say, end transaction or abort, whatever you're going to do. But it's not a long running thing with other things getting in the way. To put other things getting in the way, which is necessary to do time-based work, you have to have this interchangeability. And even more than interchangeability, you have to have the
Starting point is 00:17:41 dealing with uncertainty and the probabilities of, gee, I bottomed out because I ran out of pens because I overpromised. Isn't there uncertainty inside the database? Like in the part I don't have to deal with between begin and end transaction, it has to cover something. We map it all and make you either see a committed or aborted transaction. Yeah. Yeah, there's uncertainty. Things go wrong. But you don't see that as things go wrong when you're building an app on top of it. It gets collapsed into a simple answer, yes or no. Yeah. And now in a distributed system, these concerns are now mine.
Starting point is 00:18:12 Correct. And what I believe we have not done a good enough job in doing is figuring out how to make that easier for the programmer. We've done some, but not all the way that we need to do to talk about what does it mean to have interchangeable things? What are the business rules for overbooking versus over provisioning that class of stuff? And so there's room for us to do those things over time as an industry. Yeah. Maybe this connection isn't as there as much as I want it, but like in a database, let's say I write something, I send some sort of right to the database and it commits it, but it like crashes before it updates all the values. Is that possible? Only if we have a bug. No, I mean, it's not, that's our job is to not have that show up
Starting point is 00:18:51 to you. I'm not saying it's never broken. Databases sometimes have bugs, but they usually don't. And that's our job to never have that happen. But to do that, we narrow the abstraction of what you can do to have records in a database and you're updating a set of them. And we know how to deal with making sure that all the things you updated are all or nothing. Yeah. I was trying to draw this metaphor between the one system sending a message to the other is something like writing something to like the transaction log. And once it's there, it's committed, but maybe not everything's been updated to reflect the data that's in the transaction log. Right. In databases, you get into the visibility of that effect, okay? Where you're correct, it's in the log, but we have to make sure as an implementer of
Starting point is 00:19:31 a database that you can't see the value before the logged entry, okay? So that's part of the job. But you're correct if I'm building a loosely coupled system where the app is doing some work and generating a message, and then the message is going to rattle over to the other machine, and then it's going to do some stuff. There's a window of visibility where you can look between the machines and you can see you don't see the change on both of them, right? Yes.
Starting point is 00:19:53 And to make you see the change on both of them, then we have to hold something still. We could hold the source still so that you can't see that the transaction emitted where we're trying to get anything to the second guy, or we could try to make the second guy still. It just ends up the model isn't quite right when you're saying, oh, the app can see the two databases in between. And so we deal with that. Now, what that means is the messaging system has to guarantee the delivery of the message over time. And it has to then make sure that the message is applied exactly once and isn't applied twice because
Starting point is 00:20:25 you don't want to pull two orders of two pens out, right? So there's a whole set of things. Item potence is a fascinating area. Item potence is a fancy word that it's a mathematical word that we use in computing, which means item means I said it before and potent means still good. And so item, I-D-E-M is like when you see ID at the bottom of a footnote, which is, yeah, yeah, yeah, that previous footnote's got the stuff that I'm talking about, right? Oh, I never thought of that connection. Yeah, that's what the word comes from, right? And so item, I said it before, and potent, it's still good, okay?
Starting point is 00:20:58 So it's okay to say it many, many times. And I like to say sweeping the floor is idempotent. It's naturally idempotent. Reading a record, given that any value from a record over time is going to be within that window of time is okay. It's idempotent. Read it a second time. You get a good record, right? Yes. Right. And I have a paper called Idempotence is Not a Medical Condition. That was fun getting that title through. It's a super and powerful thing. And knowing whether
Starting point is 00:21:22 you've applied it before, so the paper that just hit the web today, Identity by Any Other Name, talks about the fact that in a banking system with paper checks, you buy these printed paper checks and they have the routing number, which says what bank you're going to. It has the account number, which says which account within the bank. And then it has a check number. So that is a unique ID for that operation. So you put the piece of paper and you fill out, I want to send a hundred bucks to Adam. Great. So it's going to go send a hundred bucks to Adam and it goes into your account and then it hits my account. Well, when it hits my account, the bank writes down all that stuff, including the check number. It says, I've cleared this check for a hundred bucks.
Starting point is 00:22:00 Fine. So now if the bank, if that check comes around a second time, no, it's already cleared. We don't have to send another a hundred bucks. And furthermore, banks usually put a time limit of a year on the check. So the check has a date on it. And so if that check shows up two years from now, well, you would potentially pay the a hundred bucks a second time because you're not remembering things two years out, but the check is rejected because the date is in excess. It's more than a year after the date of the check. So the combination of the date on the check and this rule means that the bank doesn't have to remember every check forever, which is important for them to implement idempotence of clearing the check. That makes sense. So in the microservices world, I have
Starting point is 00:22:43 one service that's sending messages to another, let's say via Kafka or something, right? I'm just recounting to see if I'm understanding this correctly. And say each message that it receives has some sort of ID. So in order to make sure that it doesn't replay messages, we need to track what IDs it's processed. Yeah, but that's really hard in a microservice world. Okay, when you say microservice, I think dozens of equivalent services on different computers and a load balancer in front of it. And when I send a second message, most of the time it will get to the same actual microservice server as the first message. But there are many reasons why it doesn't always. The server can fail, things are busy, it gets rerouted. So it goes to a different one of the dozens of servers that now has to figure out what to do with it. So if you want to make something which is stateful be item potent, and you're writing down whether you did or did not do it, you have to write it down in a way where each of the sibling
Starting point is 00:23:38 microservices can see it in order to enforce that. And so that can be done. You pick this thing up and you run to a key value store based upon a key, which has to do with me sending a message to Adam, the Adam service. And then it goes and it looks up and it says, oh yeah, I've already seen that. And that could be done. But you have to think carefully about what happens because microservices means goes away sometimes. And the value of microservices in a scalable cloud environment is huge. They're out how to deal with state includes how do you make sure that item potent work that's already happened doesn't happen a second time, unless it doesn't matter, which is possible. I guess what I'm getting at is, if I understand what you're saying, you're saying the expiry
Starting point is 00:24:38 of the check is important. So if I have a service, whatever, it's stateless, but it writes to some store and it receives messages and it may receive one more than once. So it needs to keep track somehow of whether it's applied one or not. However, the expiry date limits how many it has to keep track of. That's one thing that can occur. In the bank clearing the check, they notice if they send the hundred bucks a second time. Okay. So there's a problem if you actually do the work a second time. For many things you do, many things you do with microservices also, it really doesn't matter if you do the work a second time. It's not a big deal. So the real question is, if you want to promise idempotent behavior to the caller,
Starting point is 00:25:18 you have to take apart what did the operation do? Did it matter if I did it more than once? And if so, how did I make sure that the part that matters is protected from doing it more than once? That then gets more challenging. What's an example where it's innately replayable? Oh, heck, go get me the product catalog information for product XYZ out of the cache for my retail site. Reads, specifically. Right. Reads are a fine way to do it, right? And so they're just naturally item potent.
Starting point is 00:25:56 And there are layers of abstraction in the system. The read and the processing of the read request is probably monitored and logged just for tracking the correctness of the service. And that monitoring and logging doesn't matter the correctness of the service. And that monitoring and logging doesn't matter because if you process the read twice and it's logged twice, that's fine, but it didn't do any harm to do that. So you're always trying to take apart when does it do harm and when doesn't it do harm. Now, if you're also calling something to add something to the shopping cart and that service then goes and opens a value in a key value store and reads it and adds something to the shopping cart and writes it back, well, that's fine if you do that
Starting point is 00:26:30 twice with the same item, as long as the act of doing it has some unique ID, which allows you to say, oh, geez, I already put this in the shopping cart. No problem. That makes sense. Yeah. So people, they design these things for how do you make it be okay to do the requests idempotently. If you don't make it okay to do them idempotent, you're going to have a problem. Now, these are not challenges that you deal with in a classic two-tier surrounding a database because there you're much less thinking about a failure restart of the work. And if you do, you're typically making the restart of the work idempotent because the way you go and update the database means that work item potent because the way you go and
Starting point is 00:27:05 update the database means that it's okay to do the second transaction and it bumps into the first transaction and say, oh gee, I already did that. And at some point you had worked on distributed transaction systems, which propose a different solution to this problem, right? For me, okay, the concern that I found when I first did distributed transactions, it was back at Tandem. And I'm super proud of the time I had at Tandem in the 1980s. And there you had two to 16 processors and any processor could fail and it kept on going and the system was super resilient. And so if you were doing a distributed transaction across multiple clusters, multiple of these clusters across a wide area network, it was okay because those guys didn't really go away for extended periods of time. Then when you started applying the distributed
Starting point is 00:27:49 transactions to single servers spread across a distance typically, but the server goes down and the resolution of the transaction is stuck, right? It's just stuck behind that. Then that means the locks in the database on the other nodes are stuck behind it. And so the entire app kind of gets to be more brittle. There are many solutions that I'm interested in where you're keeping the state of the transaction in a cluster and it's more vibrant. So I don't love two-phase commit, but I actually think it's okay if the participants don't go away for long periods of time. And there's a whole bunch of tricks to make that happen. And so I stand by my life beyond distributed transactions as an application pattern that is super important for people to look at.
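The idempotent add-to-cart that Pat described a moment ago can be sketched roughly like this. This is a minimal in-memory illustration, not any real store's API; the names and the dedup-by-request-ID scheme are assumptions for the sketch:

```python
import uuid

# Hypothetical in-memory key-value store: cart_id -> list of line items.
carts = {}

def add_to_cart(cart_id, item_id, request_id):
    """Add an item to a cart; safe to retry with the same request_id."""
    cart = carts.setdefault(cart_id, [])
    # The unique ID per logical action is what makes the retry harmless:
    # if we've already seen it, we say "oh, geez, I already put this in
    # the shopping cart" and do nothing.
    if any(line["request_id"] == request_id for line in cart):
        return
    cart.append({"request_id": request_id, "item_id": item_id})

# A client generates one request ID per logical action, then may retry it.
req = str(uuid.uuid4())
add_to_cart("cart-1", "book-42", req)
add_to_cart("cart-1", "book-42", req)   # retried after a timeout: no duplicate
assert len(carts["cart-1"]) == 1
```

The point is that the retry is decided by the unique ID of the action, not by looking at the item itself, so buying the same book twice on purpose (two different request IDs) still works.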
Starting point is 00:28:30 But I also believe there's still more innovations that are going on. Spanner is very interesting, for example. Lots of things are very interesting. And I don't think we're done seeing growth in that area. By the way, I have a model for people who are not super familiar with distributed transactions and what it's like. The way I characterize two-phase commit is as follows. Imagine you're getting married on the telephone, okay? And your spouse and the minister are somewhere else. And so you're on the phone and you say, I do. And then you hear a dial tone. The real burning question is, are you allowed to date? And that's the fundamental algorithm behind two-phase commit. And so you don't want to get locked up. And another thing that I've said is that people are pretty comfortable with the CAP conjecture, which is the Consistency Availability Partition
Starting point is 00:29:16 Tolerance, CAP. And the conjecture, and the theorem, is pick two out of the three, because you can't have three out of the three. You can have consistency and availability as long as things don't break, but things break. So that one isn't useful. You can have consistency and partition tolerance, but you're going to lock up if things go bad, that kind of thing. And so one of the quotes I had from the 80s is that two-phase commit is the anti-availability protocol. It picks perfect consistency over availability. That is its job. That is called success. And I'm not saying that's necessarily bad. I'm saying it fits into a landscape that's complex and challenging. That helps explain why distributed transactions aren't used that much, because you're giving up availability, right? So if something gets blocked anywhere, it blocks everything. And as the number of participating nodes goes up, the probability that one or more of them are unavailable goes up. So if my data is distributed among eight servers
Starting point is 00:30:15 and my transaction crosses them, then any one of them can stop the whole thing. Correct. And that's not good when each of these things goes away for a while. Now, we are getting better about making them not go away for a while, but that's not been what we've dealt with in the last 10, 20 years on the whole. There's exceptions, but on the whole. I think it explains how, so then as the amount of data I have in my distributed data store grows, the more likely the distributed transaction approach is to get stuck.
Starting point is 00:30:44 Correct. Now, what's kind of helping with this, which is really interesting, is that in cloud computing, you're seeing a separation from compute and storage. So I have a compute box, which is like a database or something, and it's writing to a store, but the store is actually not on that physical server. It's off on three different physical servers that are spread around the data center. Now, if this compute box goes away, you can pop a new one up
Starting point is 00:31:07 and within minutes or something like that, it can be running and now finishing off what was partially done because all of the knowledge, the log, the database, all that stuff is out on three other random servers and you can make sure that at least one of them you can get to. So we just make each instance like a cluster.
Starting point is 00:31:24 Well, again, separate storage from compute. So the storage is just hundreds, thousands, tens of thousands of servers that are writing things out there. And everything they write is on three different copies. So I can do the math on the failure of the servers in the data center, and three copies ends up being more likely
Starting point is 00:31:42 to lose the whole data center than to lose all three copies by losing servers, by a lot. Okay, the data center will get a meteorite that hits it or something, right? Because when one of those servers goes away, you find an empty server or some other server, and you say, go from two copies back to three copies, and now you got three. So how quickly does that go down? And the math on that is something that we all do when we think about how to build a distributed system. So the storage is spread around. Okay, so fine. So now I got a database. The database is writing logs and it's reading from the storage. Great. Where are the
Starting point is 00:32:14 logs going? They're going spread around. And where's the database going? Spread around. So if that server goes away, you just kind of scratch your head and look around, and this is automatic, it's not a human, right? Scratch your head, look around and say, okay, let's pop up this empty server over here and make it run. And then it reads the data and it puts it all back together. And so the time to then come back to life is now not gated on getting a repairman to fix the one server in the previous design. Now that means that two-phase commit is getting less painful. It's still challenging and it's got times when it's challenging, and we could talk more about that for a great period of time.
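The "do the math on the failure of the servers" that Pat mentioned can be sketched as a back-of-the-envelope calculation. The numbers here are made up for illustration, not real data-center figures; the model just assumes independent failures and a short re-replication window:

```python
# Back-of-the-envelope durability math with assumed numbers: each server
# fails independently, and after the first replica dies, the system copies
# the data to a fresh server within `repair_hours`.
p_fail_per_year = 0.03          # assumed annual failure probability per server
repair_hours = 1.0              # assumed time to go from two copies back to three
hours_per_year = 24 * 365

# Chance a given server happens to die inside one repair window.
p_in_window = p_fail_per_year * repair_hours / hours_per_year

# Losing the object means the other two replicas also die inside that window.
p_lose_all_three = p_fail_per_year * p_in_window ** 2
print(f"{p_lose_all_three:.2e}")   # tiny: losing the whole data center is likelier
```

Under these assumptions the triple-loss probability is many orders of magnitude below the chance of losing the building, which is Pat's meteorite point.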
Starting point is 00:32:48 But it's getting less challenging when the individual participants are more available. One thing a system like that would allow is consistency of reads. You mentioned before with the relational database, right? Like I can read the data immediately after I update it. I can even read it in the same transaction, right? And it should be. It must be. That's part of the rules. Yeah. Yeah. But if I'm using a more modern, if I'm using a key value store that's like distributed, I could write somewhere and possibly read from somewhere else and the changes haven't cascaded out. Well, it depends on the store. Most of the key value stores offer what is called linearizability.
Starting point is 00:33:26 That means there's a linear sequence of changes. And another more approachable way to say that is read your writes. If I write something and I go to read something, it's what I just wrote. If I write something and somebody else goes to read something, it's what I just wrote. The fancy word for that is linearizability, but it's really just read your writes. It turns out that's what most systems do because most applications go completely haywire if they don't have that characteristic. But a fact of life in a distributed system, which is what you get in a data center, and nowadays almost everything is distributed systems.
Starting point is 00:34:00 A fact of life is that that means you're writing it in three locations. And that means that I can't just say, oh, I got two copies. That's good enough. Now, remember, every server that we have out there is usually pretty fast, and it has a pretty bounded latency, except when things are kind of tangled up, and the latency gets very large for a period of time. And that can be garbage collection, it can be a whole bunch of other things that can go wrong. And it could just be storms in the network, which doesn't happen too much in data centers nowadays, where you can't actually talk to the guy. He's there and he's happy, but you can't get the message through. But it looks to you like he's slow. So now when I'm trying to do a change, if I have to change three copies, then that means that I have to make sure that I wait until all three copies are there, or I officially give up on one of the copies. Now to officially give up on one of the copies, we kind of have to shun that copy. We're not talking to him again until we like decide to bring him back in. Then we bring him back in, there's going to be all this work to do.
Starting point is 00:35:13 Actually, taking and shunning, which is just my word, it's not the official word, but to push it out of that membership of three means you have to decide that server's bad, which takes sort of a minute in most systems. So now you're writing something that was taking you 10 milliseconds, and now it's taking you a minute. And so that's a trade-off associated with being able to read what you wrote. Because if you don't do that, there might be old copies floating around, and sometimes you'll bump into the old copies. So you had described this as predictable answers result in unpredictable latencies. Yeah. And so in the paper Standing on the Distributed Shoulders of Giants, there's a section called Heisenberg Was on the Right Track, which talks about one technique to avoid it, but it's only useful when you structure the data in certain ways.
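One way to see the trade Pat is describing, waiting on replicas in exchange for reading your writes, is a toy quorum sketch. Everything here is assumed for illustration (in-process "replicas" instead of network calls, and real stores add versioning, repair, and failure detection on top):

```python
# Toy quorum sketch of "read your writes": N replicas, writes wait for W
# acknowledgements, reads consult R replicas. Because W + R > N, every read
# quorum overlaps the latest write quorum -- the cost is waiting on replicas.
N, W, R = 3, 2, 2
replicas = [{} for _ in range(N)]   # each replica maps key -> (version, value)

def write(key, value, version):
    acks = 0
    for rep in replicas:            # in real life these are network calls
        rep[key] = (version, value)
        acks += 1
        if acks >= W:               # stop once a write quorum has acknowledged
            break

def read(key):
    # Read a different subset than the writer touched, to show the overlap.
    answers = [rep.get(key, (0, None)) for rep in replicas[-R:]]
    return max(answers)[1]          # highest version wins; stale copies lose

write("k", "v1", 1)
write("k", "v2", 2)
assert read("k") == "v2"            # the read quorum overlapped the write quorum
```

If you shrank W to 1 to make writes fast, a read of the remaining replicas could return an old copy, which is exactly the "bump into the old copies" case.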
Starting point is 00:36:00 So does that mean we want, that we should be giving up on reading our writes in certain cases? Well, it does in certain cases. It depends what's more important. Is it more important to have a quick answer or is it more important to have a perfect answer? And the tagline I have for that is, do you want it right or do you want it right now? If you have a shopping experience for users that are buying junk and you're showing the product catalog or you're showing the shopping cart or you're showing stuff, you know, measurements show that if you wait too long to give them their answer, they go do the dishes and they don't buy anything. So it's fine to show the catalog out of your caches even though the cache is being updated irregularly. And first you see the new one, then you see the old one, then you see the new one, and nobody gets too excited about it because it kind of works and it's okay. So it depends upon the business. If the business wants the latency bounding of making sure that the user is getting the answer quickly, even if it's
Starting point is 00:37:01 not the latest and greatest answer, that's fine. It's a business requirement. But you have to be thoughtful about it. So the business requirements dictate kind of our storage strategy. I'm saying that they should, yeah. And that we don't pay enough attention to it. So if I understand, like rewinding, if we have data that's too big to fit on a single machine, we have to start distributing it. So if we use distributed transactions, there's some potential availability issues. So then we need to decide, assuming we didn't decide on Spanner or something, whether we want it right or right now, whether we want it fast to write down or accurate to read. Well, there's also the same discussion of whether the app sees a ginormous database running across many, many servers and distributed transactions, or whether they actually do the kinds of things I talked about with entities and workflows and messaging in the app, which will also be more resilient and more rapid in its behavior, but doesn't get tangled up when things break. And the right versus right
Starting point is 00:38:01 now is less about distributed transactions and more about any individual store in a key value store. That was how I was applying it. Yeah, yeah. No, I was thinking once you're in this land of a key value store that is distributed, you need to decide whether you want it to wait for everything to be written or whether you want it to just write what it can and then go. Yes. But usually people want it to be read your writes because they don't know how to think about what the app would do otherwise. And are you saying that your entity model overcomes this? No, I'm saying that it's at a different granularity. We're conflating two things that we shouldn't conflate, and I'm sorry. The entity model doesn't overcome it. The entity model
Starting point is 00:38:39 gives you a mechanism for scale that tolerates systems going out and the system can continue to work. So one interesting tidbit I picked up from your paper was this concept of getting the answer right now. One way that it could be overcome is if your data is actually immutable. How would that help you? Well, the wonderful thing about immutable data, one of the wonderful things, is that I don't have to worry about writing two versus three replicas and getting a stale version of it because there is no stale version. The data associated with what I'm writing has only one value across all space and time. The unique ID that I have is only going to be for that stuff. So I don't have to get excited about, did I write two versus three copies
Starting point is 00:39:21 from the standpoint of stomping out an old version. I want to get excited about how many copies I have so that if something fails, I don't lose the data. But that's a different discussion than do I have the correct update to it. And so I'm writing it out. I'm not replacing a previous value because there was no previous value. And so now I can write it anywhere. I can write it on any three machines. And if one of the first three is stuck, I'll just pick the fourth or the fifth and go put it back later on when things get unstuck. It kind of sounds like a cheat, like a trick.
Starting point is 00:39:50 You're like, okay, updating things is hard, so we just won't update. Yeah, but when you build systems out of stuff that doesn't update, the whole distributed system thought process inverts in a positive way. Now, you have an obligation. How do I deal with not updating things? And that ends up giving you some additional challenges. You have to think about a database very much like an accountant's log of changes. And there's some well-established techniques for this,
Starting point is 00:40:14 where you write the changes and then you continue to merge and coalesce them so you can actually find the latest thing for that key. This is called a log-structured merge system. But in all cases, you're only writing out brand new stuff with a brand new identifier. And the storage system is far easier because you don't have these corner cases when systems get wedged. And when you look up and down our computing stack, everything from how do we do the hardware inside of an SSD up to how we're building a database is inverting to the point where we're only writing data that you never update. You only logically update it by writing new things into the journal, describing what should be different.
Starting point is 00:40:50 I think an example will help. So let's say you mentioned accounting. So let's, what about a bank account? So making changes to a bank account, how would I represent that? Well, the way we do it at the bank, right? You have debits and credits and you get the monthly statement. It says this changed and that changed and this changed and that changed. And we started from this and now we have these
Starting point is 00:41:06 updates and then we ended at that. And that's the bank account. And then you roll them up and then you have a yearly summary and all this kind of stuff is based upon just adding new artifacts of knowledge. In the middle of the month, a check clears, it's added to the list. It's added to your bank account. At the end of the month, you do the analysis of what the monthly statement is and some stuff's added to your bank account. You just keep adding to your bank account. Eventually, kind of when the year gets wrapped up, you peel it off into archive. You can go back and see it, but it's not as easy to see it.
Starting point is 00:41:37 You have to kind of look a little harder. And so you're constantly adding information and you're never deleting and you're never changing anything. You're adding knowledge. So somewhere in the system, I still would like to know my balance. So something has to roll these up, I assume. Correct. But to read your balance online in the middle of the month, you have to have an algorithm that comes in and looks at the balance as of the beginning of the month. As you're trying to figure out, right here and now, what's the balance, you apply the debits and the credits and say, oh, okay, and now you
Starting point is 00:42:07 have the summary. But the thing that's at the beginning of the month, which is your bank balance, it works for you. It bounds how much you have to do because it's rolled everything up to the beginning of the month. But you're actually calculating in situ to give an answer in the middle of the month, and then you can get the balance right now. So any given lookup of the balance will start at the last snapshot and roll forward? Yeah, and apply the changes. And organizing the system so that that's reasonably efficient is a part of the game. But the facts are, in accounting, if you use an eraser, you go to jail. And so one of the titles of one of the sections in one of my papers is Accountants Don't Use Erasers. And so much of this is described in the paper.
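The snapshot-plus-replay lookup Pat just walked through can be sketched in a few lines. The account data, field names, and amounts are invented for the example; the point is that the "account" is a roll-up plus an append-only list of entries, and the current balance is computed, never stored as a mutable cell:

```python
from datetime import date

# Accountant's-log model: an immutable list of debits and credits plus a
# monthly snapshot that bounds how far back a lookup has to replay.
snapshot = {"as_of": date(2019, 3, 1), "balance": 1000}   # start-of-month roll-up
entries = [                                               # appended, never changed
    {"on": date(2019, 3, 4), "amount": -250},   # a check clears
    {"on": date(2019, 3, 10), "amount": +400},  # a deposit
]

def balance_now(snapshot, entries):
    # Start from the last roll-up, then apply the debits and the credits.
    total = snapshot["balance"]
    for e in entries:
        if e["on"] >= snapshot["as_of"]:
            total += e["amount"]
    return total

print(balance_now(snapshot, entries))   # 1150
```

At month's end, a new snapshot would be written (also as a new artifact), and older entries could be peeled off into the archive without touching anything in place.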
Starting point is 00:42:46 That paper is called Immutability Changes Everything. Thinking about distributed systems data as a collection of immutable artifacts absolutely inverts the way you think about the system. And it's cool. So how so? I just have these things that somebody wrote somehow and I'm looking at them. And so they're artifacts and they're there and they're going to be there until maybe they're deleted, but they're never updated. So the question is, how do I interpret the bank balance out of that? And again, even when you're in the middle of the implementation of an SSD chip and they're copying forward the pages that you're updating and applying the write
Starting point is 00:43:19 to a word that you're doing inside of a page, they're doing it by creating immutable artifacts and copying them forward. And so that's just one example of what you're seeing up and down the stack: the way we deal with data is increasingly about writing something down, never changing it, and then doing something to semantically show the change. So how do we semantically show the change? You mentioned log something. Well, a log-structured merge system is a technique that was first documented in '94 by some friends of mine. And it says, look, I'm going to have keys and I'm going to have values. And the semantics I'm going to offer is I'm going to update the
Starting point is 00:43:55 values associated with the keys or add new keys or delete keys. And I'm just going to give you a system that lets you modify the key values. Now that's cool. As I'm doing that though, what I'm going to do is you make a change and I'm going to put it in a transaction log, just like a database. And then before I tell you the transaction is committed to make those changes, I'll make sure that that log is written down in a place where I can find it. Great. So now I have all this stuff in memory that is the new values. Okay, fine. So I'm going to write it out into a file, which is structured by the keys and the values.
Starting point is 00:44:25 And it's got the changes that happened, say, in the last five minutes. And so I keep doing that. But now I have a stack of changes representing five minute intervals. And after a year, that's a real drag to deal with, right? So what you do is whenever you write a new one out, you now have below that a tree of things where at the next level down, it's maybe got 10 times as many files and they're organized by key. So there's the lowest one-tenth of the keys and the second one-tenth of the keys and the third and so forth and so on. And then I'm going to take that new one that came out representing five minutes and I'm going to sort merge it with the ones below and go from 10 to 11 files because I've mixed the keys within that. And then I'm going to pick
Starting point is 00:45:05 one of those, and I'm going to go down to the next level, which has got 100 files, and the one-tenth of the key range I have covers 10 of those files when there's 100 of them. And I'm going to merge it with those 10 files below it, because that level had 100 files, right? So it's roughly that way. And so now I'm going to merge it down. And as you're constantly creating data, you're merging it by key down into what becomes a tree that gets fatter and fatter and fatter by key value. So now when I want to look something up by key, I have a reasonably small number of places to read.
Starting point is 00:45:35 It sort of depends on the size: you know, 10 to the fifth or 10 to the sixth is like five or six places to read. And so I go look for the key in those five or six places, and it's reasonably bounded. Are the old values still found? They're just further down in the tree? Correct. But there's a whole bunch of sub details in this. Like if I have a different value for the same key, do I destroy the old one as I merge it in? And usually you do. So you only have the latest version there. But that's an artifact of your specific implementation. But yes, you can discard the old values when you merge them together. So at a higher level than an SSD.
Starting point is 00:46:10 By the way, this is the plumbing underneath HBase. This is the plumbing underneath the RocksDB, LevelDB. There's a set of technologies that are pretty well established for the last handful of years that are out using this technique, which is called a log-structured merge system. And again, these systems, these key value stores, run on top of storage where they give the storage immutable files. They come up with a completely long, unique number for the file, and they write this file out, and they read it as long as they need to, and then they go delete it.
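The log-structured merge pattern Pat names here can be sketched in miniature. This is a toy, not how RocksDB or HBase actually implement it: updates go to an in-memory table, get flushed as immutable sorted runs, and lookups check newest runs first (real systems add levels, compaction, bloom filters, and a write-ahead log):

```python
# Toy LSM sketch: a memtable plus a list of immutable sorted runs.
memtable = {}
runs = []   # newest run first; once written, a run is never updated

def put(key, value):
    memtable[key] = value
    if len(memtable) >= 2:                       # tiny flush threshold for the demo
        runs.insert(0, sorted(memtable.items()))  # flush an immutable sorted "file"
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for run in runs:                              # newest first, so the latest wins
        for k, v in run:
            if k == key:
                return v
    return None

put("a", "1"); put("b", "2")   # second put triggers a flush
put("a", "9")                  # newer value for "a" sits in the memtable
assert get("a") == "9" and get("b") == "2"
```

Note that the old value for "a" still exists further down; a real compactor would discard it when the runs merge, exactly as Pat describes.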
Starting point is 00:46:39 They never update it. If I understand, one of the advantages of the immutable data is what you called having data on the outside of your system? That's a different discussion. I mean, I know where you're going, but that's a different discussion. It's okay. Let's move on from immutable. It's at a higher level of semantics; that's why I said that.
Starting point is 00:46:55 And that was a paper I wrote in 2005. It is related. I'm retracting my sentence. Let's talk about it. I was becoming aware in 2003, 2004, and 2005 that there was data that was outside of databases that we weren't talking about. People were squirting files around and there were documents on the web
Starting point is 00:47:11 and there was all this stuff and messages and files. And we were kind of in the database community, not looking at that so much. And so my nomenclature, which did not stick, nobody else uses it. It's just from one paper, is that data from a database is inside the database. It's inside and it's relational and it's got all these really cool
Starting point is 00:47:30 properties. But when you take data and you send it out on a wire and you push it out or you write it to a file system or you send it across the web or whatever you're going to do with it, that data tends to emerge as being semi-structured. And at the time, the leading semi-structured thing was XML. And I think XML is fine. I think JSON is fine. I have no problem with either of them. But it's very nice to have this semi-structured glop. And that thing has got an ID and maybe a version, and you write it, and you never change it. And so it is immutable, but at a higher level of discussion. It does fit into the immutability discussion. I'm sorry. And so you're writing that stuff. And now the question is, what do you do with it? One of the really interesting things is that when you write it out, the metadata you put
Starting point is 00:48:09 with it is just describing what you meant at the time. So you were going to say something. Yeah. In my mind, it all fits together because when I think of this, it makes me think of like when I hear immutable data, I think of like messages or like Kafka topics, right? Yes. And you mentioned RocksDB. I'm going to pull this all together because that's often used like along with Kafka as
Starting point is 00:48:31 a way of like summarizing this immutable data, like a service consumes and kind of rolls things up somehow into RocksDB. Some fascinating products out there. That's really cool stuff. So I still have this feeling that it's like somehow the distributed systems become giant databases where like these messages, these immutable messages are kind of like the transaction log and then individual services, maybe making some sort of materialized view out of them by like processing them. Well, there is a perspective, which is that I'm sitting where I am in the
Starting point is 00:48:58 universe and I've had an event horizon of knowledge arriving to me. And that knowledge can be, you know, archived in my brain in a store, which says I got all this stuff. Now, these things are going to have irregular metadata, irregular shape and irregular form. And so the question is, can I, could I, should I put a projection on the shape and the form of what I've received into a place where I want to do queries and analysis on that? And so I tend to think of every message that comes in as having its descriptive metadata that describes what's in that message. I tend to think of being able, when you want to do a relational system where you want to be able to do queries against it, I think of that as needing a prescriptive database, which is thou shalt fit into this shape.
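That descriptive-versus-prescriptive split can be sketched as a tiny projection. Every field name and event here is invented for the illustration; the point is that each event carries its own descriptive metadata, and squeezing them into one prescriptive table shape is a lossy transform:

```python
# Two events that mean the same thing but describe themselves differently.
events = [
    {"event_id": "e1", "issued_at": "2019-03-01T10:00:00Z",
     "alarm": "disk_full", "host": "web-7", "extra": {"free_mb": 12}},
    {"event_id": "e2", "emitted": "2019-03-01T10:05:00Z",   # different metadata!
     "kind": "disk_full", "machine": "web-9"},
]

# Thou shalt fit into this shape: (event_id, when, what, where).
def shoehorn(ev):
    return (
        ev["event_id"],
        ev.get("issued_at") or ev.get("emitted"),   # reconcile differing names
        ev.get("alarm") or ev.get("kind"),
        ev.get("host") or ev.get("machine"),
    )   # note: e1's "extra" detail is dropped -- the transform is lossy

rows = [shoehorn(e) for e in events]
assert rows[1] == ("e2", "2019-03-01T10:05:00Z", "disk_full", "web-9")
```

Once everything fits the prescriptive shape, the rows can be queried uniformly, at the price of whatever the shoehorn threw away.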
Starting point is 00:49:40 So one of the questions becomes, how do I take these things which were from a different origin and a different metadata, and their description of their shape is not what I want, how do I shoehorn it in? And I just finished writing a paper that's not even edited yet called Extract, Shoehorn, and Load. And it's all about that whole notion of everything that transforms is typically a lossy transformation from the descriptive metadata into the prescriptive shape of where I want to do the analysis. The descriptive prescriptive dichotomy. So
Starting point is 00:50:10 it's like if I write an old school database, I have some sort of insert X, but an event system, I might have like just something happened and the system reacts to it. So that's a... Each event is going to have its descriptive metadata. Each event coming in, it's like, you know, oh, here's my event ID. Here's the time at which it was issued. Here is the alarm that went off. Here's what was going on. And here's what I mean by this stuff. And it's immutable and you can't go back and revisit what the metadata is. It is what it was and it was what it is. And so that's fine. Those things are little artifacts of knowledge that have their own description of the metadata for them. The question is, how do you shoehorn that into a pool of stuff where there's a consistent enough interpretation
Starting point is 00:50:49 of that that you can do a query? Yeah, which sounds hard. Well, yeah, but people are doing it. When I talk about this stuff, it's just kind of how can we give better tools to talk about it and maybe make what we do better, but even still understand it. Yeah. So it seems like a lot of your writing maybe has to do with cataloging some of these patterns that you see. Is that right? The junk I think of, man. I'm staring at the ceiling all the time and annoying my wife. It's like, where the hell are you? I'm thinking about some crazy idea. And then I have to try to figure out how to organize it. And you have to think about the crazy idea for sometimes years
Starting point is 00:51:21 before you have it organized well enough to make a story out of it. Because if you don't have a story, you don't have a paper. And if you don't have a purpose and a point to make, you don't have a paper. And so it just takes a lot of noodling in order to make it happen. So you're saying you're writing down ideas, not so much observing what's out there, but trying to think about how things could be? No, I'm just, well, could be and are. Okay. Just trying to explain the behavior of things, if you understand. The paper that I just got put up on the web called Identity by Any Other Name is the observation that we are using identifiers for all the junk to hook these systems together.
Starting point is 00:51:54 And then, you know, walking from example to example to example of how identifiers are the backbone that makes it work. And I even was able to get more crystallized in my mind when I was writing it, that immutability is a relationship between an identifier and stuff that doesn't change. And that was not something I thought of saying that way when I wrote the paper Immutability Changes Everything. But it is a relationship. I got this identifier and here's the stuff. And by the way, things are sometimes immutable from one perspective and changing from others. If I pick a byte stream up and I move it from one underlying store to another, is it the same? Is the King James Bible
Starting point is 00:52:29 immutable with different editions, with different fonts, with different pictures next to it, but the same sequence of letters because it's the King James Bible? There's like 6,000 King James Bibles you can order on Amazon. They're all different, but they're all the same. Okay, so what's immutable? And so this one paper I wrote called Side Effects, Front and Center, is trying to articulate the fact that when you want to ask these questions about immutability or certain behaviors, you're doing it from the perspective of an abstraction layer and not from a layer below. If I do an insert into a database and abort the transaction, it causes the underlying block to split in the database, but the record is removed. So now when I read the set of records, it's the same. Did I change anything? Well, the answer is yes and no. I changed things
Starting point is 00:53:08 from the level of the blocks. I did not change things from the level of the records, right? Yeah. It depends on what level you're paying attention to or care about. Right. And that's true even for idempotence. I want to do the thing exactly once, but do I really care if I did it twice and created two logging records? The thing was processed twice? No. Do I care that when I hit this server the second time, it caused my memory manager to allocate some stuff and do a malloc and things were messed up in the heap? No. Yeah. So you've been working with basically data for a long time. Do you think that, like, where do you think somebody should start today who's building a new system? I think you're going to say it depends on your business requirements.
Starting point is 00:53:49 Well, yeah. I mean, one of two things is happening, right? Either you're in school and you have a harebrained idea that you're trying to flesh out. And that's all cool. I'm all in favor of that. Or you walk into work and, you know, your boss says, this is something we need. And so now you have to try to figure out how are you going to practically do that. And in our world of just ever increasing resources, both in terms of software packages we can get and cloud computing and all these things, you want to figure out an
Starting point is 00:54:14 effective way to accomplish what the business needs, leveraging technology that's around. So there is no one answer. Of course, for each of us as engineers, we need to know that we're learning and growing and enjoying our team and enjoying our job while we continue to contribute to what our company needs us to do. That's kind of how it rolls. That was a non-answer, I think. Well, I mean, I don't think there is a right answer. I think that's one thing I've learned from your discussions, that the business requirements should dictate how we store things, the constraints we put on it, how we interact with it. Yeah, I completely agree. And that's not just a one size fits all, because some of these things, you have to make a choice. And the choice has this benefit and this, you know, drawback. And the other choice has got different benefit and a different drawback. And that's fine. It's just about a matter of educating ourselves and figuring out how to apply those things. So if you had to choose between sort of the model you outlined
Starting point is 00:55:05 in your theoretical infinitely scaling system, or something like the Google Spanner, where would you choose? I mean, I think Google's Spanner is fascinating, but I don't think it quite works across trust boundaries. That's one issue. So if one of the participants is nefarious, then how much does that get you tangled up? But it's way the heck easier for a lot of uses. And so I have a ton of respect for it. It's really interesting. No one answer. Well, this has been a lot of fun, Pat.
Starting point is 00:55:32 Thanks so much for the time. I will put links to your ACM articles in the show notes. Thank you. And thank you very much for your time. And I appreciate the interview. It's very nice to meet you virtually. And I hope to meet you physically, Adam. Thanks.
Starting point is 00:55:46 All right. That was the show. I hope you enjoyed it. I'm trying to make all these interviews evergreen and hoping that people in the future will still find them useful. If you are listening to this sometime after March 2019, hello from the past. I hope you like the episode and let me know what you thought. Shoot me an email or whatever. Mention me on Twitter or DM me or however people in the
Starting point is 00:56:13 future communicate. I have no idea. Until next time, take care.
