Disseminate: The Computer Science Research Podcast - Tamer Eldeeb | Chablis: Fast and General Transactions in Geo-Distributed Systems | #46

Episode Date: February 12, 2024

In this episode, Tamer Eldeeb sheds light on the challenges faced by geo-distributed database management systems (DBMSes) in supporting strictly-serializable transactions across multiple regions. He discusses the compromises often made between low-latency regional writes and restricted programming models in existing DBMS solutions. Tamer introduces Chablis, a groundbreaking geo-distributed, multi-versioned transactional key-value store designed to overcome these limitations. Chablis offers a general interface accommodating range and point reads, along with writes within multi-step strictly-serializable ACID transactions. Leveraging advancements in low-latency datacenter networks and innovative DBMS designs, Chablis eliminates the need for compromises, ensuring fast read-write transactions with low latency within a single region, while enabling global strictly-serializable lock-free snapshot reads. Join us as we explore the transformative potential of Chablis in revolutionizing the landscape of geo-distributed DBMSes and facilitating seamless transactional operations across distributed environments.

Links:
CIDR'24 Chablis paper
OSDI'23 Chardonnay paper
Tamer's LinkedIn

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. This is the first show of 2024, so I guess a belated Happy New Year. I don't know if we're still doing that or not really. It's probably what, we're into February now, so I don't think we can really say Happy New Year anymore. But anyway, I hope you all had good New Years and Happy Holidays. So yeah, the usual reminder that if you do enjoy the show, please do consider supporting us through Buy Me A Coffee. It really helps us to keep making the show. Now onto today's episode. I'm really glad to say that I'm going to be joined today by Tamer Eldeeb, who will be telling us everything we need to know about Chablis, fast and general
Starting point is 00:01:00 transactions in geo-distributed systems. Tamer is a PhD student at Columbia University and is also a software engineer at Jane Street. So he's very, very busy, I can imagine. Great, great stuff. So Tamer, thank you for coming on the show. Oh, thank you so much for having me. I'm really excited. Well, let's jump straight in then. So I've given you a very brief introduction there, but can you tell us a little bit more about yourself and how you became interested in database management research? Sure, yes. So I grew up in Egypt. I studied computer science and then I moved to the US to start my career in software engineering. That was in 2011, 2012 timeframe.
Starting point is 00:01:54 And as fate would have it, I would start on the Azure storage team, which is kind of a distributed storage database team. And I found that I really enjoy that area. That same year, like in 2012, two very influential papers came out. So like during that time, I think the conventional wisdom was generally that like, you know, transactions are too slow. You,
Starting point is 00:02:29 we don't need them. We're gonna like build our, like all the cool, like no SQL stuff, you know, Cassandra and big table and dynamo. And like, there were both sorts of these things.
Starting point is 00:02:42 And then like in 2012, two papers came out, the Spanner paper from Google and also the Calvin paper in SIGMOD 2012 on deterministic database systems. And they both kind of showed that actually like, you know, transactions A are still, like, very useful. Like, even Google developers who are, like, you know, known for being high, like, highly skilled were struggling without transactions.
Starting point is 00:03:21 And, you know, here's a way we can actually like build this global scale system calvin kind of showed a like a radically different way that you can do this and i just i found like whole papers very exciting and it's like since since then have been just like keeping an eye on the on the area until i decided to start at phd myself uh in late 2019 early 2020 and i didn't like uh intend to work in that area specifically from the start but i i just like naturally found myself uh gravitating to the same area so awesome stuff yeah so going back on the uh on the as you said you must have been in there pretty early doors i guess then so like kind of how long how long have the team kind of been going and that sort of project been going when you joined them originally because that's kind of feels for like like kind of day one sort of stuff it wasn't really day one but
Starting point is 00:04:26 it was early so i think the entire azure thing started in 2006 the famous project thread dog and i think like azure storage had been in production for maybe a year or two before i joined or or something like that so it was you know it was clear that like this is becoming a huge thing it was a like like already like lots of data being stored and so on but it was also fairly early days uh so yeah yeah no that's really cool and also as well i'm sure a lot of our listeners will be familiar with the the sort of the influential spanner paper and the calvin paper as well that's uh that's really cool and we i just want to get us over because we spoke about this but off before we started recording i think it's quite funny as to why you started and to why you decided to pursue a phd during during
Starting point is 00:05:17 during covid and or during the pandemic so i'll let you tell the listeners what was the main motivation what you tried before and you're like, no, screw this. I mean, I, I tried baking like most people did during the pandemic, but I didn't find it very, very interesting. So basically decided to just continue working while also doing some research and it it it i like not not that i like necessarily recommend it in normal times but it worked out so yeah it definitely did because then when we've got this awesome paper as well to prove it so uh let's talk a little bit more about that then so um let's let's start off i guess with some a little bit more background set the scene for the for the for the listener a little bit more so so can you we're going to be talking about geo distributed databases today so can you tell us what they are
Starting point is 00:06:12 and kind of why we need them really sure so i think like the definition of a geo just like so uh the distributed database is a database that runs on multiple nodes. It's not like a single node kind of thing. And geo-distributed here means that these nodes span geographical regions. It's not like all in the same data center, for example, or the same cloud region. And the reason we need them, I think there's two main reasons. One is some or like many applications are like have users that are geo-distributed.
Starting point is 00:07:01 Like think about, I don't know know your twitter or your facebook or something like the users are all around the globe so you want to keep the data for the user near them so that they can have uh low latency access to it latency is is very important we are all used to like snappy uh app experience and like if the apps take too long to load, users just leave. Companies lose money. It's a pretty big deal. The other reason this is also pretty important is just disaster recovery. I'm sure your viewers are familiar with you know
Starting point is 00:07:47 aws region going down and the internet you know stopping to work for a bit or something like that so for a lot of like mission critical apps you really want to store like a copy of your of your data in at least like one more region than like where it's it's uh home is probably more um like hence like to protect from things like natural disasters that can like happen and take a uh it's centered down and so forth um so these are i think the two main reasons why you need geo uh distribution yeah it's such i mean we're so used to having this it's instantaneous sort of any application we use these days right we just want it to be we're conditioned for it to be like instantaneous right as soon as it's not like we're not gonna use this anymore so yeah there's a big thought and obviously as well the disaster recovery sort of angle as well i mean um yeah
Starting point is 00:08:48 that kind of speaks speaks for itself right i mean you think it oh one reason will be fine but yeah this whatever what can go wrong will go wrong right now if he's lost so yeah we need to cover our backs there so that's cool um so yeah so as you alluded to in when you were like um answering one of my earlier questions um you mentioned that distributed database research has been it's been quite a fertile ground. There's been a lot of work over sort of the last 10, 10, sort of 15 years, especially as people have kind of refuted this notion that transactions, we don't need them. And then I was like, hang on a minute. That's a really good thing. Let's have those. And how can we do those performance at sort of large scale? And so maybe,
Starting point is 00:09:26 you touched on the briefly mentioned Spanner and Calvin, but can you kind of give us a rundown of maybe some of the more state-of-the-art systems in that space? And some of the problems that they kind of have still. Sure. The way that I think about this, and like it wasn't very obvious
Starting point is 00:09:43 when I started doing research, but it turns out you can generally like categorize the state of the art systems into two buckets. Like there's fast and there is general, but there isn't both at least I like until now. Nice. Yeah. So in that general bucket,
Starting point is 00:10:08 I think like these are your systems that support, you know, traditional SQL. They have this unrestricted API. They give you very strong consistency semantics. And so things like Spanner or cockroach DBb or yugabyte or like um like they're all kind of spanner influenced systems and they are general in the sense that like i said they like give you uh transactions with all the semantics that you want you know serious serious here like serializability
Starting point is 00:10:46 consistency you can just like run full sql on those with all the features like there's no restrictions to the programming model but they tend to kind of uh be slow especially like if you have to do transactions that cross one partition. So, you know, like if you can partition your data in such a way that you can like only touch one partition per uh transactions you run the well-known two-phase commit algorithm or a variant of it like it's not exactly like like it's not always implemented in the textbook form but it's always kind of like one variant of it. And you take a hit because, you know, you have to do two RPCs and two rights to storage,
Starting point is 00:11:55 at least for every transaction. And that takes a toll. So that's the general systems. The fast systems kind of look at this and say, okay, two-phase commit is this really slow bottleneck and we really don't want to do it. So we are
Starting point is 00:12:15 going to design the system in such a way that we don't do it. And in doing so, you kind of lose what I call generality. Like you lose a property of the system that is fairly important. So one such property is like the easiest one is like, let's just not have any cross partition transactions at all. Like we will just have, like we will limit the transactions to a single shard.
Starting point is 00:12:48 Or we may let you do transactions that cross shards, but with weaker semantics. Or the deterministic systems, which are kind of, like like structured very differently, that they can be fast and they do support cross-partition transactions, but they restrict the programming model in various ways that are kind of fairly restrictive. Like you either have to, like so the code has to be deterministic,
Starting point is 00:13:24 which usually rules out things like conversational uh queries where you are like running a part of a query and then like look at the result and then like issue the other part and and so on and like they do things to like try and mitigate that, but in general, you really cannot support a SQL interface on it. You kind of have to come up with a different query language to fit the system or something like that. Nice, cool. Yeah, so I liked your trade-off there. Basically, you've got these two families of systems,
Starting point is 00:14:03 these two buckets of systems, the fast ones, but they make some trade-offs, weaker semantics, maybe for cross-partition transactions, restrict the programming model so we don't have the nice SQL that probably developers are used to. And then on the other side, we've got the general systems and they take a performance hit for that.
Starting point is 00:14:19 When we're talking about a partition here, are we talking about kind of, is the physical location of the machine kind of is the is the physical location of the machine kind of related to these pathogens are we talking about just partition in that context usually just means a single machine it's not exclusive to like geo distribution like even uh things within a single region have the same trade-offs right i'm just saying like once you have to like run across multiple machines, you usually have to either run two-phase commit
Starting point is 00:14:49 and suffer the penalty, or do something else and sacrifice some property about the system. Yeah, as soon as the network gets involved, things are going to get a little bit slower. That was the conventionalism now i i like i should know that like this category is more about like systems that primarily store the data on uh disk uh in memory systems like there's been another line of of research there like where you keep your data completely in RAM and you use things like RDMA and so on. And there you could have fast and general,
Starting point is 00:15:32 but it's also pretty expensive because it's all in the RAM. And really, I think the popularity of on-disk systems just show that really developers really care about about that too so yeah yeah exactly yeah especially when you when you start using something maybe this more kind of well fancy hardware or fancy way it gets expensive like you said for one and also as well it's not as generally available right like it's not as easy to use so there's definitely some um challenges there as well yeah cool so we know all the problems now but we've we've solved this trade-off, this fast versus general trade-off with Shably, right? So
Starting point is 00:16:07 can you give us the high-level elevator pitch for the system first? Okay, so I think with Shably, we are going after geo distribution, right? And we have two goals. One
Starting point is 00:16:24 is we want transactions that are local to one region to do global externally consistent lock-free snapshots without slowing down our regional rights in the system. So two goals. And they are in many ways like uh until at this point we're kind of conflicting you either oh plus i think like so like that's the the fast part the general part just speaks for itself you still want to be able to like you know run sql have an unrestricted api and like like all the nice general things now geo distribution is an even harder problem than normal uh distribution so like even systems that are like fast or general within the data center, when you deploy them as geo, they have even more trade-offs that they have to make.
Starting point is 00:17:54 So let's take Spanner, right? Spanner was famous because of its TrueTime API that lets you do externally consistent block-free reads, and that's a uh big deal like you can just like like do a global snapshot lock-free fantastic right but a it uses a specialized uh clock which is not as widely available. Although there are startups and cloud providers that are trying to make it more widely available, but still it's not as generally available as you would like. And the other problem is that because of these TrueTime API,
Starting point is 00:18:51 like every single write in Spanner has to wait out the clock uncertainty before releasing locks. So before committing, you have to wait out the clock uncertainty to generate version timestamps that are guaranteed to have certain properties. Now, this clock uncertainty bound is many milliseconds in the Spanner paper.
Starting point is 00:19:25 I think they have improved it recently. Like maybe it's like one millisecond or something like that. But it still means that basically you have to slow down every single write in the system to achieve the lock-free snapshot reads. Some other systems basically require you to run consensus across the globe for every single write you make, which also kind of like slows down every single write in the system
Starting point is 00:19:59 to let you like read lock-free. And finally, so, these are the slow versions, right? And then the fast versions are based on determinism, which, again, sacrifices the
Starting point is 00:20:17 generality of the programming model. So on top of the fast and general trade-offs for the distributed transactions, when you go geo, you also have to sacrifice speed for generality or generality for speed even more. There's more trade-off there. And WatchHablis shows that you actually don't have to anymore. And I think
Starting point is 00:20:50 the pitch here is that you can have local writes have hundreds of microsecond latencies and you can have snapshot free with external consistency lock-free in the same system.
Starting point is 00:21:09 So it's fast and general. Now, of course, if you have to run a transaction that spans the globe, that will be slow, but that's just not affordable. Yeah, there's some fundamental sort of limit to how fast such a transaction can go, right? But it sounds there. Yeah, continue.
Starting point is 00:21:27 Sorry. I'm just saying, but REITs, they can just go without impacting REITs at all. So that's the thing that we showed possible with Shibley. Nice. Awesome. So yeah, it really sounds like we can kind of have our cake and eat it here. So that's really cool. So I just want to just pull on one thread real quick.
Starting point is 00:21:49 And you've mentioned it a few times while you've been talking about this notion of external consistency. So maybe you can kind of give a brief sort of rundown of what that actually means in practice. Yes. So I think it's also known as strict serializability, although some people slightly distinguish between the two. I basically just treat them as the same thing. So serializability, I think, is well understood. It means that all the transactions you execute, the result is equivalent to some serial order. Okay, great.
Starting point is 00:22:30 But it doesn't make any guarantees about which serial order. And one example here is that like, basically like if you have a read-only query, you can always order it like as of the beginning of the database. You can always return null. And that would be still valid serializability because... It's a compound, but yeah, it's technically still valid. I didn't violate anything.
Starting point is 00:23:01 It's as if it happened before all the other rights that you have. So that's obviously not very useful. And that's where the consistency semantics come in. And like, these kind of tell you like roughly how stale
Starting point is 00:23:19 things can be. And the strongest semantics that you can give is like this external consistency and it basically means that like if you start to read you are guaranteed to observe all the rights that committed before you started that read in real time like you know if somebody so basically like i can commit a transaction and then like i don't know call you on the phone say hey you can read right now and you go and execute it and the system has to know that like to show you the results of my of my right because it committed before your read started. Nice.
Starting point is 00:24:05 So basically like, even if the two transactions do not coordinate beforehand, the system is not allowed to order one before the other. If one committed before the other started. It's got to respect the clock on the wall, right? It's got to. Yeah. Yeah physical time, basically. Yes.
Starting point is 00:24:27 Now, transactions that are that start overlappingly together may be ordered arbitrarily by the system, but if one committed
Starting point is 00:24:45 before the other one started it must be that the order will reflect that real time relationship and basically this is considered the gold standard semantics and I think Spanner was the first
Starting point is 00:25:03 geo system that achieves it but Kelvin and its successors also do that's the gold standard that with all these systems are aiming for I just on a quick aside I remember reading somewhere also somebody somebody told me this that the the obviously when serialized abilities defined there's no mention of actually like real time right of sort of the wall clock time but so most of that is like it was almost they didn't have to think about it because when they defined those semantics like the notion of a geo-distributed system didn't exist almost like everyone's on one box right so they they just got that for free then but then obviously yeah now obviously well that's right that's right like like when you think about like asset transactional semantics and so on, it was all conceived on a single box.
Starting point is 00:25:54 And so you got that for free because you are acquiring locks on things and that just orders things. things but once you have replicas and global distribution and stuff this becomes like this suddenly becomes a major source of uh like surprises for yeah that's somewhere yeah awesome cool right so let's dig into the details now of um of chevrolet but i guess before we do that we need to give some air time to its predecessor chardonnay and i'm getting a theme with the names as well so we need to touch on that at some point as well yes but yeah so like tell us about chardonnay and how that then led to chablis and start let's let's start filling in this the details here sure so um like i said like i have known about the spanner paper and the calvin paper and you know i've've been thinking about the tradeoffs they both make.
Starting point is 00:26:47 And I like conclude like, OK, it stems from slow two phase commit. And then at some point I was reading another paper in MSDI 2019, I think it's called the DRPC, where it says data center RPCs can be general and fast. And it shows that in bare metal data centers with modern networks and things like kernel bypass and so on, you can really have RPC latencies that are five microseconds in the data center. Like, okay, that's interesting. That removes kind of one
Starting point is 00:27:37 reason why two-phase commit is really slow. And then i came across a bunch of other like research on things like nvme or xenand or like store like really low latency storage that are also single digit microseconds and i was like wait a minute like it seemed like it sounds like we can actually have very low latency two-phase commit right now. Like, back of the envelope estimates, it's like, if these numbers actually hold, there's no reason why you can't design a two-phase commit protocol that finishes in 50 microseconds or something but as but a typical like ksd right is oh i'm like sorry uh typical ssd access is 300 microseconds like you know uh commodity ssd like you know like not the fancy xenand or obtain ones the one that that the cheap one that you want to use to store your data, it takes you like, I don't know, 300 microseconds to access. So suddenly the latency of one IO is actually higher than two phase commit. And I was like, Hmm, interesting.
Starting point is 00:28:57 So maybe two phase commit isn't the fundamental bottleneck that people think it is now. And what would you do? Isn't the fundamental bottleneck that people think it is now? And what would you do if that's the case? What would the system look like? Can you finally be fast and general? And I mean, no shockers here. The answer is yes. That's the result of the Stratton paper. It was focused on single data center. The idea was let's
Starting point is 00:29:27 make the assumption that two-phase commit is fast and design a system based on that. And what we wanted to achieve is not just low latency, but also the ability to handle high contention workloads. So, you handle high contention workloads so you know contention is where like one or a few records in the database become very popular that they get most of the access and it's something that's very unpredictable like it's very easy to like have an app where like you like the load is very like evenly balanced but then suddenly something becomes very popular and you couldn't know
Starting point is 00:30:12 before that and once you have heightened contention the slow systems basically the performance just like really drops really bad like either very high abort rates because of deadlocks or if you or like if you are using optimistic concurrency uh control things just like you know uh assume that data isn't gonna change and then they try to commit and and the data change and and they have to like restart. So short and A starts from the assumption that transaction IOs are slow, the network and the log are fast, and it does a few things.
Starting point is 00:30:56 One is it uses the fast RPCs to support the lock-free snapshot mechanism. And the way you do it is in Chardonnay, the key idea here is we have a service called the Epoch service.
Starting point is 00:31:18 It's a very, very, very, very simple service. All its job is maintaining a single counter. That counter is not incremented by transactions. It spontaneously, like on a timer, just advances. All right? So it's not a sequencer in the traditional sense
Starting point is 00:31:42 where transactions would go and get a number and that determines the ordering. No, it's just kind of a clock. It's very coarse grained. Because transactions are only
Starting point is 00:31:59 reading it, we can actually make it distributed and scalable. It's not a bottleneck. And that's a key difference from sequencer-based designs. But the key point here is that transactions, they run in very much the classic way transactions run in a shared nothing system. This is the root of the name Chardonnay because A, it's a sharded system. Sorry. Nice.
Starting point is 00:32:34 Nice. Yeah. That's good. And B, it's an architecture that's very classic that we think has a aged like fine wine. Oh, it's vintage yeah nice so in in in chart rate transactions run using two-phase locking you know like these are things that were designed in the 80s or something like by jim gray and phil bernstein and people uh like that very like very classic design shared nothing you read the like you you figure out okay like i need to read uh this key then i'm gonna go to the partition server that
Starting point is 00:33:14 that has that key i'm gonna take the lock on that key in the end i'm gonna like run two-phase commit to make sure that everything is atomically committed. But we use very fast RPCs, so then the network time is minimal compared to the I.O. time. And
Starting point is 00:33:37 during running the transaction, sorry, during running the commit protocol, you just read the epoch. You know, from the epoch service, you issue an RPC in parallel, you get the value, whatever value that was, is the value that you use to, like, to version your records. And the system just maintains the property that a transaction that reads an epoch value 5
Starting point is 00:34:13 is ordered before a transaction that reads an epoch value 6. And that's very easy to guarantee. We talk about the details in the paper, but it's easy to just make sure that that happens because transactions still acquire locks and they are ordered. So when a transaction finishes execution
Starting point is 00:34:44 and it reads a value of the epoch, say five, that means that it cannot have depended on a transaction that read a value four. And why is that? Because it has the locks. If a transaction, I'm sorry, value six. And because it finished execution before reading the epoch, it knows that all its dependencies finished and read the epoch and the value has been five or less. It couldn't have been six.
Starting point is 00:35:22 Does that make sense? Because the epoch is monotonically increasing pilot system in real time if you read a value five that like that means that anything you depended on had a value of five or less it couldn't have been six because you read the latest value right now okay yeah so so that's nice because that means that the epoch boundaries are consistent points in the order and you can read their lock free ah okay so you you do your like your lock free snapshot reads at an epoch version basically but i guess the one before the the one that's the active epoch
Starting point is 00:36:06 at the moment is that how that would work that's basically right that's basically exactly right okay so it reminds me very similar of a scheme like epoch-based memory reclamation i can't say the word correctly reclamation there we go that's the word and a similar sort of thing right about like kind of you get so far in advance of this epoch that nothing can have a reference to that epoch, something that in the previous bin, basically, so you can garbage collect that. So it's a similar sort of concept, I guess. It is very similar concept.
Starting point is 00:36:34 And it was inspired by a paper called Silo. It's also, it's a single node multicore database, but we showed how to basically distribute it. Awesome. Cool. But, yeah, and the nice thing is if you want the property of external consistency, meaning like any transaction that started before I did, sorry, that committed before I did, I want to observe it.
Starting point is 00:37:08 All you have to do is wait for the epoch to advance once. Then you read as of then, as of the, like just before the new epoch, right? Because you know that any transaction that committed before you started has an epoch, say, seven. So if you read everything that has an epoch seven, then you're golden. Nothing could have committed with a lower epoch.
Starting point is 00:37:42 Very simple. Now, there are some technicalities when you are reading that we talk about in the paper, but that's basically it. Use the epoch to coarse-grain version the transactions. So transactions can have the same epoch, right? And you would have to wait until after the epoch to read. But the epoch boundaries are the points where you do your lock-free snapshot reads.
Starting point is 00:38:14 Nice, cool. We'll probably, I guess, touch on this. Obviously, we've got to talk about Shably and everything else, but like kind of the, how we actually choose the time between these to increment these epochs. Maybe we can touch on that when we get to the results later on.
Starting point is 00:38:28 Yeah, I think that's a really good question. And in Chardonnay, it's not super sensitive. You just want it to be large enough compared to the transaction execution time, but small enough to not have the users wait for too long to get the external consistency. So we set it to like 10 milliseconds or even five. We found it works well enough.
Starting point is 00:38:58 Nice. Cool. So I know kind of now talking about Shardly, there were some challenges about taking this concept of Shardner in sort of a single data center and then kind of going to distribute this. So can you tell us what these challenges were? And then we're going to talk about how you overcame them. Yes.
Starting point is 00:39:17 So basically, the problem here is that every committing transaction has to read the epoch, which is fine if all the nodes are in the same data center. But once you go geo, the question is, where do you put the epoch service? If you put it in one region, then all the transactions from the other regions have to take a cross-region RPC to read the epoch, which is bad because that would slow down the transactions in the other regions. And if you try to replicate it across regions, a read from the epoch service still physically has to be a consensus read,
Starting point is 00:40:05 because you have to get the absolute latest value. So then you're also running cross-region RPCs, which also slows down all your transactions. But remember, in Chablis, we want transactions that are local to a region to have Chardonnay regional latencies. We are talking 100 microseconds. Yeah.
Starting point is 00:40:29 So that's the conundrum. Where do you put the epoch service? And it turns out the solution is fairly simple. Because of the way the epoch service works, it turns out you can basically break it into two. In every region, there is a local service we call the epoch publisher, and this looks to the nodes in that region like the epoch service looked before.
Starting point is 00:40:57 It is the thing that they talk to to read the epoch. And then there is one central thing, which can live anywhere, that's actually responsible for advancing the epoch. So the way this works is we have this global epoch service. It would go and say, okay, now the epoch is five. It goes and pushes that value to all the publishers that are local to every region. And it does not advance the epoch again until all the publishers have gotten the new value. Now, the publishers are designed to be replicated, highly available services. So you're going to ask, what happens if we lose one of these publishers, right? A node going down, that's fine,
Starting point is 00:41:49 because it's a replicated thing. Okay, cool. But what this implies is that the value at every publisher can either be equal to the true epoch or one less, right? Yeah. But it turns out it's really easy to fix the algorithms to take account of that fact.
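The publisher/global-service split can be sketched as a toy Python model. The class names and the synchronous `push` are invented for illustration (in the real system these are replicated services talking over RPC), but the invariant it demonstrates is the one from the discussion: the global service advances only after every publisher has the new value, so a publisher's value is always the true epoch or exactly one less.

```python
class EpochPublisher:
    """Per-region epoch publisher. Local nodes read the epoch from it
    with a fast intra-region RPC; in reality it is itself a small
    replicated, highly available service."""
    def __init__(self):
        self._epoch = 0

    def push(self, epoch):
        self._epoch = epoch  # called only by the global service

    def read(self):
        return self._epoch

class GlobalEpochService:
    """Central service (can live anywhere) that actually advances the
    epoch. It does not advance again until every regional publisher has
    acknowledged the new value."""
    def __init__(self, publishers):
        self.publishers = publishers
        self.epoch = 0

    def advance(self):
        self.epoch += 1
        for p in self.publishers:  # cross-region pushes; all must ack
            p.push(self.epoch)

def wait_for_advance(publisher, global_service):
    """'When in doubt, wait for the epoch to advance': once the local
    value moves past what we first saw, every publisher everywhere is
    at least at the value we first saw, since none can lag the global
    epoch by more than one."""
    seen = publisher.read()
    while publisher.read() <= seen:
        global_service.advance()  # stands in for the periodic advancer
    return publisher.read()
```

Read-write transactions only ever call the local publisher's `read`, which is why they keep their regional latencies.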
Starting point is 00:42:12 And I don't want to go into too much detail, but the trick here is that when in doubt, you just wait for the epoch to advance, and there will be no doubt. Okay, so we just wait to get one more ahead, and then we know for sure that we're all good. Because, I guess, everyone can only be at the latest one
Starting point is 00:42:34 or the one before. So if you then get to the latest one, nobody can be at the one before that, so then it's all good. Absolutely. I see, I see. Nice. And once you do that, suddenly you can read lock-free in a geo-distributed fashion, all the while read-write transactions just read the epoch from the local thing using fast RPCs. They don't block, they don't wait on the epoch to advance, nothing. They keep the exact
Starting point is 00:43:06 same latencies as before. You didn't need any fancy clock hardware, you didn't need GPS clocks or anything like that, you didn't make any assumptions about the maximum clock skew. None of that. You just need one number. Just one number. And it turns out it's easy to scale and maintain that one number. Like I said, you have to make sure that that one number is replicated. Managed correctly, yeah. It's durable and all that,
Starting point is 00:43:38 but it is not a scalability bottleneck, because you can have as many of these publishers as you want, basically. And it's not a latency bottleneck, because you read it using kernel-bypass RPCs. Even on public cloud, not bare metal, you get today 20-microsecond latencies or so. Yeah. So this is something you can have even on cloud today, nothing fancy,
Starting point is 00:44:15 no clock hardware, nothing like that. So fast and general, which was the goal. Yeah, mission complete. So let's talk some numbers then, I guess. Yeah. So can you tell us about your experiments with Chablis and kind of what the results were? Yeah, we ran a very simple experiment: we ran the YCSB benchmark to measure regional latencies. I mean, memory doesn't exactly serve me right, but I think a single write would take below 100 microseconds; on Cloud Spanner it would take many milliseconds. And then we ran the snapshot read, the lock-free snapshot read, with external consistency and so on. And to clarify, this is a deployment across the US, so we have a region in central, in east, and in west US. Yeah. And you can take a snapshot across those,
Starting point is 00:45:28 I think, with external consistency and everything. And if memory serves right, I think it was like 80 milliseconds, something like that, compared to Spanner, which is like 60 milliseconds. So it's a bit slower than Spanner, because you have to wait for the epoch to advance, but it's comparable, while our write latency is like an order of magnitude faster. Nice. So yeah,
Starting point is 00:45:56 I'd count that as a win. I hope people will agree with that. Yeah. Awesome stuff. So I always like to ask this question as well. Are there any sort of situations? Obviously we've kind of ticked these two boxes of fast and general. But are there any cases, or any use cases, where the performance of Chablis may be sort of suboptimal?
Starting point is 00:46:21 I guess I'm asking here kind of what are the limitations of the system? Sure. Compared to something like Spanner, I don't think so, to be honest. Okay. But compared to something like Calvin, or its geo-distributed successor called SLOG, they're able to handle cross-region writes more efficiently. If the system has lots of cross-region writes, the performance
Starting point is 00:46:56 of Chablis will be fairly worse than SLOG's. It can be sensitive to that aspect of workloads, I guess. So yeah. Yeah. So basically we make a kind of strongish assumption that you don't do cross-region writes that often, but you do care about cross-region reads. If that's not really true, then compared to something like SLOG, the system would perform worse.
Starting point is 00:47:28 Okay. But obviously, these are the different trade-offs with Slog, right? As you said, alluded to earlier on, it falls into that fast category, but then we're probably giving up some flexibility. You will lose, like, you know, the... Generality of... Programmability. Yeah. Yeah.
Starting point is 00:47:43 Nice. Cool. So, I guess, yes. So where do you go next with Chablis then? So what is the next step? Is there a next step? There is. There is a next step.
Starting point is 00:47:51 And I think so far, in the paper, we kind of assume that the geo-partitioning of the data is static and known, but we want the user to be able to run a transaction that includes: move this piece of data from Europe to the US. Because say you are a user of Facebook and you were living in Europe and then you moved to the US, the app will want to move the data with you, right? So working on geo-partitioning, and making that a first-class part of the system while supporting fast and general transactions, I think that's where we are going next.
Starting point is 00:48:43 Nice. This kind of concept of ownership, I guess, of having the data move with you, reminds me, just on a tangent, of a system called Zeus that did something similar, maybe. I don't know if it's on your radar, and maybe I'm getting it confused, because I've been reading all these systems lately. Yeah, yeah. So yeah, there are problems like, okay, how do you figure out where the data item is without running a geo-distributed query? So there are all sorts of fun problems here that we are thinking about. So that's
Starting point is 00:49:27 certainly one area that we are going for next with Chablis. Yeah, I'd be really interested to see that, because there's a whole interesting space, I imagine, of things to tackle. So that's cool. And yeah, so my next set of questions are sort of more higher level and general. So yeah, the first question is kind of what impact do you think your work on Chablis can have? And as sort of a software engineer, developer, data engineer, et cetera, how do you think I can kind of leverage
Starting point is 00:49:58 those sort of findings of your work? So I think there are two things here. One is the epoch versioning, and having the service and so on. I think it can be useful in a lot more contexts, like, you know, ordering events and stuff like that; it's something that comes up in a lot of contexts. So yeah, I think people can look at this and realize this is exactly the level of ordering that I need. I don't have to order every single transaction or event in the system with regard to every other one, but if I can group them after
Starting point is 00:50:50 the fact, opportunistically, using epochs, and then use the epoch boundaries to read, I think this is an idea that probably has broader applicability.
Starting point is 00:51:09 So that's one aspect. The other is just, I really do think that the databases that we have right now were designed in some ways for a different platform, a different hardware, a different era, and there is a potential to just have something that's straight up better. And I'm hoping that that kind of research encourages people to revisit the assumptions, you know, something like, oh, two-phase commit is this horrible bottleneck that we can't solve, or, oh, we need these fancy clocks to do geo
Starting point is 00:51:52 lock-free reads, like that kind of thing. I'm hoping that people can look at this and realize, oh, these assumptions no longer hold, and we can build next-generation systems that are better. And, you know, to facilitate that, I think we are working on releasing Chablis as an open-source thing. So there's still some work there to move
Starting point is 00:52:20 it from research-quality code to an actual production-grade thing people would want to use, but it's something that will hopefully happen soon. Awesome, that'd be great to see how that progresses. Well, good luck with that, I hope it's a successful endeavor. Cool. And awesome, yeah, so I guess you've worked on this... I mean, how long were you working on this? When did the Chardonnay project start? I think Chardonnay started in late 2021 or early 2022, something like that. It got rejected once before it got accepted, so, you know... It's a rite of passage. I mean, someone told me once, I'm sure it's the Raft consensus paper, which obviously has been extremely influential, I'm sure that got rejected a couple of times before it was finally accepted. So yeah,
Starting point is 00:53:13 I mean, it's a lottery sometimes, right? So yeah, a new system, but yeah. Yeah, thankfully it was well received from the start, but certainly, yeah. Yeah, so the question I'm leading up to is, across this journey of working on these systems, what's the most interesting lesson that you've learned? I mean, I really just think that there is a lot of potential in looking at the conventional wisdom and not accepting it as true. Now, it's not useful to just be contrarian; I think that's just not a good thing. I think it's important to understand why the conventional wisdom is the conventional wisdom. Like, one thing that really
Starting point is 00:54:05 drove me mad for a bit was people saying, oh, transactions, they don't scale. You know, okay, but why? And really, they didn't really have... How have you arrived at that conclusion? What's your thought process? Sure, you tried running, I don't know, MySQL, and it didn't scale, but that doesn't necessarily mean that. And so it's important... What I would say to people who are thinking about starting their own research journey: it's really important to understand why the decisions were made in certain ways, because even though they may have been the best decision at the time, if you don't understand what went into them, you can miss out on opportunities where it's no longer the case. Like, you know, two-phase commit was genuinely a problem. The people who tried to avoid it weren't misinformed or stupid.
Starting point is 00:55:09 It was really, really a problem, right? So it just so happens that I happened to stumble upon the fact that it's no longer an issue, and I can build a system based on that. The other lesson, which is kind of more practical, or more concrete, is that when you are dealing with on-disk systems, multi-core designs and single-data-center distributed designs can look surprisingly similar, and you should look for ideas in one to apply to the other and vice versa. So the whole epoch thing was in many ways an extension, or an application, of the multi-core Silo epoch. And it's very counterintuitive.
Starting point is 00:56:08 Like, you would think, okay, this can only work in a multi-core, single-node thing. But no, because the latency of disk is much higher than the network, suddenly a lot of the things that made sense in multi-core can make sense in distributed systems, and vice versa, I suspect. So that's just a thing that wasn't at all obvious to me when I started, but it's something that's always in my mind right now. Whenever I'm designing something, I'm like, let me see what the multi-core people did about that, and maybe I can steal an idea or two. Yeah, I like that. Oh, that's an interesting observation. I'm definitely going to keep that in mind as well. And I liked what you were saying about kind of when you meet this conventional
Starting point is 00:57:02 wisdom and like challenging the assumptions or reviewing those assumptions that were made to kind of get to that conventional wisdom because you know what the world changed right things change so maybe those assumptions are no longer valid assumptions anymore the the ball games changed but yeah no i really like that i'm going to jump over the next two questions just for the just for the interest of time to get to my favorite question, which is about the creative process. And yes, I kind of want to know here, Tamer, how you go about generating ideas. And then once you've sort of generated a whole bunch of ideas,
Starting point is 00:57:35 like selecting things to dedicate two to three years, or however long, to working on. I don't know. I'm very chaotic here; I wouldn't say I have a process. But I think, to me, the way I try to think about problems is to really try to decompose the problems and figure out why they are problems. Like, it's a problem because we do it that way. Okay, fine. But if we try to change this very small thing, how can I take that, and does it even make sense, and so on. So, for example, I was
Starting point is 00:58:14 like trying to figure out how to do block free reads in chardonnay and i kept like banging my head against the wall for a few weeks trying to like you know change the right algorithm to make it work like in such a way to felt like it just like nothing really worked but but then i remembered the silo paper and i read it and i was like you know what this epoch thing is interesting why can't it work in a distributed environment like and then i kept like and then i couldn't like there were two challenges that we solved but there was no fundamental like i i was just like trying to kind of prove by contradiction that it cannot work in a distributed environment like and then i couldn't prove right you know so at this point i was like
Starting point is 00:59:05 maybe it can be solved maybe it's just and then you sleep on it for a while and like one day you get the epiphany so that's kind of like how i like to do things really think about the problem decompose it read papers like i think just like in some ways there are very little new ideas under the sun it's really just like yeah being able to recommend like recognize something from a different context and apply it in a new context can be very valuable so that's really part of my creative process is just like you know see if there's if someone smarter than me figured something out and see if i can just like you know cleverly apply it and uh make it work in a way that it didn't work before so definitely standing on the shoulder of the giants is like that's the exact phrase i
Starting point is 01:00:06 had in my head when you were talking then as well yeah that's funny uh like we like trying to like reinvent the wheel which is i mean to be fair research is in many ways rewarded on very strong novelty so that's kind of like yeah the the the incentives are structured in that way but to me i really like solving an important problem a lot more than coming up with a cool technique that may or may not be uh that useful that's just uh yeah no it's a fantastic answer to that question and it's another one for my collection and so thank you for that it's great i love to see how everyone everyone has a different answer to that question it's it's it's it's um it's always interesting to see what people people say and kind of get an insight into how how you work how your brain ticks and how what makes you tick so yeah awesome stuff cool so it's then the last
Starting point is 01:01:05 question now so what's the one takeaway you want a listener to get from this podcast episode today uh transactions they can't scale they can be fast they can be general don't like you should expect more from your dbms because it can do more for you i love that what a great message to end on thank you so much it's been a fantastic chat thank you so much for having me brilliant where can we find you on on socials are you on any of the platforms where the listener can go and go and connect with you or anything like i'm not really i think linkedin is is the best way okay cool we'll drop it we'll drop that in the show notes and we'll put links to all of the with you or anything like that? I'm not really. I think LinkedIn is the best way. Okay, cool.
Starting point is 01:01:47 We'll drop that in the show notes. We'll put links to all of the work, and I think we spoke about it today in the show notes as well, so the listeners can go and find that. But yeah, thanks again. And just a quick reminder to the listeners, if you do enjoy the show, please consider supporting us
Starting point is 01:01:59 through Buy Me A Coffee. It helps us to keep making the show. And yeah, we'll see you all next time for some more awesome computer science research.
