Disseminate: The Computer Science Research Podcast - Tobias Ziegler | Is Scalable OLTP in the Cloud a Solved Problem? | #23

Episode Date: February 20, 2023

Summary: Many distributed cloud OLTP databases have settled on a shared-storage design coupled with a single writer. This design choice is remarkable since conventional wisdom promotes using a shared-nothing architecture for building scalable systems. In this episode, Tobias revisits the question of what a scalable OLTP design for the cloud should look like by analysing the data access behaviour of different systems. Tune in to find out more! Links: Paper | Website | Email | Twitter | Google Scholar. Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host Jack Wardby. Today we have another installment of our CIDR series and this is a special episode today because it's our first returning guest and so I'm happy to say we've got Tobias Ziegler back on the show who will be talking today about his paper, Is Scalable OLTP in the Cloud a Solved Problem? And for those who didn't listen last time, Tobias is a PhD student at the Technical University of Darmstadt. And so, yeah, Toby, thank you for coming back on the show. I'm really happy to be here again. Brilliant.
So I don't know whether I asked you this one last time, but we can start off with this. Can you maybe tell us a little bit more about yourself, for those who don't know you, and explain how you became interested in database research? Sure. So my name is Toby, as you said, and besides DBMS research I'm actually a pretty big coffee and sports fanatic, so I like to do both of those. And I'm in the process of finishing up — I'm currently writing my thesis, so I will actually not be a PhD student anymore in a bit. And how I got interested in databases was actually a lucky coincidence, because we had two courses in my bachelor's. And the first one was actually super boring. So I learned a lot about entity relationship diagrams
and SQL and how to model databases, and I was not really interested in that at all. And then my advisor actually taught the second course in Mannheim back in the day, and that got me really interested. It covered, in terms of databases, how optimizers work, the different indexes — the full stack. And why it was a coincidence: it was actually planned that the first teacher would give the course again, but he got sick, and then my now-supervisor jumped in. So it was basically just coincidental, and then I ended up as a PhD student. Wow, one of those sliding doors moments, right, where life could have taken a completely different path if that hadn't happened. Exactly. A blessing in disguise. Yeah, that's really cool.
So let's talk about the paper then. There's the saying, right — I think it's Betteridge's law — that any headline that ends in a question mark can be answered with the word no. So I'm guessing the answer to 'Is Scalable OLTP in the Cloud a Solved Problem?' is no, right? Yeah, the answer is no. Can you give us the elevator pitch for this work and some background to it? So as you actually said, the answer is no, but there's actually a path to getting there. And the title is essentially a question we asked ourselves.
Is it actually a solved problem? Because we were actually not sure, right? There has been database research for quite some time now — we've had distributed databases since, I don't know, the 70s, 80s, or even 60s. And for a long time, the shared-nothing database was the de facto standard when building distributed databases. But then the cloud came along, and the expectations of customers influenced design decisions a lot. For instance, almost every database is now disaggregated. And the reason is customers expect some flexibility,
elasticity, but they also store huge data sets now. Quite recently, with the Socrates paper, I think Microsoft extended their storage offering from 4 terabytes to 100 terabytes, which kind of indicates that there are some new workload characteristics going on. And when we look at OLAP, this has been the case almost from the get-go in the cloud: scalability is an inherent property of distributed analytical databases. But for OLTP, it's actually not so trivial to achieve scalability, because it's very latency sensitive. That means distributed OLTP databases have not been as performant as local databases,
and customers would expect similar performance to their local on-premise database when going to the cloud. So it was actually not an easy question to answer. Yeah, it sounds like it was a difficult question. So can we maybe talk a little bit more about the classical approaches — how people have gone about designing distributed OLTP databases — and tell us a little bit about what this classical taxonomy was and what the problems were with it?
Sure. So with the classical taxonomy, we refer to a paper which was published almost 40 years ago by Michael Stonebraker, one of the grandfathers of databases. The paper is called 'The Case for Shared Nothing', and, surprisingly enough, he made the case for shared-nothing databases. But he also characterized shared storage, shared nothing, and shared memory — which nowadays is basically every multiprocessor, so that one is not really that interesting anymore.
But shared nothing and shared storage were actually very influential and are still very often used as terms. But in these 40 years, many things changed. One of the things, as mentioned, is that the cloud came along, and fast networks. And as I mentioned, many cloud databases are now disaggregated — basically, almost everyone uses shared storage under the hood. And the connotations of those terms make it hard to talk about scalability, because when you say it's a shared-storage system, many people implicitly assume that every node can access the
shared storage, as the name suggests, but also modify it. That's actually not the case anymore, right? There are many cloud databases, like AWS Aurora or Azure SQL Hyperscale, which use shared storage under the hood but still have only a single read-write node. And that's actually the main point: the taxonomy is quite hard to use for scalability, because we have these different images in mind when we talk about shared nothing or shared storage, and the cloud has blurred those lines quite heavily. Conversely, when we talk about shared nothing, we always assume that only one node can access its private storage — but there are also multi-master architectures, which allow multiple nodes to perform updates. So there are actually quite a few outliers. So yeah, we've got this sort of taxonomy that doesn't fit reality anymore. I know a big contribution in your work is proposing a new taxonomy for thinking about these OLTP system designs, and it's based on the data access path, right? So what was the reasoning behind this approach, and can you maybe tell us more about this taxonomy? Yeah — so, as the title suggests, we are interested in scalability.
And we asked ourselves: what is the important characteristic for OLTP scalability? In the end, everything boils down at the lowest level to data accesses — are reads and writes scalable, and how are they scalable? And to
answer these questions, we needed to think a bit differently about architectures. In terms of the data access path, we focused specifically on the read and write paths. So we proposed three different data access archetypes, which are kind of prototypical architectures into which we can characterize different existing designs. And the first one is single writer,
Starting point is 00:08:10 which means, for instance, AWS Aurora, where we have only a single read-write node, but we can have multiple read-only nodes. So those are called replicas. Then we have partitioned writer, which has multiple writers, but they work on a partitioned data set. So everyone has their own autonomy over their own part of the partition or part of the database
and can modify it. But it also needs some coordination to glue these things together — two-phase commit or some similar protocol. And then we have shared writer, which actually uses shared storage, but the nodes can concurrently modify the contents of the shared storage, and read and write from it.
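To make the coordination Tobias mentions for the partitioned-writer archetype concrete, here is a minimal two-phase commit sketch. It is purely illustrative — the class and method names are invented for this example, and real protocols add timeouts, logging, and recovery on top of this happy path.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """One partition owner; it only commits what the coordinator decides."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.committed = False

    def prepare(self, txn: dict) -> Vote:
        # Vote NO if the local partition cannot apply its part of the transaction.
        return Vote.YES if all(v is not None for v in txn.values()) else Vote.NO

    def commit(self) -> None:
        self.committed = True

    def abort(self) -> None:
        self.committed = False


def two_phase_commit(participants: list[Participant], txn: dict) -> bool:
    # Phase 1: ask every partition to prepare (one network round trip each).
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: commit only if everyone voted yes, otherwise abort everywhere.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


nodes = [Participant("partition-a"), Participant("partition-b")]
print(two_phase_commit(nodes, {"key1": 1, "key2": 2}))  # True: both voted yes
```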
Starting point is 00:08:58 So those are basically our three data access archetypes. Nice. So how does this taxonomy perform better on categorizing existing systems then? How does, like, what are the benefits? Because, I mean, a lot of these kind of sound a little bit similar to kind of what we had before. Like, how does it plug the gaps, I guess, is what I'm trying to ask.
Starting point is 00:09:19 Yeah, exactly. So what we did in the paper, we kind of had four prototypical workloads at uniform reads, uniform writes, skewed reads, and skewed writes. And what we wanted to figure out is how those systems like on a design level perform. So we kind of abstracted away from like implementation details and all these kind of low-level details
Starting point is 00:09:41 and asked ourselves, what is the asymptotic scalability? And here we mean, if we look at the prototypical workload, for instance, uniform reads, can this architecture solve those things in a scalable manner? Can we add more nodes and we increase the performance? A bit like big O notation, a bit. And what we wanted to figure out is for those four workloads, which are kind of prototypical, what kind of architecture performs the best in which sense and why? And what are the drawbacks of every architecture?
And why they are different: it's not really a different taxonomy, I would say — it's orthogonal, right? Instead of focusing on the storage location, we focus on the data access path. It's not actually meant to replace the old taxonomy; it just highlights different properties. Okay, cool. So that makes sense — it's an additional framework rather than a replacement. Cool. So a big key thing — you've mentioned it a few times so far — is that latency is important.
Why is this? Can we maybe elaborate on the importance of latency in these different system designs? Yeah. So I think latency in general is important for OLTP because OLTP is known as a latency-sensitive workload. That means if latencies rise a bit, performance is immediately reduced. Let me phrase it differently: if latency rises, throughput basically decreases, because we're interested in per-transaction latency — and if that goes up, customers will immediately complain about performance. And the challenge is that, because most systems are already disaggregated, that
means we already incur some network overhead, right? But to mitigate it, we now have caches everywhere, like local caches for the buffer manager and whatnot. So, to tie back to your question of why it's important: it is immediately visible to the customer. And we mostly focused on throughput, but we have an additional discussion on latency. And there are multiple ways of reducing latency in the cloud.
We can either push down computation to the storage, or we can use caches to avoid network round trips. We've spoken about the importance of latency and how we need to consider latency and throughput together, and we've got our new taxonomy here. So how do we go about answering the
title of the paper, then? What is the blueprint for a cloud OLTP database management system? Yeah — because we want a fully scalable system, we consider both writes and reads. And because we have these four prototypical workloads, we looked at uniform reads, uniform writes, skewed reads and skewed writes. And obviously skewed writes are super hard to solve — they're basically impossible to solve if you have consistency guarantees — so we ignore those for now.
Starting point is 00:12:50 But then we have skewed reads. And skewed reads are actually quite important because if you think about it, indexes and stuff, they can be actually quite skewed, right? You always look up some specific key or things. So that can happen quite often. And we analyzed different workloads
in certain databases. And what came out is basically that they are actually quite skewed, or they can be quite skewed, and there are actually a lot of read-only workloads or queries. And to build such a system which is scalable for skewed reads, the question is now: how can we do this? One of our archetypes is shared writer, but there are two flavors of it. There's one which basically always goes through to the shared storage, which means for every access you have a network round trip, which obviously incurs latency. So this is not so optimal for latency-critical workloads.
But another thing which is also quite tricky there: if you have skewed reads and every node accesses this one single storage node, then it clearly gets overloaded, right? You cannot handle the load. You can replicate it, but that is kind of orthogonal — you can always replicate. So the question is now: how can you build a system which can handle skewed reads? The answer we came up with to this question is that we want to have caches at the compute level. So we cache the hot read items, but then the question becomes: if an item gets updated, we need to somehow invalidate the cached copies. And to really pull this off, we need some coherency protocol. So our blueprint for a scalable OLTP database is basically a shared writer where every node can actually access and update the storage, with a coherent caching protocol on top. It's actually very similar to MESI — a cache coherence protocol which invalidates elements that get updated, to avoid inconsistencies. So obviously you have been working on this line of work for a while, and I know in episode nine you came on to tell us a little bit about ScaleStore.
Starting point is 00:15:10 So I'm guessing this is a first step in that sort of direction. So maybe you can use this opportunity to tell the listener a little bit about skill. I always got it wrong last time as well. I always want to call it skill store. I don't know why. It's Scale Store, right? Yeah. Exactly.
Yeah. ScaleStore. There we go. Yeah. So maybe you can tell us a little bit about the design of ScaleStore. And obviously the listener can go and should go and check out episode nine as well. So ScaleStore is actually a distributed storage engine. And at the core
is the coherency protocol. As I mentioned, it's very similar to MESI. Basically, think about a cache coherence protocol like the one used in multi-core CPUs. We use it
Starting point is 00:15:57 for the buffer manager. So, meaning we have hundreds of gigabytes of memory per node and if you want to update some page, we can invalidate it at other nodes if they are cached to keep basically data consistent, right?
Starting point is 00:16:10 We don't want to have inconsistent data because we're a database. That would be bad for us. So we basically use an invalidation-based protocol to update or to keep data consistent. But the benefit is, right, it's like everything that you can express in pages, you can put in ScaleStore because it's basically a page-based design. Instead of a cache line, we have pages.
Starting point is 00:16:32 We keep those consistent. And as a programmer, right, you can just use it as an abstraction. It feels like a single node. You can basically put any data structure in it. We, for instance, use the B-tree, but also we have implemented linked lists and everything. So you can basically put it in and because it's like this nice abstraction, you get for free basically that it's distributed. So everyone
can basically cache the inner nodes, right? So we have an index, a B-tree, and the inner nodes are actually rarely invalidated, but they are very read-heavy — basically hot for reads. That means every node can now cache the inner nodes and access them very efficiently, at memory latency. And then you also get the benefit that you can handle out-of-memory workloads, because ScaleStore evicts cold pages to SSD to be very cost efficient. That's basically the rundown.
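As a rough illustration of the invalidation-based, MESI-style page coherence idea described here, the sketch below keeps a per-page directory of which nodes hold a copy and invalidates those copies before a write. This is a toy model, not ScaleStore's actual protocol, and all names are made up for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SHARED = "shared"        # many nodes may hold a read-only copy
    EXCLUSIVE = "exclusive"  # exactly one node may hold a writable copy


@dataclass
class DirectoryEntry:
    mode: Mode = Mode.SHARED
    holders: set[int] = field(default_factory=set)  # node ids caching this page


class PageDirectory:
    """Toy directory for an invalidation-based (MESI-like) page protocol."""

    def __init__(self) -> None:
        self.entries: dict[int, DirectoryEntry] = {}

    def read(self, page_id: int, node: int) -> None:
        """Grant a shared copy; an exclusive holder is downgraded first."""
        entry = self.entries.setdefault(page_id, DirectoryEntry())
        if entry.mode is Mode.EXCLUSIVE and entry.holders != {node}:
            # The exclusive holder must write back / downgrade before sharing.
            entry.mode = Mode.SHARED
        entry.holders.add(node)

    def write(self, page_id: int, node: int) -> set[int]:
        """Grant exclusive ownership; return the copies that must be invalidated."""
        entry = self.entries.setdefault(page_id, DirectoryEntry())
        to_invalidate = entry.holders - {node}
        entry.mode = Mode.EXCLUSIVE
        entry.holders = {node}
        return to_invalidate


# Example: nodes 1 and 2 cache page 7 for reading, then node 2 updates it.
directory = PageDirectory()
directory.read(7, node=1)
directory.read(7, node=2)
print(directory.write(7, node=2))  # {1} -> node 1's copy gets invalidated
```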
Nice, yeah, that's a good summary of the system, and like I said earlier the listeners should go and check out episode nine to learn more about ScaleStore. So with this as, I guess, one step towards answering this question, there are numerous other research opportunities that you mentioned in your paper, so let's run through them one by one and you can tell us why they're hard and what the direction might be for solving them. Let's start off with caching and eviction. Yeah — so basically, maybe we should touch on them briefly, and then I'll go deeper into caching and eviction and maybe concurrency. Sure, yeah.
You can tell us all about all of the problem opportunities that fall out first and we can go through them one by one. Sure, yeah. So caching and eviction, right — they go hand in hand. Because if we cache data — imagine we have a read-only workload — we start caching a bunch of data, and at some point our in-memory cache will be full, right? Because we have replicated the pages. Basically, if you and me both cache the same page, we now have two physical copies of the
Starting point is 00:18:32 page, which basically means that if you, for instance, if you have a capacity of 50 pages, right, and we cache, or we have one data set which consists of 50 pages, and we both cache it, our memory is already filled, right? So that means the opportunity to cache also needs to be tamed, right? You can cache a lot of stuff, but at some point you need to evict it. And why this is actually quite hard in such a system is the following. When we cache it and we only decide locally if we want to evict. Basically, there's these nice terms from Vycom, also from the 80s actually, called egoistic and altruistic, meaning that if we have an egoistic strategy to evict, for instance, LRU or LFU, we only consider our own state and our own hotness or recency.
If we evict those pages, it can be that only a single copy of a page is in memory currently. And because network accesses are actually a bit faster than going to SSD, it's actually beneficial if I keep this single page copy in my cache, because you can read it from there quite fast. But if I evict it and you then tell me, hey, I want this page, I need to read it from SSD and send it over, so the latency is quite a bit higher. But if I have a page which is also cached in your in-memory region,
Starting point is 00:19:51 I can evict it actually quite fine, right? I can still read it over fast networks, meaning the latency is much better than when we go to SSD. And to really pull this off, right, we need some global state. We need to know who has it replicated. Am I the only one who has this page replicated? Or is this page replicated in other nodes as well? And this is quite altruistic.
So you need to combine those two strategies in a cost-based manner. There's actually a nice theoretical discussion about it, also from 30 years ago or something, which discusses exactly this problem and also proposes a solution. But we implemented it and it actually didn't work so well. Since then there have been a lot of advancements in computer architecture — we have multi-core and many other things — and keeping a global state across several nodes which process millions of transactions per second is actually quite hard, right? So I think there's a lot of interesting work going on there. Yeah, that was the one thing that kind of jumped out at me when you were mentioning that, is that
as soon as I heard the words 'global state' it kind of triggered me a little bit — I was like, hang on a minute, that sounds like a potential bottleneck, right? And that was what you found in your experiments, right? So, I presume you're working on this at the moment, but what are the sort of ways around maintaining a global state — is there some way you can disaggregate or distribute that in some fashion? Yeah, so one of the things which is actually already distributed is the directories, right? The directories have kind of a global state already: they need to know who has the page cached, because if copies are all cached in shared mode, I need to invalidate them for exclusive mode — because if I want to update a page, the other copies need to be invalidated. So the directory itself already has the global state for every page it maintains, and we can actually use this. If I copy a page, you can tell me what the current count is — how many replicas are currently flying around. If I'm exclusive, I know anyway that I'm the only one; if I want a page shared, you can already tell me, okay, there are four copies. But this can be outdated at some point. And precisely this is hard to update, right? Because if someone evicts a page, they don't want to send messages around the whole
cluster to say, okay, this page is now gone from my cache, please update your count. That's not what we want, right? But what you can do is, because I have a rough notion that there are four replicas, I can try to use this outdated information when I evict. Because the eviction — I mean, it's described in the paper,
Starting point is 00:22:37 it's a bit complicated actually because we use RDMA and stuff. But how the eviction works in a rough, like high-level view is, I ask the directory if I can evict it. The directory reads the page with RDMA like one-sided from my buffer if it needs to be read. For instance, if it's dirty. If not, he just tells me, okay, you can evict it.
There are many reasons why we designed it this way, but it also helps us to have a second chance, basically. Because the directory knows the current state of this page, I can assume that my knowledge is outdated but still make a good decision about the hotness or coldness of the page I'm evicting, and the directory can then re-evaluate this decision and send it back. I see, so it's sort of like a lazy— Lazy, exactly. Yeah, that's a really good way of framing it. And when another node reads the page from me — because I have it cached, you can read it from everywhere — then we update the counter,
Starting point is 00:23:40 so we try to converge and let the page not get outdated by piggybacking information across multiple messages we send around anyway. But if it's really off, the directory can basically step in and say, here, this page is not as cold as you expected. You're the only one keeping it at the moment. Here are
the current counts. Please re-evaluate the decision. Nice — and this performs a lot better? Sorry? Much better. And the interesting thing is that it's actually not only about the number of replicas; we also store the global heat. Basically, if I
evict a page, then I tell you — if you're the directory — how hot this page was when I last updated it. So you know a bit more about it, right? You know how hot the page is: do many nodes maybe want to access it in the near future? So all these things are factored in.
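A toy sketch of the cost-based eviction decision discussed above, combining a node's own 'egoistic' recency signal with the 'altruistic' knowledge of replica count and global heat reported by the directory. The scoring formula and weights are invented for illustration and are not the ones used in ScaleStore.

```python
from dataclasses import dataclass


@dataclass
class CachedPage:
    page_id: int
    last_access: float   # local recency signal, higher = hotter
    replica_count: int   # possibly stale count of copies cluster-wide
    global_heat: float   # heat reported by the directory, higher = hotter


def eviction_score(page: CachedPage, ssd_penalty: float = 4.0) -> float:
    """Lower score = better eviction candidate.

    Evicting the last in-memory copy forces future readers to SSD, which is
    slower than fetching the page over the network from another node's cache,
    so sole copies of globally hot pages are penalised.
    """
    sole_copy = page.replica_count <= 1
    penalty = ssd_penalty * page.global_heat if sole_copy else 0.0
    return page.last_access + penalty


def pick_victim(pages: list[CachedPage]) -> CachedPage:
    # The directory can still veto this choice if its fresher replica count
    # says the page is not as cold as we thought -- the "second chance".
    return min(pages, key=eviction_score)


pages = [
    CachedPage(1, last_access=3.0, replica_count=3, global_heat=0.9),  # replicated elsewhere
    CachedPage(2, last_access=5.0, replica_count=1, global_heat=0.8),  # sole hot copy, keep
]
print(pick_victim(pages).page_id)  # -> 1: the replicated page goes first
```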
And interestingly, it performs much, much better — like twice as good in the aggregated-memory case, because that's the hard case. The hard case is when our capacity is like 90% full; then we only have 10% to operate with, to cache and evict our pages, and then even if the workload fits in the aggregate memory, we start to evict to SSD, and that makes it slow again. Great. So yeah, that's the first research opportunity you outlined. So the next one is
Starting point is 00:25:10 elasticity. So tell us a little bit more about what we're looking at here. In particular, elasticity of the storage is interesting here because this is where we store our pages. In the compute layer, right, we cache them. And this comes basically for free because you can always, like, get new nodes and let nodes leave. And then they repopulate their cache and everything is fine. But currently on the storage side, you cannot easily add new nodes. And the problem currently is how it's implemented is the page IDs, they are kind of hard-coded, right? In the page ID, it basically tells me where the page is stored, like in which slot of
Starting point is 00:25:48 the SSD and which node ID is responsible for it. And clearly this is not really feasible if this one node goes down, right? We need to completely re-evaluate and re-tag all the pages. And note that here, the pages, right, they are like pointers. That means if we use them in some B-trees or some other data structures, we need to update all these things. Not only that we,
this one page needs to be relabeled, but everything needs to be checked and relabeled. So that's actually not possible. And the challenge here — I think one of the things which is actually quite solved is that we need to use consistent hashing to identify which nodes keep the pages. Because when we use consistent hashing, the property is that if one node leaves, we don't need to reshuffle everything.
We basically only need to shuffle a few pages to my neighbor and another few pages to this other node, so every node gets a chunk of the leaving node. And the nice thing is that we can update the data structure which describes our ring, then rehash, and we find the new node on this consistent hashing ring. What is, in addition, quite complex is keeping the state consistent: what happens if a node leaves? It could be that at some point there are two owners, two directories, so we need to swap all the state atomically.
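For reference, here is a minimal consistent hashing sketch along the lines described here: page IDs hash onto a ring, each page is owned by the next node clockwise, and when a node leaves only a small fraction of pages move. The atomic hand-over of directory state that Tobias calls out as the hard part is not shown, and the code and names are purely illustrative.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Toy consistent hashing ring mapping page ids to storage nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node: str, vnodes: int = 64) -> None:
        for i in range(vnodes):  # virtual nodes smooth out the distribution
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owner(self, page_id: int) -> str:
        h = _hash(f"page-{page_id}")
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {pid: ring.owner(pid) for pid in range(1000)}
ring.remove_node("node-b")
after = {pid: ring.owner(pid) for pid in range(1000)}
moved = sum(before[p] != after[p] for p in before)
print(f"{moved} of 1000 pages moved")  # only node-b's share is reassigned
```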
Okay, cool. So have you explored implementing this in ScaleStore at the moment, or is this still very much an opportunity that you're just thinking through — do you have any experiments you've run with this sort of stuff? Not really, it's just a thought experiment for now. No, yeah, cool. But I guess it'd be very interesting to see how it performs. I guess it's on the roadmap, right? Yeah, it's on the roadmap. Cool. So the next opportunity: ACID transactions, and isolation guarantees are my
favorite thing, so tell us a little bit more about this — is isolation going to be solved, and simpler, in this approach? Yeah, tell us all about it. 2PC — we don't need 2PC anymore, right? So that simplifies things. Tell us all about it. So why we don't need 2PC is basically that we own all the data, right? If we touch it, it is transferred to our own local cache, and we are not shipping compute or transactions back and forth between different nodes. So we don't need to vote on a consistent state, because we are
basically responsible for all the data we touched. So we can commit locally, because we have seen all the data and nobody else is involved. That is nice. But what is actually quite hard is isolation. Concurrency control is not a solved topic, and unfortunately there's been only a little research on it in recent years, but a lot from, like, 30 years
Starting point is 00:28:50 ago, 40 years ago. But the question is now, can we actually still reuse these things? In the meantime, there's been a lot of new concurrency schemes, especially in the optimistic concurrency world. We have now Silo, which may be a good fit, but there's so much going on in this world and I'm actually not a big expert on it. But what we have found is that
many of these existing strategies use pessimistic locking, and they couple the locking scheme with the directory server — basically, if you access a page, you also get the lock for it. And the question then is: is that too coarse-grained? Because it's page-based — do we need something more fine-grained? And there's actually a lot of opportunity to do
Starting point is 00:29:31 research. The next opportunity for research you outlined in the paper, and that is cloud infrastructure and services. What do we mean by this? What are the challenges here and the opportunities? Yeah. So because nowadays the cloud age or era basically started, right? We want to think about how can we bring this thing in the cloud? And so far we have assumed that the compute and storage layers
basically are under our control. But that's not really feasible often — or actually, it is feasible, but it's not cost efficient — because AWS S3, for instance, is much cheaper, and it also offers built-in replication, which would maybe be nice; but on the other hand the latencies are much higher. So the question is now how to incorporate these deeper memory hierarchies into such a system, because it's not quite obvious how we do this, right? Because in the cloud the SSD is not really— if it's basically the instance-local SSD,
Starting point is 00:30:33 it can be gone, right? If something really bad happens and the VM is lost, it could be that we lose the data. Because SSD is not like a physical thing where we can go to and we kind of get it out of our system. If something is really bad, it could be gone, right? If we don't use EBS or something like where it's guaranteed that it's back up, it's just
a local instance, then it could be gone. But on the other hand, S3 latencies are much, much higher compared to NVMe SSDs, which are around 80 microseconds. So the question is now how to balance these things, and that's actually not trivial. Because where do we evict to, right? I mean, if we have memory, we need to evict to SSD,
but then from SSD we need to go to S3. Should we write it to both? Yeah, so no really good answer here yet. And another thing is multi-cloud support. Okay. I think multi-cloud support is another thing which is kind of tricky, because in ScaleStore — I mean, it was a research system — we assumed basically that we have RDMA at our fingertips.
Starting point is 00:31:42 That's not really the case in the cloud, right? There's actually up to three like big vendors like Microsoft, Google and Amazon. There's actually only one, maybe Microsoft which offers even RDMA. And that's only for like super expensive HPC instances. And like an alternative to RDMA is maybe EFA, but the latency is still 10 times higher compared to RDMA.
Starting point is 00:32:06 So I'm not really sure if it's twice as good as TCP IP over Ethernet, but it's still not as comparable to RDMA as we would like to. And then they have different stacks, right? So you cannot really easily swap your cloud offering because you need to really adapt source code and stuff. So the only unifying thing you have is basically TCP IP. And that's slower again. So the question is now,
how do you support these different hardware stacks? Yeah, that's a really interesting point. I was having a similar sort of conversation with someone asking me about how widely accessible — how widely available, that's the word I'm looking for — RDMA is in the cloud offerings. And I was like, I think some vendors have it and some don't. But yeah, only Microsoft do it, and only for HPC.
That's interesting. Because, I mean, it's kind of a question. It's actually only 5% of the available instance types at Microsoft that have RDMA — I think we have this calculation somewhere. It's unfortunately very few. And the reason is that RDMA is quite hard to virtualize. Right. So typically these HPC instances are basically bare-metal instances.
Starting point is 00:33:20 That makes sense. So, I mean, I guess the kind of question that jumps out at me from that is, where do we go as hardware keeps evolving over time? I mean, obviously, will this design and these challenges obviously will change as the hardware changes. But is there anything necessarily on the horizon in the short to medium term that's going to maybe revolutionize, but sort of move the goalposts a little bit? Or do you think that this is probably going to be the best way to design a system for the foreseeable future? I mean, it kind of feels like the shared nothing approach is kind of dead, but is pretty provably not the right way to go. Do you think that will always be the case
Starting point is 00:34:06 or do you think there's a way that that might change? That's hard to answer, actually. It's a very... Yeah, it's a... It's a horrible question. I apologize. But I mean, I think the shared nothing architecture still has a lot of merits.
Starting point is 00:34:20 But given the cloud development, right, where everything is disaggregated anyway, it's not really clear if it's like a real traditional shared nothing. Maybe it's basically a partitioned writer, more or less, because we still need to go over the network to write to the external storage, which needs to be disaggregated
because of elasticity reasons and stuff. I think to get the fastest database, a single node is still the one which provides the best latencies. And the question is now: if you get much faster SSDs, then it would be super interesting, because then you can store a lot of data on a single node while having more or less fast access. But until that happens — and since networks actually evolve quite fast too — we can bridge this kind of latency cliff.
We don't really want to go to SSD, because it's actually much better latency compared to disk but it's not as good as RDMA, for instance. And we still have the benefit that we can scale out a lot of reads and handle a huge amount of workload. So I think for most OLTP workloads, a single node is probably still enough. But then for the big, big players,
I think the current architecture maybe has its limitations, because they just have a single writer, right? Yeah, I mean, you hit an interesting point there about most applications — apart from the minority, and I guess that's probably not going to change too much, I wouldn't have thought. The way I'm thinking about it, the vast majority of workloads are going to stay, like you said, within the realms of 'just run it on a single node, it's fine', right? You don't need the stuff the big boys do. I think an interesting point about it is: is it like 99% of the use cases
Starting point is 00:36:28 or is it 99% of the revenue? That's a good point. Because often, I mean, the big players drive, probably the big players drive the most revenue, right? So do you want to, yeah, that's interesting. Yeah, it's probably like, that's a really good way of thinking about it. Yeah, because I was just thinking in terms of like,
oh, well, all the workloads are the same, but if you're only getting, I don't know, a few euros or a few pounds off those, and then you're getting millions or whatever it is off the big boys — forget the little guys, we'll go and build the system for the big boys. Yeah, good business sense. Great. Anyway, so my next question is: as a software developer or DBA or database architect or whatnot, how can I leverage the findings in your work, and, more generally, what sort of impact do you think this work can have? I think as a database administrator, actually, it's quite nice, because having the opportunity to scale up and down at your fingertips
Starting point is 00:37:31 without having user-defined partitioning is actually quite nice, right? Because typically you need to specify partitioning, maybe hand decide which kind of replication you need. But in our system, because everyone can access every data, it's actually quite nice, right? You don't need to have user-defined partitioning because every node can actually just process every transaction it gets. It gets some data and that's fine.
Starting point is 00:37:59 Yeah, that's a nice point. I mean, it's kind of a shift back towards, I guess, that idealistic goal of um having sort of the highest level abstraction possible and not necessarily because i mean a lot of these um like a lot of distributed business shared nothing ones essentially sort of like they almost break the third wall in the sense that you have to make so many or know so much about your workload and your um and the system underneath to like get the optimal partitioning line like you say or you know know that characteristics of the transactions or whatever to sort of leverage
Starting point is 00:38:30 or get good performance right and whereas i guess it's more stop the step towards that sort of yeah just fine run it and it'll the magic will happen behind the scenes right yeah yeah that's cool so hopefully having the least amount of work for users and developers which use the system. For sure, yeah. That would be a really nice impact to have. Cool. So yeah, the next question I have is,
Starting point is 00:38:59 what's been the most, I mean, so I guess this work, did this naturally fall out of the scale store work? I i mean you can tell us a little bit more about the backstory and the work but what's been the most sort of interesting maybe unexpected thing you've learned by whilst working on this on this paper i think it ties back to the taxonomy because it was not really clear from the beginning right that we need a new one we tried to get along with the old one, but it got actually quite complicated to communicate with each other even. So when I talked to Phil, I mean, we were not really on the same page often because we had different imaginations of a shared nothing database or a shared this database or how it's actually currently in the cloud and where the line is drawn for the different architectures. So I think that was the most surprising thing that it was actually super hard to communicate your idea or our ideas clearly.
Starting point is 00:39:54 And that made it actually, the new tech someone actually made it only after the revision. So basically because they also had had troubles the line like really drawn out. And yeah, we thought about it afterwards that we maybe need to think about differently. Namely the data access part are actually quite important for our like goal, so to say. So that was super, super surprising for me. Yeah, that is really interesting.
Starting point is 00:40:24 I mean, I guess on guess, let me be honest, maybe it's not like a slight tangent, but it's kind of related. But I know I always sort of struggle when trying to compare different, just different distributed transaction protocols. I've spent a lot of time looking into those and trying to sort of get the same sort of baseline to compare off is almost sort of difficult as well and just sort of i guess like you say we all have our own view of of what the systems would be like and it's only when you start communicating with somebody like you realize that you've both got two totally
Starting point is 00:40:58 different mental models of how the world's working but yeah no that's that's a really interesting um really interesting experience you had but i guess hopefully now we've got the answer we've got a nice clean taxonomy at taxonomy for all for us all to learn and work on hopefully okay i guess we've kind of maybe touched on this next question with answering that but i mean from the initial idea to the to the the actual publication and what was the sort of thing you tried along the way that failed i'm guessing the initial taxonomy was one thing right and but yeah is there anything else that you kind of that you can share with us on that
Starting point is 00:41:36 i think because before we asked phil bernstein i think one of the kind of failures was that we overlooked a lot of very important work like 40 years ago. We just didn't know that there was so much work in that direction because they were also labeled differently, right? There was like our shared caching system basically was labeled data sharing. And nowadays, when you think about data sharing, right, you think about the data sharing offers of like Snowflake, for instance, which allow you to share data between different companies. That's why it's actually quite hard to find this work. But luckily, Phil had a lot of insight into all these super nice, high-quality research papers. And that was actually really enriching to read those.
Starting point is 00:42:22 There were a lot of nice ideas. And interestingly, the papers from back then, they're so much different to nowadays papers. Have you read such old paper at some point? I've read one or two. Like I've read a few of the ones like Jim Greer's locking paper and a few of his stuff. And I occasionally read a few, but yeah, they're very different, right?
Starting point is 00:42:44 They have like a conceptually depth, right? They are so deep in the conceptual, like so much pages are written about the conceptual things, but the evaluation is actually, if they have even one, often they don't have actually an evaluation, right? They're just, basically contribution is the conceptual idea, which is at one point super nice, right? Because those ideas are often very very nice like arias or something and then i mean the question is now do we need to re-evaluate those on modern hardware because sometimes i think some of those are maybe conceptually still sound and work but it's still quite challenging to implement them efficiently and i think part of our current research is also
Starting point is 00:43:24 making things very efficient right i mean in-memory databases kind of started with the thing like we have many new indexes like originating from this research and i think it's it's nice to combine old ideas with new modern hardware yeah for sure i mean it's it's interesting what you say there about sort of finding uh and the name like so the name might be different right and then you find like I mean, it's interesting what you say there about sort of finding, like I said, the name might be different, right? And then you find like 20 other papers that are related with it and just so many good ideas.
Starting point is 00:43:53 I mean, there is that kind of, I guess, I've mentioned it a few times, of like everything's already been tried in database. All the ideas were done in the 70s and the 80s, right? All the cool ideas have already been done. Everyone's already done them, right? it's just there's nothing there's nothing new they did it in system r and it did work like sort of thing um and it's so interesting to read them right i mean and so many nice cool ideas and then you yeah you're just impressed actually how much work has been done in that field yeah it's wild it really is wild um there's so much to learn and when you think you've you've learned it all there's always about 100 other papers to read
Starting point is 00:44:29 so yeah um but yeah no cool um so i guess you mentioned it it's probably a nice segue into it's like what's next on the research agenda and you mentioned that you're you're writing up are you going to stick around in in academia and continue with research yeah i will be staying actually i will be staying in our group as a postdoc for one more year to kind of see where the journey will go we'll probably work a bit more on scale store and then then let's see nice is the long-term goal for scale store to be is it a commercial system or is it still very much research prototype very much a research prototype? Very much research prototype.
Starting point is 00:45:07 And it's not clear if there's an ambition to do it as a startup or something. But I mean, it's nice to develop such a system. You learn a lot, right? You get new ideas and I think that's a lot of fun. And I think that's currently
Starting point is 00:45:21 still the main goal. Yeah, nice. That's great stuff. So you're going to stick around and work on ScaleStore. That's awesome. lot of fun and i think that's currently still the main goal yeah nice that's great stuff so um you're gonna stick around and and and work on on scale so that's awesome so are there any sort of specific topics you're going to be researching on i think mainly the eviction concurrency control and then uh recovery like for us i think those are the big points which are still unclear. Fantastic.
Starting point is 00:45:48 And I mean, my next question I normally ask is, can you tell your listeners about your other research? But I guess the Scale Store project is that sort of big in itself, that kind of all the research happens within that project, right? I wish, right? kind of all your research happens within that pro that project right i wish right actually we have another paper on on at sigmo now accepted like just recently accepted it's about one-sided rdma and synchronization primitives because that's actually something quite interesting because there are many papers which use one-sided rdma nowadays because it's also a very intriguing
Starting point is 00:46:25 like primitive and for those of you who don't know one-sided RDMA nowadays because it's also a very intriguing, like primitive. And for those of you who don't know one-sided RDMA, it actually allows you to read and write remote memory almost as if it would be your own memory with quite low latencies, like in the range of single-digit microseconds, basically. You can write four kilobyte into another node's memory super fast. And interestingly, because we're database people, we need to synchronize
Starting point is 00:46:48 our stuff, because otherwise you get inconsistencies. And they provide atomic primitives, which they work fine, they synchronize and stuff, but they are not as efficient. Similar to in the multi-core CPU, if you use atomics, C++
Starting point is 00:47:03 atomics, you have a lot of cache line bouncing. There are different reasons in the network, but I mean, the outcome is still the same. They don't scale very well. And there are many reasons why that is the case, which is actually quite interesting. And we outlined optimization-like potential. And then there are other systems which use optimistic concurrency control schemes, such as Farm, like a big system developed for Microsoft, which actually has a nice working optimistic
Starting point is 00:47:28 concurrency control scheme for RDMA. So that one is actually working on x86 at least. But there are other schemes which are actually broken. So there are many publications about optimistic concurrency control or optimistic simulation primitives with RDMA, which
Starting point is 00:47:44 actually don't work that was our surprising like finding yeah they don't work I'm trying to think on the farm stuff there was one paper I remember reading once about because it's like called opacity opacity
Starting point is 00:47:59 farm project right there's a few of them they've all got like they're not very easy to remember those projects because they've got like acronyms based around. Exactly. Yeah. It's hard to remember their names, but yeah. And that was, I think, in the transaction protocol, right?
Starting point is 00:48:20 But even on the like latching level, if you use one-sided RDMA in an optimistic fashion without atomics, it's actually quite hard to get the thing right. Because it's not really specified how they behave. And I think that's the main gist. There's no specification around this. So you need to refer to the underlying protocols, like look at PCIe and stuff and see how the cache coherence protocol works on x86 and your target destination so it's actually quite interesting yeah yeah it sounds it for sure i'm cool so these these um this this next question is an interesting one so i'd like to get your let's get your right your your thoughts on this. How do you go about generating ideas and then specifically selecting which ones to pursue, right? Because it's quite a
Starting point is 00:49:11 difficult task. Because you always do really nice, interesting work. I'd love to know how your process behind that works. I think, to be honest, nowadays it actually comes quite naturally. Because when you build such a system like Scans, you see a lot of things which don't really work. So those are for sure interesting areas you can expand your work and then basically where you're interested most. The signal paper
Starting point is 00:49:34 synchronization thing basically came because my first paper, my first signal paper used optimistic synchronization and actually to my shame it is broken. Like my own synchronization scheme didn't really work. Oh, no. Because it's really subtle, right?
Starting point is 00:49:48 It's so hard to figure out why it's not working. And the thing is, it's actually quite rare that you see these bugs. So you really need to test for those. So in a research setting, that's why also many papers use some broken optimistic synchronization primitives. Because it's actually quite hard to figure those things out. There's a really minor chance that you hit these cases but i mean still nevertheless they are i mean they don't really synchronize which defies the purpose of synchronization exactly right i mean it might happen one in a million times but one in a million times it happens it's gonna
Starting point is 00:50:19 knacker your whole system right and then you've not got yeah i mean and you hit on an interesting point there like testing and observability of these sorts of issues i know when i've worked on developing concurrency control protocols before and stuff like the testing is just impossible to get right i mean it's really really hard because i mean even if you prove your algorithms correct or whatever right on it with a pen and paper you've got it it's very challenging to do the same with your actual implementation, right? It's a challenge, big challenge for sure. Cool.
Starting point is 00:50:51 Yeah, so on challenges, what do you think is the biggest challenge in our research area, in databases right now? Yeah, that's a good question. And one I actually also struggle a lot with because I think there is not, I'm not sure if there's a biggest challenge, but certainly there are important challenges. And I think many of them are connected to the cloud because one of the things which
Starting point is 00:51:15 is maybe approaching, I'm actually not really sure if that's the case, but I mean, for sure we have hardware accelerators, right? I mean, they're currently developed and also used at some point. And I don't know how we should like unify all these different things, right? I mean, they're currently developed and also used at some point. And I don't know how we should unify all these different things, right? I mean, it's so hard to even talk about distributed databases itself. But then if you need to navigate different operators to different accelerators, that makes it actually much harder, right? You have different... It's not really sure how this is done. there's no really good primitive for now to actually instruct those things and i think gustavo alonso said a nice thing where he said basically electrical engineers they don't know about abstraction and i mean that's why some of
Starting point is 00:51:57 the things which are super nice they're actually quite hard to use we're currently experimenting with like a program between the switch which uses p4 and it's so hard to get it right i mean there's so many things right so many accelerators or broken network cards it's it's just hard it's not really clear how to use them in a like in a real system i don't know how to implement all these different things then you have different vendors and stuff i don't know yeah it feels like the the sort of i guess it's somehow but like the diversity in hardware is yeah maybe the heterogeneity is is a massive sort of um challenge challenge for sure yeah and then sort of navigating that is yeah yeah that's why i don't know how to do it actually because even if i have two different networks,
Starting point is 00:52:45 as I mentioned before, EFA and RDMA, which actually sort of work very similar, they have completely different non-trivial performance characteristics. You need to specifically tune your system to that. And I think that ties back to the next point, is cost efficiency, right? In the cloud, performance is basically cost efficient. More or less like
Starting point is 00:53:05 performance per dollar if you if you pay for your network and you don't really utilize it it's not so good yeah so that's another challenge i think which is where i think there's a lot of potential to really optimize our current resources as best as possible that's a yeah that's a really good point like the economics of it start to play into it. I actually remember the talk recently. It might have been Victor who gave it. Yeah, at Damon. Yeah, at Damon at Sigmund, right?
Starting point is 00:53:35 Yeah. Yeah, I remember reading that. I'm literally listening to it. My background, my undergraduate degree was in economics. Me too. Was Victor also actually no way really I mean that's why I was like oh wow that's wild so yeah no I was kind of naturally drawn to that talk and it's intriguing right yeah yeah for me as well that's wild he's a really big opponent of these like economic thinking and how to commoditize different things in the cloud no that's really cool oh wow um cool yeah and now um time for the last word um what's the one key thing you want the listeners to take away from
Starting point is 00:54:18 from this work from your work in this uh podcast today yeah i think the goal with our paper was like a thought-provoking paper to basically shift the direction from only share nothing research to also different architectures. And obviously, given our paper, the takeaway need to be contribute to scale store or at least look at different designs,
Starting point is 00:54:38 how we can advance the field of distributed databases. Brilliant. Let's end it there. That's a great line to finish on. Thanks so much again, Toby. It's a great line to finish on. Thanks so much again, Toby. It's great to have you on again.
Starting point is 00:54:48 And hopefully this is not the last time you come on. We can get you on again at some point in the future. And you can keep telling us about Skillstore and all the cool work
Starting point is 00:54:56 you guys are doing. And as always, I'll put links to everything in the show notes. And we'll see you all next time for some more awesome computer science research
