Disseminate: The Computer Science Research Podcast - Tobias Ziegler | ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA | #9

Episode Date: August 1, 2022

Summary: In this episode Tobias talks about his work on ScaleStore, a distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability. Using low-latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore stores cold data on NVMe SSDs (flash), lowering the overall hardware cost significantly. At the heart of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. Tobias also talks about how the caching protocol provides strong consistency in the presence of concurrent data modifications.

Questions:
0:56: What is ScaleStore?
2:43: Can you elaborate on how ScaleStore solves the problems you just mentioned? And talk more about its caching protocol?
3:59: How does ScaleStore handle these concurrent updates, where two people want to update the same page?
5:16: Cool, so how does anticipatory chaining work and did you consider any other ways of dealing with concurrent updates to hot pages?
7:13: So over time pages get cached, the workload may change, and the DRAM buffers fill up. How does ScaleStore handle cache eviction?
8:57: As a user, how do I interact with ScaleStore?
10:19: How did you evaluate ScaleStore? What did you compare it against? What were the key results?
12:31: You said that ScaleStore is pretty unique in that there is no other system quite like it, but are there any situations in which it performs poorly or is maybe the wrong choice?
14:09: Where do you see this research having the biggest impact? Who will find ScaleStore useful, who are the results most relevant for?
15:23: What are the most interesting or maybe unexpected lessons that you have learned while building ScaleStore?
16:55: Progress in research is sort of non-linear, so from the conception of the idea to the end, were there things you tried that failed? What were the dead ends you ran into that others could benefit from knowing about so they don't make the same mistakes?
18:19: What do you have planned for future research?
20:01: What attracted you to this research area? What do you think is the biggest challenge in this area now?
20:21: If the network is no longer the bottleneck, what is the new bottleneck?
22:15: The last word now: what's the one key thing you want listeners to take away from your research?

Links: SIGMOD Paper, SIGMOD Presentation, Website, Email, Twitter, Google Scholar

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby. This is episode nine of our SIGMOD 2022 series, and I'm delighted to say I'm joined today by Tobias Ziegler, who will be talking about his paper, ScaleStore: a fast and cost-efficient storage engine using DRAM, NVMe and RDMA. Tobias is a PhD student at the Technical University of Darmstadt and his research interests are distributed systems with a focus on fast networks. Tobias, thanks for joining us on the show. Let's dive straight in. What is ScaleStore?
Yeah, as you mentioned, ScaleStore is a distributed storage engine which exploits DRAM, NVMe (meaning flash) and RDMA. I think the most interesting question is why we chose those three technologies. When we look at the price development of the last 20 years, we can see that DRAM is not getting cheaper anymore; prices are stagnating. In contrast, flash is getting cheaper and cheaper, so there is quite an economical incentive to store cold data on flash. And that is what recent papers suggested: there are many single-node systems, such as LeanStore or Umbra, which do exactly this. They use fast DRAM latencies and evict cold data to flash storage. The reason why we also use RDMA is the following: when the working set continues to grow, at some point it spills to SSD. SSD latencies are not terrible for cold data, but they are very bad for hot data, and performance will drop by two orders of magnitude. RDMA (remote direct memory access) is a fast network which allows us to access remote memory almost as if it were our own local memory, with very fast latencies in the range of single-digit microseconds. So there is an opportunity to distribute your hot set across multiple nodes and access it with RDMA, and then you have much lower latency compared to reading from SSD.
But you still want to be able to evict cold and unused data from expensive DRAM to flash SSDs. And when we are in a distributed setting, I think what is special about ScaleStore is that we have a caching protocol, or coherence protocol, to keep data consistent.
Can you elaborate on how ScaleStore solves the problems you just mentioned? Maybe talk a little bit more about its caching protocol.
Yeah, so I think the main point of ScaleStore is the fact that we don't need to keep all data in memory, because we can access remote data via the caching protocol. For instance, if you have a page and I have a page, just an ordinary paper page, then instead of telling you what I need to do on your page, we can just go to Google Drive, right? It's a similar metaphor with ScaleStore: the caching protocol allows us to keep data coherent, and we can work at the same time on our piece of paper, or in our case on the data. What that gives us is the ability to dynamically adapt the data placement to the workload, meaning that if I access certain pages very often and you access certain pages very often, we both have them in our DRAM. But the problem is, when you want to make an update, somehow I need to know about this update to keep the data consistent. That's basically the main gist, I would say.
How does ScaleStore handle these concurrent updates, where two people want to update the same page?
Yeah, so in general we have a directory-based caching protocol, meaning that each node is the directory for, and has information about, some of the pages. By this information we mean: who currently stores the page in their memory? These nodes are called owners in our case. So if I am the directory of page one and you have the page currently in your memory, then I track that you are an owner of my page. And there are two different ownership modes. One is shared ownership and one is exclusive ownership. With shared ownership, we can both have the page in memory at the same time, but we are only allowed to read it. If you want it exclusively, you tell the directory that you want exclusive ownership and get it granted, and then you can modify the page without any interference from other nodes, because you have it exclusively.
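To make the directory idea concrete, here is a minimal sketch of the per-page bookkeeping a directory node might do. This is not ScaleStore's actual code: the type names and the simplified grant logic are assumptions for illustration, and the real protocol additionally has to invalidate remote copies and ship the page contents via RDMA messages.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using PageId = uint64_t;
using NodeId = uint32_t;

enum class OwnershipMode { None, Shared, Exclusive };

// Hypothetical per-page metadata kept by the directory node responsible for that page.
struct DirectoryEntry {
    OwnershipMode mode = OwnershipMode::None;
    std::unordered_set<NodeId> owners;  // nodes currently caching a copy of the page
};

class Directory {
public:
    // Shared (read) access: many shared owners may cache the page at the same time.
    bool requestShared(PageId pid, NodeId requester) {
        auto& e = entries_[pid];
        if (e.mode == OwnershipMode::Exclusive) return false;  // real protocol: downgrade the exclusive owner first
        e.mode = OwnershipMode::Shared;
        e.owners.insert(requester);
        return true;
    }

    // Exclusive (write) access: granted only once no other node holds a copy.
    bool requestExclusive(PageId pid, NodeId requester) {
        auto& e = entries_[pid];
        bool onlyCopyIsRequesters = e.owners.empty() ||
                                    (e.owners.size() == 1 && e.owners.count(requester) == 1);
        if (!onlyCopyIsRequesters) return false;  // real protocol: invalidate/collect the other copies, then grant
        e.mode = OwnershipMode::Exclusive;
        e.owners = {requester};
        return true;
    }

private:
    std::unordered_map<PageId, DirectoryEntry> entries_;
};
```

In the real system these requests arrive as low-latency RDMA messages rather than local function calls, but the two ownership modes and the owner tracking are the part of the protocol described here.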
And there is a special case where we need to deal with concurrent updates in a robust fashion: when we have a very hot page that we all want to update. This is solved by anticipatory chaining.
Cool, so how does anticipatory chaining work? And did you consider any other ways of dealing with concurrent updates to hot pages?
So imagine we have five nodes and one page which is very hot and needs to be modified by all five nodes, and one directory which is responsible for this page. The very naive solution would be busy polling, right? All five nodes go to the directory and say, hey, we want to modify this page exclusively. But only one of us can actually win, because we all want it exclusively. And that would be bad, because we need to send many, many messages, and the nodes that didn't get the page need to retry, meaning they send even more messages. But we know from the fact that we want exclusive access that eventually only one can win; not all of them can win at the same time. So the next better solution would be to have a queue on the directory and order the requests in the order they arrive. But that still costs a lot of messages, and we can do better. Our solution is anticipatory chaining, and we designed our protocol to support this feature. It is maybe a bit hard to explain without visuals, but the intention is that if all of us want the same page from the directory, for instance you, Jack, and me, then you go to the directory and you get the page, and instead of me also going to the directory, I immediately wait at your place for you to hand the page over to me. It's a bit like a relay race, where the page is handed over from node to node to node without any additional messages once we know where the page is initially stored. We anticipate the owner that comes before us. In essence, we have a chain of dependencies across all five nodes, and the page is handed from node to node to node.
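A rough sketch of that chaining idea follows, under the assumption that the directory only needs to remember the current tail of the chain and point each new requester at its predecessor. The class and method names are made up for illustration, and the RDMA messaging, invalidations, and actual page transfer are omitted.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

using PageId = uint64_t;
using NodeId = uint32_t;

// Hypothetical directory-side bookkeeping for anticipatory chaining: rather than
// queuing requests and answering retries, the directory tells every new requester
// which node will hand the page over to it, and records the requester as the new tail.
class ChainingDirectory {
public:
    // Returns the node the requester should wait on; if there is no chain yet,
    // the page comes from the directory itself (represented by `directoryNode`).
    NodeId requestExclusive(PageId pid, NodeId requester, NodeId directoryNode) {
        auto it = chainTail_.find(pid);
        NodeId predecessor = (it == chainTail_.end()) ? directoryNode : it->second;
        chainTail_[pid] = requester;  // requester becomes the new tail of the chain
        return predecessor;
    }

    // Once the last writer in the chain is done, the chain for this page is cleared.
    void release(PageId pid) { chainTail_.erase(pid); }

private:
    std::unordered_map<PageId, NodeId> chainTail_;
};

int main() {
    ChainingDirectory dir;
    const NodeId directoryNode = 0;
    // Nodes 1..5 all want page 42 exclusively; each learns whom to wait for.
    for (NodeId n = 1; n <= 5; ++n) {
        NodeId pred = dir.requestExclusive(42, n, directoryNode);
        std::cout << "node " << n << " waits for the page from node " << pred << "\n";
    }
    return 0;
}
```

With five requesters the directory is contacted once per node, and afterwards the page hops directly from writer to writer like the relay race described above, instead of every node repeatedly polling the directory.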
So over time, pages get cached, the workload might change, and the DRAM buffers fill up. How does ScaleStore handle cache eviction?
Yeah, so in a normal single-node buffer manager or storage engine, you would just find pages you don't want anymore and write them to SSD. But since we are distributed, there is only one location where we can write a page to, and that is its directory, meaning that our eviction protocol needs to work very closely in concert with the caching protocol itself: we need to tell the directory that we don't have a copy anymore, send the page back to the directory, and then it is eventually written to SSD. To make this efficient, we have an epoch-based LRU approximation, in which we decouple the tracking of access times from the eviction itself. The workers track the access times, that is, when we accessed a page most recently, and there are dedicated threads, the page providers, which do the eviction to SSD. They not only write pages to SSD, but also contact the directories of pages for which we are not the directory ourselves, and handle the protocol details with the page provider on that directory to send the page over and have it written to SSD. I think the key points here are that we have a mechanism to track the hotness of pages efficiently, where we use epochs instead of updating a counter or a timestamp every time we access a page; we only do it once in a while. And then the page provider samples over the hash table, which is essentially random, and evicts the coldest X percent it can find in this sample.
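Below is a minimal sketch of such an epoch-based LRU approximation: workers only stamp the current epoch on a frame when they access it, and a background page provider samples frames at random and returns the coldest fraction of the sample as eviction candidates. The structure and names are illustrative assumptions, not ScaleStore's actual implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <random>
#include <vector>

using PageId = uint64_t;

// Hypothetical frame metadata: recording the epoch of the last access is much
// cheaper than maintaining an exact LRU list on every page access.
struct Frame {
    PageId pid;
    uint64_t lastEpoch = 0;
};

class EvictionSketch {
public:
    // Advanced periodically in the background, not on every page access.
    void advanceEpoch() { epoch_.fetch_add(1, std::memory_order_relaxed); }

    // Worker path: stamp the frame with the current epoch.
    void touch(Frame& f) { f.lastEpoch = epoch_.load(std::memory_order_relaxed); }

    // Page-provider path: sample a few frames at random and return the coldest
    // fraction of the sample as eviction candidates.
    std::vector<PageId> pickVictims(std::vector<Frame>& frames, size_t sampleSize,
                                    double coldestFraction) {
        if (frames.empty()) return {};
        std::uniform_int_distribution<size_t> dist(0, frames.size() - 1);
        std::vector<Frame*> sample;
        for (size_t i = 0; i < sampleSize; ++i) sample.push_back(&frames[dist(rng_)]);
        std::sort(sample.begin(), sample.end(),
                  [](const Frame* a, const Frame* b) { return a->lastEpoch < b->lastEpoch; });
        std::vector<PageId> victims;
        auto n = static_cast<size_t>(sample.size() * coldestFraction);
        for (size_t i = 0; i < n; ++i) victims.push_back(sample[i]->pid);
        return victims;
    }

private:
    std::atomic<uint64_t> epoch_{1};
    std::mt19937_64 rng_{42};
};
```

In the distributed case the chosen victims still have to go through the caching protocol, that is, a page is sent back to its directory before it can be written to SSD, which is exactly the coupling between eviction and coherence described above.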
As a user, how do I interact with ScaleStore?
That's a good question. So first of all, what you get as a user is that everything is transparent, meaning that ScaleStore will deal with caching to get the best performance and also evict cold data to SSD. Currently you interact with it as a library, and the interface is page guards. They are similar to lock guards, because when we want to access a page we can specify what we want to do with it. For instance, if we only want to read it, there are page guards which give you shared or optimistic access, or if we want to modify it, we get exclusive access. What this allows is, for instance, if you have a B-tree, our data is organized in pages and every page is a node in this B-tree, then we can get the page ID of the root, instantiate such a page guard in read-only mode with that page ID, and traverse from there. Once the page guards are destroyed, everything is unlatched, and it is ensured that the protocol can continue, without having to deal with unlatching by hand. It's actually quite easy to use.
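Here is a small sketch of what a guard-based, read-only B-tree traversal can look like. The guard class, the toy in-memory page store, and the node layout are assumptions made for illustration; they are not ScaleStore's actual API, where constructing a guard would latch the page and fetch it through the caching protocol if necessary.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <vector>

using PageId = uint64_t;

// Toy page: either an inner node with separator keys and children, or a leaf.
struct BTreeNode {
    bool isLeaf = false;
    std::vector<uint64_t> keys;
    std::vector<PageId> children;   // inner nodes only
    std::vector<uint64_t> values;   // leaves only
};

// Toy page store standing in for the distributed buffer manager.
std::map<PageId, BTreeNode> g_pages;

// Hypothetical RAII guard: in the real system, construction would acquire shared
// access to the page and destruction would unlatch it for the protocol.
class SharedGuard {
public:
    explicit SharedGuard(PageId pid) : node_(&g_pages.at(pid)) {}
    const BTreeNode& operator*() const { return *node_; }
private:
    const BTreeNode* node_;
};

// Read-only lookup: every level of the tree is accessed through a shared guard,
// and each guard is released automatically at the end of its scope.
std::optional<uint64_t> lookup(PageId rootPid, uint64_t key) {
    PageId current = rootPid;
    while (true) {
        SharedGuard guard(current);
        const BTreeNode& node = *guard;
        if (node.isLeaf) {
            for (size_t i = 0; i < node.keys.size(); ++i)
                if (node.keys[i] == key) return node.values[i];
            return std::nullopt;
        }
        // Descend into the child whose key range contains the search key.
        size_t i = 0;
        while (i < node.keys.size() && key >= node.keys[i]) ++i;
        current = node.children[i];
    }
}

int main() {
    g_pages[1] = BTreeNode{false, {10}, {2, 3}, {}};           // root
    g_pages[2] = BTreeNode{true, {1, 5}, {}, {100, 500}};      // leaf for keys < 10
    g_pages[3] = BTreeNode{true, {10, 42}, {}, {1000, 4200}};  // leaf for keys >= 10
    auto v = lookup(1, 42);
    std::cout << (v ? *v : 0) << "\n";  // prints 4200
    return 0;
}
```

The point mirrored here is the RAII behavior described above: once a guard goes out of scope the page is released automatically, so nothing has to be unlatched by hand.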
So how did you go about evaluating ScaleStore? What did you compare it against? What were the key results from your evaluation?
Yeah, I think we had three parts in our evaluation. We first showed the characteristics of ScaleStore itself: how does ScaleStore actually scale with an increasing number of servers? Next, we showed how we handle larger, ever-growing data sets with different kinds of locality. There we can clearly see that we can in fact evict unused data to SSD, and locality is very beneficial for this, because we can keep the hot set, which is then only a fraction of the complete data, in memory and evict cold data to cost-efficient SSDs. The third part of our evaluation consisted of a system comparison for different settings, because ScaleStore can be used single-node and distributed. We first compared against in-memory-only systems: there we used just a simple B-tree, which is in-memory only, and compared it against ScaleStore, and obviously we can see that there is some overhead due to the tracking of locations, meaning whether a page is in memory or on SSD. Then we compared, in an out-of-memory use case, against LeanStore, which is a recent high-performance storage engine for single nodes, and we are on par with LeanStore. Finally, we compared in a distributed setting against highly optimized in-memory engines. Because there is no other system quite like ScaleStore, we couldn't really compare in a fair manner, but we compared against FaRM and NAM-DB, which are in-memory-only distributed systems. The key results, I would say, are that locality is very beneficial for ScaleStore, and that is actually not uncommon in databases; you often access the most recent tuples, for example. And anticipatory chaining makes the protocol very robust against write skew: when we recall the previous example where five nodes accessed the same page,
there we are quite robust compared to our competitors.
You said there that ScaleStore is pretty unique, in that there is no other system quite like it. But are there any situations in which it performs poorly or is maybe the wrong choice?
Yeah, maybe I wouldn't say poorly, but not optimally, in the sense that, for instance, if we only have in-memory data, then we add some overhead, because we have this hash table translation where we need to translate page IDs to in-memory objects, whereas a pure in-memory database can access the data directly via a pointer or via RDMA. We can see this in the evaluation when we have a random workload with no locality and compare against in-memory systems such as FaRM, which is highly optimized: FaRM is faster in this setting, but we get better with more locality and then outperform it, because we can use local DRAM performance. So if your data fits in memory and is not expected to grow, and you don't really have cold data but only a working set, then I think you should opt for a highly optimized distributed in-memory database system. And random workloads are not optimal for us, because we profit from the fact that we can cache in DRAM. If we need to access a lot of data over the network, we have larger network transfer latencies, because we organize data in four-kilobyte pages, and that is also the unit we send over the network, compared to other distributed in-memory systems, which only need to send the tuples. Tuples are usually much smaller than four kilobytes, so we have a higher latency cost, I would say.
Where do you see this research having the biggest impact? Who do you think will find ScaleStore useful and who are the results most relevant for?
Yeah, that's a good question. Actually, I hope that everyone thinking about building distributed systems
with the capability to cache and evict finds it useful. But a more realistic assessment would be that currently, since ScaleStore uses InfiniBand, not many groups in the world can actually use it. For industry that's much the same, because InfiniBand is not really broadly spread in industry. And when you look at the cloud, of the top three cloud vendors only Microsoft Azure even offers RDMA. So that kind of limits its use currently. But of course, we can adapt the protocol to handle arbitrary networks. They have different characteristics; for instance, Ethernet has much higher latency, so it could be beneficial to read from SSD instead of from the remote nodes, and so on. So you can definitely build a system which takes the latencies as an input and then behaves in a smarter way. Currently, everything is optimized for RDMA.
So what was the most interesting or maybe unexpected lesson that you learned while building ScaleStore?
I think one of the key findings we had is that caching is really key for good performance. Even though the network is getting faster and faster, it is still orders of magnitude away from DRAM latencies. That's not really surprising, but a recent trend is that you decouple databases, for instance NAM-DB, which then accesses every data item over the network, and that is actually much slower; you would need to use caching additionally to really be on par with highly optimized distributed systems. And eviction is, from the cost perspective, very beneficial. I would also say one of the cases where ScaleStore is really good is when you have hot read skew, because then we can dynamically replicate the hot pages on all the nodes, driven by the workload, which allows us to really leverage the performance in read-only workloads. Other systems struggle with that when we do not intervene manually. For instance, the typical architecture previously was static partitioning or sharding: we need to know the queries and the workload beforehand, and then we can partition the data cleverly and replicate the hot items by hand, but it is not automatic and not done at runtime.
So whilst we're on this line of thinking: progress in research is sort of non-linear, right? So from the conception of the idea to the end product, the end goal, were there things that you tried that failed? What were the dead ends that you ran into that other people could benefit from knowing about so they don't make the same mistakes?
Yeah, that's a good question. I think what is really beneficial is doing micro-experiments first. What we evaluated first was the directory-based protocol versus a snooping-based protocol, where we kind of listen to all messages and need to broadcast to every participant of our cluster. But this obviously didn't scale as well as the directory-based alternative. So that was definitely one thing where we won some time, because we didn't build the full system first but started with some micro-experiments. And in general, I think distributed programming is very hard in itself, and handling all the edge cases is very, very time consuming. In addition, such a system has many interactions between different components, like the page provider, the message handler, the hash table, and the protocol itself, and that is quite challenging from the performance tuning point of view as well. Additionally, using RDMA and flash is not trivial either; they sometimes have many strange behaviors and are actually quite hard to use.
So what do you have planned next for ScaleStore, for future research?
Yeah, that's actually a good question, because I will be finishing up soon, but the vision is
building a distributed OLTP system in the cloud, because many cloud-native OLTP systems still have a primary-secondary architecture. And the question is, why is that actually the case? Many of the current production-ready databases are still primary-secondary, and we think it's because they were invented in the era of slow networks, where the network was the main bottleneck. In research there are actually many proposals with different ideas for how to build distributed OLTP databases, and I think there are many open questions. For instance, there's no verdict yet on which is better: sending compute to data, or data to compute. Basically, if you partition the data, and then you have a query which you split into multiple chunks and send to all the participants, that is sending compute to data. The other approach, as in ScaleStore, is that we fetch the data. And this has implications for other components, for instance two-phase commit: if you send compute to data, you need two-phase commit, but if you get the data yourself, you don't need it, because only one node is involved in the transaction and it can get all the data. So ScaleStore is the foundation for the distributed OLTP system which we want to build in the future, and I think there are many, many open questions, for instance how one designs a concurrency control scheme without destroying the scalability of ScaleStore. That is not so easy.
What attracted you to this research area? I know we just touched on it a little bit in the previous question, but what do you think is the biggest challenge in this area now?
Yeah, that's a good question. I think the biggest challenge in this area is really to come up with an
OLTP system which scales well with new technologies such as fast networks. And what attracted me was actually the fact that network accesses are so fast now. That's incredible. When we think of TCP/IP or something, which we know as normal users, and now we have RDMA, which is in the single-digit microsecond range, that is quite unbelievable.
Yeah, definitely.
And building distributed systems with RDMA is actually interesting, because usually, if you build a system, you kind of know your bottlenecks; for instance, the network has always been the bottleneck. Now that's not the case anymore, so you often need to adapt the whole design of the system, and new bottlenecks will be revealed, because the old bottleneck, which was known, is not the bottleneck anymore. And then it's quite interesting to design such a new system, because you have very different properties.
If the network is no longer the bottleneck, what is the new bottleneck?
Yeah, I think maybe it's a bit of an overstatement to say the network is not a bottleneck, because for OLTP that's actually still mostly true. But it's much faster, and you can have different design choices. For instance, shipping data over the network is now actually a design choice which you can really consider; previously, that was not the case. And for analytical databases, I think it's already true that the network is no longer the bottleneck. For instance, Snowflake uses a completely disaggregated architecture where they read everything over the network. And I think what is challenging, and I think your next guest will talk about it, is that you have many accelerators in different places in the network, and you need to employ caching very cleverly to get the best performance out of all these accelerators, which are distributed over the network and maybe even in the switch. I think that is very challenging.
So it's time for the last word now. What's the one key thing you want listeners to take away from your research?
The one key thing is that ScaleStore is a building block for distributed systems. I think that's the main takeaway.
Fantastic. And we will end it there. Thanks so much, Tobias. If you are interested in knowing more about Tobias' work, we will put all the links to the paper and all the other relevant materials in the show notes. Thanks so much for listening and we will see you next time.
