Disseminate: The Computer Science Research Podcast - Tobias Ziegler | ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA | #9
Episode Date: August 1, 2022

Summary: In this episode Tobias talks about his work on ScaleStore, a distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore stores cold data on NVMe SSDs (flash), lowering the overall hardware cost significantly. At the heart of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. Tobias also talks about how the caching protocol provides strong consistency in the presence of concurrent data modifications.

Questions:
0:56: What is ScaleStore?
2:43: Can you elaborate on how ScaleStore solves the problems you just mentioned? And talk more about its caching protocol?
3:59: How does ScaleStore handle these concurrent updates, where two people want to update the same page?
5:16: Cool, so how does anticipatory chaining work and did you consider any other ways of dealing with concurrent updates to hot pages?
7:13: So over time pages get cached, the workload may change, and the DRAM buffers fill up. How does ScaleStore handle cache eviction?
8:57: As a user, how do I interact with ScaleStore?
10:19: How did you evaluate ScaleStore? What did you compare it against? What were the key results?
12:31: You said that ScaleStore is pretty unique in that there is no other system quite like it, but are there any situations in which it performs poorly or is maybe the wrong choice?
14:09: Where do you see this research having the biggest impact? Who will find ScaleStore useful, who are the results most relevant for?
15:23: What are the most interesting or maybe unexpected lessons that you have learned while building ScaleStore?
16:55: Progress in research is sort of non-linear, so from the conception of the idea to the end, were there things you tried that failed? What were the dead ends you ran into that others could benefit from knowing about so they don't make the same mistakes?
18:19: What do you have planned for future research?
20:01: What attracted you to this research area? What do you think is the biggest challenge in this area now?
20:21: If the network is no longer the bottleneck, what is the new bottleneck?
22:15: The last word now: what's the one key thing you want listeners to take away from your research?

Links: SIGMOD Paper | SIGMOD Presentation | Website | Email | Twitter | Google Scholar

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
This is episode nine of our SIGMOD 2022 series, and I'm delighted to say I'm joined today by Tobias Ziegler,
who will be talking about his paper, ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe and RDMA. Tobias is a PhD student at the Technical University of Darmstadt
and his research interests are distributed systems with a focus on fast networks. Tobias,
thanks for joining us on the show. Let's dive straight in. What is ScaleStore?
Yeah, as you mentioned, ScaleStore is a distributed storage engine which exploits DRAM,
NVMe (meaning flash), and RDMA.
I think the most interesting question is why we chose those three technologies.
And when we look at the price development over the last 20 years, we can see that DRAM is actually
not getting cheaper anymore. In fact, prices are stagnating.
In contrast, flash is getting cheaper and cheaper, so there's quite an economic incentive to store cold data on flash.
And that is actually what recent papers suggested.
There are many single node systems such as LeanStore or Umbra, which do exactly this.
They use fast DRAM latencies and basically evict cold data to Flash storage.
And the reason why we also use RDMA is the following:
when the working set continues to grow,
at some point it spills to SSD.
And SSD latencies are not terrible for cold data,
but they are very bad for hot data,
and performance will drop by two orders of magnitude.
And RDMA, which is remote direct memory access, is a fast network technology that allows us to
access remote memory almost as if it were our own local memory, with very fast latencies in the range
of single-digit microseconds. So there's an opportunity to basically distribute your hot
set across multiple nodes and access it with RDMA, and then you have much lower latency compared to reading from SSD.
But you still want to be able to basically evict cold, unused data from expensive DRAM to flash
SSDs. And when we are in a distributed system, I think what is special about ScaleStore is that
we have a caching protocol, or coherence protocol, to keep data consistent.
Can you elaborate on how ScaleStore solves the problems
you just mentioned?
Maybe talk a little bit more about its caching protocol.
Yeah, so I think what ScaleStore mainly focuses on is
the fact that we don't need to keep all data in memory,
but we can access remote data via the caching protocol.
And the caching protocol allows us to, for instance... if you have a page and I have a page, just an ordinary piece of paper,
then instead of me telling you what I need changed on your page, we can just go to Google
Drive, right?
It's a similar metaphor with ScaleStore.
The caching protocol allows us to keep data coherent, and we can work at the same time on our piece of paper,
or in our case now, on the data.
And I think what that helps us with is to dynamically
adapt the data placement to the workload,
meaning that if I access certain pages very often
and you access certain pages very often, we both have them in DRAM.
But the problem is now when you want to make an update,
then somehow I need to know when you want to do this update
to keep data consistent.
That's basically the main gist, I would say.
How does ScaleStore handle these concurrent updates
where two people want to update the same page?
Yeah, so in general, we have this directory-based caching protocol,
meaning that all of the nodes are directories for, or have information about, some of the pages.
And by this information, we mean who currently stores the page in their memory.
And they are called owners in our case.
So if I'm the directory of page one and you have the page currently in your memory, then
I track that you are the owner of my page.
And there are two different ownership modes.
One is shared ownership and one is exclusive ownership. With shared, we can both have
the page in memory at the same time, but we are only allowed to read it.
And if you want it exclusive, then you tell the directory that you want exclusive access and you get it granted.
And then you can modify the page without any interference from other nodes because you have it exclusively. And there's a special case where we need to deal with concurrent updates in a robust fashion, which is when we have a very hot page and we all want to update it.
And this is solved by anticipatory chaining.
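To make the shared/exclusive ownership idea a bit more concrete, here is a minimal C++ sketch of the per-page metadata a directory might keep. The names (DirectoryEntry, OwnershipMode, NodeID) are illustrative assumptions, not ScaleStore's actual interface; the real protocol also has to handle invalidations and in-flight messages.

```cpp
// Hypothetical sketch of per-page directory metadata with the two ownership
// modes described above. Not ScaleStore's real API, just an illustration.
#include <cstdint>
#include <unordered_set>

using NodeID = uint32_t;

enum class OwnershipMode { None, Shared, Exclusive };

struct DirectoryEntry {
    OwnershipMode mode = OwnershipMode::None;
    std::unordered_set<NodeID> owners;  // nodes currently caching this page

    // A shared request succeeds as long as nobody holds the page exclusively.
    bool grantShared(NodeID requester) {
        if (mode == OwnershipMode::Exclusive) return false;  // protocol must resolve this first
        mode = OwnershipMode::Shared;
        owners.insert(requester);
        return true;
    }

    // An exclusive request succeeds only when no other node owns a copy.
    bool grantExclusive(NodeID requester) {
        bool onlyRequesterOwns = owners.empty() || (owners.size() == 1 && owners.count(requester));
        if (!onlyRequesterOwns) return false;  // shared copies would have to be invalidated first
        mode = OwnershipMode::Exclusive;
        owners = {requester};
        return true;
    }
};
```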
So how does anticipatory chaining work?
And did you consider any other ways of dealing with concurrent updates to hot pages?
So imagine we have five nodes and we have one page which is very hot and needs to be
modified by all of the five nodes.
And then we have one directory which is responsible for this page.
And the very naive solution would be busy polling.
Right?
All five nodes go to the directory and say, hey, we want to modify this page
exclusively.
But then only one of us can actually win, because we all want to have it exclusively.
And that would be bad because we need to send many, many messages.
And the nodes which didn't get the page need to retry, meaning they
send even more messages.
But we know from the fact that we want exclusive access that eventually only one can win.
Not all of them can win at the same time.
So the next better solution would be to have a queue on the directory
and order the requests in the order they arrive.
But that actually still has a lot of messages
and we can do better.
And our solution is anticipatory chaining.
So we designed our protocol to support this feature.
And the feature is maybe a bit hard to explain without visuals.
But the intention is that if all of us nodes want to have the page, for instance, you, Jack, and me, we want to have the same page from the directory.
Then you go to the directory and you get the page.
And instead of me going to the directory, I immediately wait at your place for the page to be handed
over. It's a bit like a relay race where the page is handed over from node to node to node
without any additional messages, once we know where the page is initially stored. So we
anticipate the owner which is before ourselves. Kind of difficult to explain without visuals,
but in essence, we have a chain of dependencies
of all five nodes,
and then the page is handed from node to node to node.
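As a rough illustration of that relay-race idea, here is a small C++ sketch of how a directory could chain exclusive requests: instead of every node retrying at the directory, each new requester is pointed at the previous requester and simply waits there for the hand-over. All names are assumptions for illustration; the actual ScaleStore protocol involves more machinery (message handlers, invalidation of shared copies, and so on).

```cpp
// Minimal sketch (assumed names) of chaining concurrent exclusive requests.
#include <cstdint>
#include <optional>

using NodeID = uint32_t;

struct ChainState {
    // The most recent requester, i.e. the node the next requester should
    // wait at for the page hand-over.
    std::optional<NodeID> lastRequester;
};

// Returns where the new requester should obtain the page from: the directory
// itself if nobody is ahead of it, otherwise the anticipated previous owner.
NodeID enqueueExclusiveRequest(ChainState& chain, NodeID directory, NodeID requester) {
    NodeID fetchFrom = chain.lastRequester.value_or(directory);
    chain.lastRequester = requester;  // the next request will wait at this node
    return fetchFrom;
}
```

With this scheme, each hot page travels node to node along the chain without any additional retry messages at the directory, which is what makes the protocol robust under write contention.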
So over time, pages get cached,
and the workload might change,
and the DRAM buffers fill up.
How does ScaleStore handle cache eviction?
Yeah, so in a normal single-node storage engine,
you would just need to find pages you don't want anymore
and write them to SSD.
But since we're distributed,
we have only one location where we can write a page to,
and that is its directory,
meaning that our eviction protocol needs to work
very closely, in concert, with our caching protocol itself,
because we then need to tell the directory
that we actually don't have a copy anymore,
send the page back to the directory,
and then it's eventually written to SSD.
And to make this efficient,
we have an epoch-based LRU approximation.
And in that, we decoupled the tracking of the access times
from the eviction itself, meaning
that the workers track the access time, so when have we accessed the page recently.
And there are dedicated threads, the page providers, which do the eviction to SSD.
And they not only write to SSD, but they also contact the directories of pages for which we
are not the directory ourselves, and then handle the protocol details with the page provider on the directory
to send the page over and have it written to SSD. And I think the key points here are
that we have this mechanism to track the hotness of pages efficiently, where we use epochs instead
of updating a counter or a timestamp every time we access the page; we kind of do it only once in a while.
And then the page provider samples over the hash table,
which is essentially random,
and evicts the coldest X percent it can find in this given sample.
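Here is a minimal sketch of the epoch-based LRU approximation just described: workers stamp frames with the current epoch on access, and a page-provider thread samples frames and picks the coldest ones as eviction candidates. All identifiers are assumptions for illustration, and the actual eviction step (coordinating with the page's directory before writing to SSD) is only hinted at in the comments.

```cpp
// Illustrative sketch of epoch-based LRU approximation (assumed names).
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <random>
#include <vector>

struct Frame {
    uint64_t pageId = 0;
    std::atomic<uint64_t> lastEpoch{0};
};

std::atomic<uint64_t> globalEpoch{0};  // advanced periodically, not on every access

// Worker side: tracking hotness costs one relaxed store, no per-access counter updates.
void onPageAccess(Frame& f) {
    f.lastEpoch.store(globalEpoch.load(std::memory_order_relaxed), std::memory_order_relaxed);
}

// Page-provider side: sample random frames and keep the coldest fraction as
// eviction candidates. Actually evicting them means coordinating with each
// page's directory (send the copy back; the directory writes it to SSD).
std::vector<Frame*> pickEvictionCandidates(std::vector<Frame>& frames, size_t sampleSize, double evictFraction) {
    static std::mt19937_64 rng{42};
    std::uniform_int_distribution<size_t> dist(0, frames.size() - 1);
    std::vector<Frame*> sample;
    for (size_t i = 0; i < sampleSize; ++i) sample.push_back(&frames[dist(rng)]);
    std::sort(sample.begin(), sample.end(), [](Frame* a, Frame* b) {
        return a->lastEpoch.load(std::memory_order_relaxed) < b->lastEpoch.load(std::memory_order_relaxed);
    });
    sample.resize(static_cast<size_t>(sampleSize * evictFraction));  // coldest X percent of the sample
    return sample;
}
```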
As a user, how do I interact with ScaleStore?
That's a good question.
So first of all, we get as a user
is that we have everything transparently,
meaning that ScaleStore will deal with caching
to get the best performance
and also evict cold data to SSD.
And how we interact is currently as a library
and the interface is currently page guards.
It's similar to lock guards
because if we want to modify a page, and the interface is currently page guards. It's similar to log guards because we can,
if we want to modify a page,
we can basically specify what we want to do with it.
For instance, we only want to read it.
Then there is page guards,
which you have shared or optimistic accesses,
or we want to modify it and then we have an exclusive access.
And what this allows is, for instance,
if you have a root node or a root page...
basically if you have a B-tree
and our data is organized in pages, and every page is a node in this B-tree,
then we can get the node ID of the root, or the page ID I should say, of the root,
and then we can basically instantiate such a page guard,
a read-only one, with the page ID of the root, and then we can traverse from there.
And once the page guards are destroyed,
everything is basically unlatched, and it is ensured that the
protocol can still continue without having to deal with unlatching and
everything by hand.
It's actually quite easy to use.
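To illustrate the page-guard style of interface Tobias describes (RAII guards analogous to std::lock_guard), here is a hypothetical C++ sketch. The type names and the shared_mutex-based latching are assumptions standing in for ScaleStore's real latching, page-fault, and ownership machinery.

```cpp
// Hypothetical sketch of the page-guard pattern: the guard acquires the page
// latch in the requested mode and releases it on destruction.
#include <cstdint>
#include <shared_mutex>

using PageID = uint64_t;

struct Page {
    PageID id = 0;
    std::shared_mutex latch;  // stand-in for the real latching/ownership machinery
    // ... payload, e.g. B-tree node keys and child PageIDs ...
};

// Read-only access: holds a shared latch for its lifetime.
struct SharedGuard {
    explicit SharedGuard(Page& p) : page(p) { page.latch.lock_shared(); }
    ~SharedGuard() { page.latch.unlock_shared(); }
    const Page& operator*() const { return page; }
    Page& page;
};

// Modifying access: holds an exclusive latch for its lifetime.
struct ExclusiveGuard {
    explicit ExclusiveGuard(Page& p) : page(p) { page.latch.lock(); }
    ~ExclusiveGuard() { page.latch.unlock(); }
    Page& operator*() { return page; }
    Page& page;
};

void traverse(Page& root) {
    SharedGuard guard(root);  // latch the root shared for a read-only traversal
    // ... read (*guard), pick a child PageID, take the next guard ...
}   // guard destroyed here: the page is unlatched automatically
```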
So how did you go about evaluating ScaleStore?
What did you compare it against?
What were the key results from
your evaluation?
Yeah, I think we had three parts in our evaluation, and we first showed the
characteristics of ScaleStore itself. So we first showed how ScaleStore actually scales with
an increasing number of servers, and next we showed how we handle larger or ever-growing data sets with
different kinds of localities. And there we can clearly see that we can in fact evict data to SSD
which is unused, and locality is very beneficial for this, because we can keep the hot set, which
is then only a fraction of the complete data, in memory, and evict all other data to cost-efficient SSDs.
And the third point of our evaluation consisted of a system comparison for different settings, because ScaleStore can be used both single-node
and distributed.
So we first compared against in-memory only systems,
and there we used just a simple B-tree, which is in-memory only,
and compared against ScaleStore,
and obviously we can see that there's some overhead
due to the tracking of locations, meaning is the page in-memory or is it on SSD?
And then we compared against out-of-memory or in an out-of-memory use case against LeanStore,
which is a recent high-performance storage engine for single nodes.
And we are on par with LeanStore.
And then we finally compared in a distributed setting
against highly optimized in-memory engines.
Because there is no other system quite like ScaleStore,
we couldn't really compare in a fully fair manner.
But we compared against FaRM and NAM-DB,
which are in-memory-only distributed systems.
And the key results, I would say, are
that locality is very beneficial for ScaleStore.
And this is actually not uncommon in databases.
You often access the most recent tuples, for instance.
And then that anticipatory chaining,
the protocol, makes it very robust against write skew.
For instance, if we recall the previous example
where the five nodes accessed the same page,
there we are quite robust compared to our competitors.
You said there that ScaleStore is pretty unique in that there is no other system quite like it.
But are there any situations in which it performs poorly or is maybe the wrong choice?
Yeah.
So maybe I wouldn't say poor, but not optimal in the sense that, for instance, if we only have in-memory data, then we add some overhead because we have this hash table translation where we need to translate page IDs to the in-memory objects.
Whereas a complete in-memory database can access the data directly, via pointer or via RDMA. So we can see this in the evaluation:
when we have a random workload with no locality
and we compare against an in-memory system such as FaRM,
which is highly optimized,
we can see that FaRM is faster in this setting.
But then we get better with more locality,
and eventually outperform it, because we can use local DRAM performance.
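For illustration, this is roughly what the extra translation step mentioned above looks like compared to following a raw pointer; the names are assumptions, not ScaleStore's actual data structures.

```cpp
// Sketch of the PageID-to-frame translation a buffer-managed engine pays on
// every access, which a pure in-memory engine avoids by using raw pointers.
#include <cstdint>
#include <unordered_map>

using PageID = uint64_t;
struct Frame { /* cached 4 KB page ... */ };

std::unordered_map<PageID, Frame*> translationTable;  // PageID -> in-memory frame

Frame* resolve(PageID pid) {
    auto it = translationTable.find(pid);  // extra hash lookup on every access
    // nullptr means a miss: the page must be faulted in from a remote node or SSD.
    return it != translationTable.end() ? it->second : nullptr;
}
```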
But if you only have data which fits in memory
and is not expected to grow,
and you don't have really cold data but only a working set,
then I think you should opt for a highly optimized
in-memory database system which is distributed.
And I think for workloads, random workloads are not optimal
because we profit from the fact that we can cache in DRAM.
But if we need to access a lot of data over the network,
we have higher or larger network transfer latencies
because we organize data in four kilobyte pages.
And this is actually also the unit we send over the network
compared to other distributed in-memory systems,
which only need to send the tuples.
So usually the tuples are much smaller compared to four kilobyte.
Therefore, we have a higher latency cost,
I would say. Where do you see this research having the biggest
impact? Who do you think will find ScaleStore useful
and who are the results most relevant for?
Yeah, that's a good question. Actually,
I hope that everyone finds it useful who is thinking about building distributed systems
with the capability to cache and evict.
But a more realistic assessment would be that currently,
since ScaleStore uses InfiniBand, I think not many groups in the world
can actually use it.
And for industry, that's actually the same,
because InfiniBand is not really widespread in industry.
And also, when you look at the cloud, of the top three cloud vendors, only Microsoft Azure even offers RDMA.
So that kind of limits its use currently.
But of course, we can adapt the protocol to handle arbitrary networks.
But they have different characteristics. For instance, Ethernet has much higher latency,
so it would be beneficial to read from SSD
instead of from remote nodes, and so on.
So you can definitely build a system
which kind of takes the latencies as an input
and then behaves in a smarter way.
Currently, everything is kind of optimized for RDMA.
So what was the most interesting
or maybe unexpected lesson that you learned while building
ScaleStore?
So I think that one of the key findings we had is that caching is really key for good
performance.
Even though networks are getting faster and faster, they are still orders of magnitude
away from DRAM latencies.
And that's not really surprising, but a recent trend is that you
decouple databases, for instance NAM-DB, which then accesses every data item over the network.
And that is actually much slower, so you would need to use caching additionally
to really be on par with highly optimized distributed systems.
And eviction is, from the cost perspective, very beneficial.
And I would say one of the cases where ScaleStore is really good
is if you have hot read skew,
because then we can replicate the hot pages dynamically on all the nodes,
driven by the workload, which
allows us to kind of
really leverage the performance in read-only
workloads.
And other systems struggle with that when
we do not intervene manually.
For instance, the typical
architecture previously was static partitioning
or sharding.
Then we need to know
the queries and the workload beforehand.
And then we can partition it cleverly
and the hot items we can replicate by hand,
but it is not automatic
and done at runtime.
So whilst we're on this
line of thinking,
progress in research
is sort of non-linear, right?
So from the conception of the idea
to the end product, the end goal,
were there things that you tried that failed?
What were the dead ends that you ran into that other people could benefit from knowing about
so they don't make the same mistakes?
Yeah, that's a good question.
So I think what is really beneficial is doing micro-experiments first.
So what we evaluated first was a directory-based protocol versus a snooping-based
protocol,
where we kind of listen to all messages
and need to broadcast to every
participant of our cluster.
But this obviously didn't scale as well as the directory-based
alternative. So that was definitely
one thing where we
kind of saved some time,
because we didn't build the full system, but first did
some micro experiments.
And in general, I think distributed programming is very hard in itself.
And handling all the edge cases is very, very time consuming.
In addition, if you build such a system, which has many interactions between different components
like the page provider, the message handler, the hash table, the protocol itself,
that's quite challenging also from the performance tuning point of view.
And then, additionally, using RDMA and flash is not trivial as well.
They have many strange behaviors sometimes, and they're actually quite hard to use.
So what do you have planned next for ScaleStore, for future research?
Yeah, that's actually a good question, because I will be finishing up soon,
but the vision is kind of
building a distributed OLTP system in the cloud,
because many cloud-native OLTP systems
are still primary-secondary architectures.
And the question is,
why is that actually the case?
Because many of the current production-ready databases
are still primary-secondary,
and we think it's because they were invented in the era of slow networks, where the network
was the main bottleneck.
And in research, there are actually many proposals which have different ideas on how to
build distributed OLTP databases.
And I think there are many open questions.
For instance, there's no verdict on which is better: sending compute to data, or data to compute.
So basically, if you partition, and then you have a query, and you split the query into multiple chunks and send them to all the participants, that is sending compute to data.
And the other approach, as in ScaleStore, is that we fetch the data.
And this has implications for other components.
For instance, two-phase commit.
If you send compute to data, you need two-phase commit.
But actually, if you get the data yourself, you don't need two-phase commit, because only one node is involved in a transaction and it can get all the data.
So you don't need two-phase commit.
So I think ScaleStore is the foundation for the
distributed OLTP system which we want to
build in the future. And I think there are many,
many open questions, for instance how one
designs a concurrency control
scheme without destroying the scalability
of ScaleStore. That is
not as easy.
What attracted you to
this research area? And I know we just touched on it a little bit in the
previous question, but what do you think is the biggest challenge in this area now?
Yeah, that's a good question. I think the biggest challenge in this area is really to come up with an
OLTP system which scales well with the new technologies, such as fast networks.
And what attracted me was actually the fact that
network accesses are so fast now.
That's incredible.
When we think of TCP/IP or something,
which we know as normal users,
and now we have RDMA,
which is in the single-digit microsecond range,
that is quite unbelievable.
Yeah, definitely.
And building distributed systems with RDMA is actually
interesting, because usually, if you build a system, you kind of know your
bottlenecks. For instance, the network has always been the bottleneck, and now that's not the case
anymore. So you often need to adapt the whole design of the system, and then new
bottlenecks will be revealed, because the old bottleneck, which was known, is not the bottleneck anymore.
And then it's quite interesting to design such a new system
because you have very different properties.
If the network's no longer the bottleneck, what is the new bottleneck?
Yeah, I think maybe it's a bit of an overstatement to say the network is not
a bottleneck, because for OLTP, it actually still mostly is.
It's much faster, and you can have different design choices.
For instance, shipping data over the network is now actually a design choice,
which you can really consider.
Previously, that was not the case.
And for analytical databases, I think it's already true that the network is no longer the bottleneck.
For instance, Snowflake uses a completely disaggregated architecture where they read
everything over the network.
And I think what is challenging, and I think your next guest will talk about it, is that
you have many accelerators in different places in the network, and you need to employ caching
very cleverly to kind of get the best performance out of all these accelerators,
which are distributed over the network, and maybe even in the switch, and what else.
I mean, I think that is very challenging.
So it's time for the last word now.
What's the one key thing you want listeners to take away from your research?
So the one key thing is that ScaleStore
is a building block for distributed systems.
I think that's the main key thing.
Fantastic.
And we will end it there.
Thanks so much, Tobias.
If you are interested in knowing more about Tobias' work,
we will put all the links to the paper
and all the other relevant materials in the show notes.
Thanks so much for listening and we will see you next time.