Disseminate: The Computer Science Research Podcast - Tobias Ziegler | Is Scalable OLTP in the Cloud a Solved Problem? | #23
Episode Date: February 20, 2023. Summary: Many distributed cloud OLTP databases have settled on a shared-storage design coupled with a single writer. This design choice is remarkable since conventional wisdom promotes using a shared-nothing architecture for building scalable systems. In this episode, Tobias revisits the question of what a scalable OLTP design for the cloud should look like by analysing the data access behaviour of different systems. Tune in to find out more! Links: Paper, Website, Email, Twitter, Google Scholar.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host Jack
Waudby. Today we have another installment of our CIDR series and this is a special episode today
because it's our first returning guest and so I'm happy to say we've got Tobias Ziegler back on the
show who will be talking today about his paper, Is Scalable OLTP in the Cloud a Solved Problem? And for those who didn't listen last time,
Tobias is a PhD student at the Technical University of Darmstadt.
And so, yeah, Toby, thank you for coming back on the show.
I'm really happy to be here again.
Brilliant.
So I don't know whether I asked you this one last time,
but we can start off with this.
So can you maybe tell us a little bit more about yourself, for those who don't know you, and explain how you became interested in database research?
Sure. So my name is Toby, as you said, and besides DBMS research I'm actually a pretty big coffee and sports fanatic, so I like to do both of those. And I'm in the process of finishing up; I'm currently writing my thesis, so I will actually not be a PhD student anymore in a bit. And how I got interested in databases was actually a lucky coincidence, because we had two courses in the bachelor's. And the first one was actually super boring. I learned a lot about entity relationship diagrams and SQL and how to model databases, and that I was not really interested in at all. And then my advisor actually taught the second course in Mannheim back in the days. And this got me really interested. It was, in terms of databases, how optimizers work, different indexes, the full stack. And why it was a coincidence is because it was actually planned that the first teacher would give the course again, but he got sick, and then my now supervisor kind of jumped in. So it was basically just coincidental, and then I ended up as a PhD.
Wow, one of those sliding doors moments, right, where life could have taken a completely different path if that didn't happen.
Exactly. Blessing in disguise.
Yeah, that's really cool.
So let's talk about the paper then. There's the saying, right, is it Betteridge's law, that any headline that ends in a question mark can be answered with a no. So I'm guessing the answer to Is Scalable OLTP in the Cloud a Solved Problem?
I'm guessing the answer is no, right?
Yeah, the answer is no.
Can you give us the elevator pitch for this work and some background to it?
So as you actually said, the answer is no, but there's actually a path to getting there.
And the title is essentially a question we ask ourselves.
Is it actually a solved problem?
Because we were actually not sure, right?
Because there's database research for quite some time now.
So we have distributed databases since, I don't know, 70s, 80s, or even 60s.
And for a long time, the shared nothing database was actually the de facto standard when building distributed databases.
But then the cloud came along, and actually the expectations of customers influenced design decisions a lot.
For instance, almost every database is now disaggregated.
And the reason is customers expect some flexibility,
elasticity, but also they store now huge data sets.
So I think actually quite recently with the Socrates paper,
I think Microsoft extended their storage offering
from 4 terabyte to 100 terabytes, which kind of indicates that there's some new maybe workload characteristics going on.
And I mean, when we look at OLAP, right, this has been the case for OLAP almost from the get-go in the cloud, that scalability is like an inherent property of distributed OLAP analytical databases.
But for OLTP, that's actually not so trivial, right, to achieve scalability, because it's
very latency sensitive.
That means distributed OLTP databases have not been as performant as local databases
and customers would expect a similar performance as their local on-premise database when going to the cloud.
It was actually not an easy question to answer.
Yeah, I know.
It sounds like it was a difficult question.
So can we maybe talk a little bit more about the classical approaches then, how people have gone about designing distributed OLTP databases,
and tell us a little bit about what the old taxonomy used to be,
what this classical taxonomy was and what the problems were with it.
Sure.
So with the classical taxonomy, we refer to a paper
which was published almost 40 years ago by Michael Stonebraker,
so one of the grandfathers of databases.
And the paper is actually called The Case for Shared Nothing.
And surprisingly, he made the Case for Shared Nothing Databases.
But he also characterized shared storage, shared nothing, and a shared memory thing,
which nowadays is basically every multiprocessor, so it's not really that interesting anymore.
But shared nothing and shared storage, that was actually very influential and is still very, very
often used as a term. But in these 40 years, many things changed.
So one of the things is actually, as mentioned, the cloud came along and fast networks.
And as I mentioned, many cloud databases are actually disaggregated. So they have
shared storage under the hood. Basically, almost everyone uses
shared storage. And what makes it hard, right, is that the connotation of those terms makes it hard to
talk about scalability. Because for everyone, when you said, okay, it's a shared storage system,
many people assume implicitly that every node can access the
shared storage, as the name suggests, but also can modify it.
That's actually not the case anymore, right?
In cloud databases, there's actually many databases like AWS Aurora or Azure SQL Hyperscale
which use shared storage under the hood, but they still have only a single read-write node. But that's actually the main point: the taxonomy is actually quite hard to use for scalability, because we have these different images in mind when we talk about shared nothing or shared storage, and the cloud kind of blurred these lines quite heavily. There's also, conversely, when we talk about shared nothing, right, we always assume that only one node can access the private storage. But there's also a multi-master architecture, which allows multiple nodes to perform updates. So there are actually quite a few outliers.
So yeah, we've got this sort of taxonomy that does not fit reality anymore. So I know a big one of the contributions in your work is proposing a new taxonomy for us to think about these systems, these OLTP system designs, and it's based on the data access path, right? So this was the sort of approach you took. What was the reasoning behind this, and can you maybe tell us more about this taxonomy?
Yeah, so as mentioned from the title,
we are interested in scalability.
And we ask ourselves,
what is the important characteristic for
OLTP scalability? And at the
end, eventually everything boils down
at the lower level to data accesses.
Are reads and writes scalable?
And how are they scalable?
And to
answer these questions,
we kind of needed to think a bit differently about architectures.
And in terms of data access path, we kind of focused on the read and write paths specifically.
So that's why we had these,
we proposed three different data access archetypes,
which are kind of prototypical architectures,
where we can basically characterize different existing designs into those.
And one of them, or actually the first one is single writer,
which means, for instance, AWS Aurora,
where we have only a single read-write node,
but we can have multiple read-only nodes.
So those are called replicas.
Then we have partitioned writer, which has multiple writers,
but they work on a partitioned
data set.
So everyone has their own autonomy over their own part of the partition or part of the database
and can modify those.
But what it needs to also have, it needs to have some coordination to glue these things
together.
We have two-phase commit or some similar protocols.
And then we have shared writer,
which actually uses a shared storage,
but where nodes can concurrently modify the contents of the shared storage,
and read and write from the shared storage.
So those are basically our three data access archetypes.
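To make the coordination mentioned for the partitioned-writer archetype concrete, here is a minimal two-phase commit sketch in Python. It is a toy illustration under invented names (Participant, prepare, commit), not the protocol of any specific system discussed in the episode; real implementations also log every decision durably for recovery.

```python
# Minimal two-phase commit sketch for a partitioned-writer design.
# Hypothetical Participant class; real systems also log decisions for recovery.

class Participant:
    """One partition owner holding a private slice of the database."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # committed key -> value
        self.staged = None    # writes staged during the prepare phase

    def prepare(self, writes):
        # Phase 1: validate and stage the writes; vote yes/no.
        try:
            self.staged = dict(writes)
            return True       # vote COMMIT
        except Exception:
            self.staged = None
            return False      # vote ABORT

    def commit(self):
        # Phase 2: make the staged writes visible.
        self.data.update(self.staged or {})
        self.staged = None

    def abort(self):
        self.staged = None


def two_phase_commit(writes_per_partition):
    """Coordinator: every touched partition must vote yes, else abort all."""
    participants = list(writes_per_partition)
    votes = [p.prepare(w) for p, w in writes_per_partition.items()]
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"


if __name__ == "__main__":
    a, b = Participant("partition-A"), Participant("partition-B")
    outcome = two_phase_commit({a: {"x": 1}, b: {"y": 2}})
    print(outcome, a.data, b.data)   # COMMIT {'x': 1} {'y': 2}
```

The extra round trip in the prepare phase is exactly the coordination cost that the partitioned-writer archetype pays for multi-partition transactions.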
Nice. So how does this taxonomy perform better
at categorizing existing systems then?
How does, like, what are the benefits?
Because, I mean, a lot of these kind of sound
a little bit similar to kind of what we had before.
Like, how does it plug the gaps, I guess,
is what I'm trying to ask.
Yeah, exactly.
So what we did in the paper,
we kind of had four prototypical workloads:
uniform reads, uniform writes, skewed reads, and skewed writes.
And what we wanted to figure out is how those systems
like on a design level perform.
So we kind of abstracted away from like implementation details
and all these kind of low-level details
and asked ourselves, what is the asymptotic scalability?
And here we mean, if we look at the prototypical workload, for instance, uniform reads, can
this architecture solve those things in a scalable manner?
Can we add more nodes and we increase the performance?
A bit like big O notation, a bit.
And what we wanted to figure out is for those four workloads, which are kind of prototypical,
what kind of architecture performs the best in which sense and why?
And what are the drawbacks of every architecture?
And why they are different is, it's not a really different taxonomy, I would say.
It's orthogonal, right?
Instead of focusing on the storage location, we focus on the data access path. It's not actually meant to replace the old taxonomy,
but kind of highlights on different properties.
Okay, cool. So that makes sense. So it's an additional framework
rather than a replacement. Cool.
So a big key thing, you mentioned it a few times so far,
is that latency is important.
Why is this?
Can we maybe elaborate on what the importance of latency is in these different system designs?
Yeah.
So I think latency in general is important for OLTP because OLTP is known as a latency-sensitive workload. That means if latencies rise a bit, then basically the performance is immediately reduced, I would say.
Let's phrase it maybe differently.
If latency rises, the throughput basically decreases, mostly, because we're interested in per-transaction latency.
And if this goes up, then customers will immediately complain about performance.
And the challenge with that is that because most systems already are disaggregated, that
means we already incur some network overhead, right?
But to kind of mitigate it, we everywhere have caches now, like local caches for buffer
manager and whatnot.
So basically, to tie back your question, why it's important, right,
it immediately is visible to the customer.
And we kind of focused on throughput,
but we have an additional discussion on latency.
And there are multiple ways of, like, reducing latency in the cloud.
We can either push down computation to the storage,
or we can use caches to basically avoid network roundtrips.
We've spoken about the importance of latency
and how we need to consider
latency and throughput together,
and we've got our new taxonomy
here. So how do we go
about answering the
title of the paper then? So what is the
blueprint for a cloud OLTP database
management system? Yeah, because we want to have a fully scalable system, right, we consider writes
and reads. And because we have these four prototypical workloads, we looked at uniform
reads, uniform writes, skewed reads and skewed writes. And obviously skewed writes are super
hard to solve. They are basically impossible to solve
if you have consistency guarantees.
So we ignore this for now.
But then we have skewed reads.
And skewed reads are actually quite important
because if you think about it,
indexes and stuff,
they can be actually quite skewed, right?
You always look up some specific key or things.
So that can happen quite often.
And we analyzed different workloads
in certain databases. And what came out is basically that they are actually quite skewed, or they can be
quite skewed. And there are actually a lot of read-only workloads or queries. And to build such
a system, which is scalable for skewed reads, the question is now, how can we do this? And one of
our archetypes is actually shared writer, but there are two flavors of it. There's one which
basically always goes through to the shared storage, which means that for every access
you have a network round trip, which obviously incurs latency.
So this is not so optimal for latency-critical workloads.
But another thing which is also quite tricky there is
if you have skewed reads and every node
accesses this one single storage node,
then this gets clearly overloaded, right?
You cannot handle the load.
You can replicate it, but that is kind of orthogonal, right?
You can always replicate.
So the question is now, how can you build a system
which can handle skewed reads?
The answer we came up with to this question is,
we want to have caches.
Like at the compute level, we want to have caches.
So we cache the hot read items, but then the question becomes,
if it gets updated, right,
we need to somehow invalidate the cached items. And to really pull this off, right,
we need some coherency protocol. And our blueprint for a scalable OLTP database is basically a shared
writer where every node can actually access the storage to update it, combined with a coherent caching protocol.
And it's actually very similar to MESI.
So like a cache coherence protocol,
which keeps invalidating elements
that get updated, to avoid inconsistencies.
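As a rough picture of the invalidation idea, here is a minimal MESI-inspired directory sketch in Python. It only illustrates the bookkeeping (who caches which page, and who must be invalidated on a write); it is not ScaleStore's actual implementation, and the class and method names are made up.

```python
# Minimal sketch of an invalidation-based (MESI-style) coherence directory
# for pages cached at compute nodes. Illustrative only; not ScaleStore's code.

class Directory:
    """Tracks, per page, which nodes hold a copy and in which mode."""

    def __init__(self):
        self.sharers = {}   # page_id -> set of node ids holding it SHARED
        self.owner = {}     # page_id -> node id holding it EXCLUSIVE

    def read(self, page_id, node):
        # A shared read only has to downgrade an exclusive owner, if any.
        owner = self.owner.pop(page_id, None)
        if owner is not None and owner != node:
            self.sharers.setdefault(page_id, set()).add(owner)
        self.sharers.setdefault(page_id, set()).add(node)

    def write(self, page_id, node):
        # A write invalidates every other cached copy, then grants exclusivity.
        invalidated = self.sharers.pop(page_id, set()) - {node}
        old_owner = self.owner.get(page_id)
        if old_owner is not None and old_owner != node:
            invalidated.add(old_owner)
        self.owner[page_id] = node
        return invalidated   # caller would send invalidation messages here


if __name__ == "__main__":
    d = Directory()
    d.read("p1", "node-1")
    d.read("p1", "node-2")
    print(d.write("p1", "node-3"))  # node-1 and node-2 must drop their copies
```

This is the mechanism that lets every node cache hot read items while writes stay consistent: the update path pays for invalidations only when a cached item is actually modified.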
So obviously you have been working on this sort of,
this line of work for a while.
And I know in episode nine, you came on to tell us a little bit about ScaleStore.
So I'm guessing this is a first step in that sort of direction.
So maybe you can use this opportunity to tell the listener a little bit about Skill...
I always got it wrong last time as well.
I always want to call it SkillStore.
I don't know why.
It's ScaleStore, right?
Yeah.
Exactly.
Yeah.
ScaleStore.
There we go.
Yeah.
So maybe you can tell us a little bit about the design of ScaleStore.
And obviously the listener can go and should go and check out episode nine as well.
So ScaleStore is actually a distributed storage engine.
And at the core
is the coherency protocol.
As I mentioned,
it's very similar to MESI.
Basically,
think about a cache coherence protocol
which is now used
in multi-core CPUs.
We use it
for the buffer manager.
So, meaning we have
hundreds of gigabytes of memory
per node
and if you want to update
some page,
we can invalidate it at other nodes if they are cached
to keep basically data consistent, right?
We don't want to have inconsistent data
because we're a database.
That would be bad for us.
So we basically use an invalidation-based protocol
to update or to keep data consistent.
But the benefit is, right,
it's like everything that you can express in pages, you can put in ScaleStore
because it's basically a page-based design. Instead of a cache line, we have pages.
We keep those consistent. And as a programmer, right, you can just use it
as an abstraction. It feels like a single node. You can basically
put any data structure in it. We, for instance, use the B-tree, but also we
have implemented linked lists and everything.
So you can basically put it in and
because it's like
this nice abstraction, you get for free
basically that it's distributed. So everyone
can basically cache the inner nodes, right?
So we have an index, a B-tree,
and the inner nodes, they are actually rarely
invalidated, but they are very
read-heavy or
basically hot for reads. That means that every
node can now cache the inner nodes and access them very efficiently at memory latency.
And then you also get the benefit that you can handle out-of-memory workloads, because
ScaleStore evicts cold pages to SSD to be very cost efficient. That's basically the rundown. Nice, yeah
and that's a good summary of the
system and as like I said earlier on the
listeners should go and check out episode
9 to learn more about
ScaleStore. So
with this sort of, I guess,
one step towards answering
this question, there are
numerous other research opportunities
that you mentioned in your paper. So let's run through them one by one, and you can tell us about
why they're hard and what the direction might be with solving them. So
let's start off with caching and eviction.
Yeah, so basically maybe we should touch on them briefly and then, yeah, I'll go deeper into caching
and eviction and maybe concurrency.
Sure, yeah.
You can tell us about all of the problems and opportunities that fall out first, and we can
go through them one by one.
Sure, yeah.
So caching and eviction, right, they go hand in hand.
Because if we cache data, basically imagine we have a read-only workload.
We start caching like a bunch of data, and at some point our in-memory cache will be full, right?
Because we have replicated the pages.
Basically, if you and me, we cache the same page, we now have two physical copies of the
page, which basically means that if you, for instance, if you have a capacity of 50 pages,
right, and we cache, or we have one data set which consists of 50 pages, and we both cache
it, our memory is already filled,
right? So that means the opportunity to cache also needs to be tamed, right? You can cache a lot of
stuff, but at some point you need to evict it. And why this is actually quite hard in such a system
is the following. When we cache it, we only decide locally if we want to evict. Basically, there's these nice terms from Weikum, also from the 80s actually,
called egoistic and altruistic, meaning that if we have an egoistic strategy to evict,
for instance, LRU or LFU, we only consider our own state and our own hotness or recency.
If we evict those pages, it can be that it's only a single copy of this page that's in memory currently.
And because the network XSCs are actually a bit faster than going to SSD,
it's actually beneficial if I keep this single page copy in my cache,
because you can read it from there actually quite fast.
But if I evict it, you need to tell me, hey, I want this page,
then I need to read it from SSD and send it over.
So latency is actually quite a bit higher.
But if I have a page which is basically also cached at your in-memory region,
I can evict it actually quite fine, right?
I can still read it over fast networks,
meaning the latency is much better than when we go to SSD.
And to really pull this off, right, we need some global state.
We need to know who has it replicated.
Am I the only one who has this page replicated?
Or is this page replicated in other nodes as well?
And this is quite altruistic.
You can basically just kick out replicas, but it's actually not so good either.
So you need to kind of combine those two strategies in a cost-based manner. There's actually a nice
theoretical discussion about it, also from 30 years ago or something, which discusses exactly
this problem and also proposes a solution. But we implemented it and actually it didn't work so well.
Since then there have been a lot of advancements, right, in computer architecture,
we have multi-core, we have many things. And keeping a global state across several nodes which process millions of transactions per second
is actually quite hard, right? So there, I think, there's a lot of interesting work going on.
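To give a feel for what a cost-based combination of the egoistic and altruistic signals could look like, here is a toy eviction-scoring sketch in Python. The fields, weights, and cost constants are invented for illustration; this is not ScaleStore's actual policy, just the shape of the trade-off Tobias describes (prefer evicting pages that are cold locally and that other nodes still cache, and be reluctant to drop the last in-memory copy).

```python
# Toy sketch: cost-based eviction combining an egoistic signal (local hotness)
# with an altruistic one (how many other nodes still cache the page).
# Weights and latencies are illustrative assumptions, not measured values.

NETWORK_READ_COST = 1.0   # refetching from a peer's cache (fast network)
SSD_READ_COST = 5.0       # refetching from SSD (slow path)

def eviction_cost(page):
    """Expected cost of evicting this page now (lower = better victim)."""
    # Altruistic part: if peers still cache it, a refault is a cheap network read;
    # if we hold the last in-memory copy, a refault has to go to SSD.
    refault_cost = NETWORK_READ_COST if page["replicas_elsewhere"] > 0 else SSD_READ_COST
    # Egoistic part: hot pages are likely to be needed again soon.
    return page["local_hotness"] * refault_cost

def pick_victim(pages):
    return min(pages, key=eviction_cost)

if __name__ == "__main__":
    pages = [
        {"id": "A", "local_hotness": 0.9, "replicas_elsewhere": 0},  # hot, sole copy
        {"id": "B", "local_hotness": 0.2, "replicas_elsewhere": 3},  # cold, replicated
        {"id": "C", "local_hotness": 0.1, "replicas_elsewhere": 0},  # cold, sole copy
    ]
    print(pick_victim(pages)["id"])   # "B": cold locally and still cached elsewhere
```

The hard part, as the discussion goes on to explain, is that the replica count is global state that can go stale, which is where the directory-assisted, piggybacked approach comes in.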
Yeah, that was the one thing that kind of jumped out to me when you were mentioning that. As
soon as I heard the words global state it kind of triggered me a little bit. I was like, hang on a minute, that sounds like a potential bottleneck, right? And that
was what you found in your experiments, right? So I guess, I mean, I presume you're
working on this at the moment, but what are the sort of ways around maintaining
a global state? Are there some ways you can sort of disaggregate or distribute that in some fashion?
Yeah, so one of the things
which is actually already distributed currently are the directories, right? The directories
have kind of global state already. They need to know who has the page cached, because if they are
all cached in shared mode, I need to invalidate them for exclusive mode, because if I want to
update it, right, they need to be invalidated. So the directory itself already has the global
state for every page it maintains, and we can actually use this, right? You tell me,
basically, if I copy a page, you can tell me what the current count is, how many replicas are
currently flying around. If I'm exclusive, right, I know anyway that I'm the only one. If I want a page
shared, right, you can already tell me, okay, there are four copies.
But this can be outdated at some point.
And precisely this is hard to update, right?
Because if someone evicts, they don't want to send messages around the whole
cluster to say, okay, here, this page is now gone from my cache.
Please update your count.
That's not what we want, right? But what you can do is actually,
because I have like a rough notion about
there are four replicas,
I can try to use this outdated information and evict it.
Because the eviction is,
I mean, it's described in the paper,
it's a bit complicated actually
because we use RDMA and stuff.
But how the eviction works in a rough,
like high-level view is,
I ask the directory if I can evict it.
The directory reads the page with RDMA like one-sided from my buffer if it needs to be read.
For instance, if it's dirty.
If not, he just tells me, okay, you can evict it.
There are many reasons why we designed it this way.
But it also helps us to have a second chance, basically.
Because it knows the current state of this page.
I can basically assume that my knowledge is outdated, but I can still make a good decision
about the hotness and coldness of the page to evict, and the directory can then basically
re-evaluate this decision and send it back.
I see, so it's sort of like a lazy... Exactly, yeah. That's a really
good way of framing it. And really, when another node basically reads the page
from me, because I have it cached, right, you can read it from everywhere, then we update the counter.
So we try to converge and let the count not get outdated, by piggybacking information
across multiple messages
we send around anyway.
But if it's really off, the directory can
basically step in and say, here,
this page is not
as cold as you expected. You're the only one
keeping it at the moment. Here are
the current counts.
Please reevaluate the decision.
Nice. And this performs
a lot better? Sorry.
Much better. And the interesting thing is actually
it's not only about the
number of replicas, we also
store the global heat. Basically, if I
evict the page, then I tell you,
if you're the directory, how hot this page
has been when I
last updated it. So you know
a bit more about it, right?
You know how hot is the page?
Does many nodes want to access it maybe in the near future?
So that's why all these things are factored in.
And interestingly, it performs much, much better.
So it performs like twice as good in the aggregate memory case,
because that's the hard case.
The hard case is when our capacity is like 90%
filled. Then we only have like 10% to operate with, to cache and evict our pages, and then basically,
even if the workload fits in memory, right, in aggregate memory, we start to evict to SSD and that
makes it slow again.
Great. So yeah, that's the first
research opportunity you outlined. So the next one is
elasticity. So tell us a little bit more about what we're looking at here.
In particular, elasticity of the storage is interesting
here because this is where we store our pages. In the
compute layer, right, we cache them. And this comes basically for free because you can always, like, get new nodes and let nodes leave.
And then they repopulate their cache and everything is fine.
But currently on the storage side, you cannot easily add new nodes.
And the problem currently, in how it's implemented, is that the page IDs are kind of hard-coded, right?
In the page ID, it basically tells me where the page is stored, like in which slot of
the SSD and which node ID is responsible for it.
And clearly this is not really feasible if this one node goes down, right?
We need to completely re-evaluate and re-tag all the pages.
And note that here, the page IDs, right, they are like pointers.
That means if we use them in some B-trees
or some other data structures,
we need to update all these things.
Not only does
this one page need to be relabeled,
but everything needs to be checked and relabeled.
So that's actually not possible.
And one challenge here,
I think one of the things that is actually quite solved, is that we need
to use consistent hashing to identify which nodes keep the pages.
Because when we use consistent hashing, right, the property is that if one node leaves, we
don't need to reshuffle like everything.
We basically only need to shuffle a few pages to my neighbor and another few pages to this other node, and then to every node gets a chunk of the
leaving node. And the nice thing is that we can update the data structure
which kind of describes our ring, basically, and then we
can basically rehash our hashes, and then we find a new node
on this consistent hashing ring.
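For readers who want to see the remapping property in action, here is a compact consistent-hashing sketch in Python. It uses virtual nodes, which is a common refinement; the class and its behaviour are illustrative assumptions, not how ScaleStore currently assigns pages (which, as discussed, is still a thought experiment).

```python
# Minimal consistent-hashing sketch: page IDs map to storage nodes on a hash ring,
# so adding/removing a node only remaps a small fraction of pages.
# Illustrative only; not ScaleStore's current page placement.

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        for i in range(self.vnodes):         # virtual nodes smooth the distribution
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for (h, n) in self._ring if n != node]

    def node_for(self, page_id):
        h = _hash(str(page_id))
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]            # first node clockwise on the ring

if __name__ == "__main__":
    ring = ConsistentHashRing(["s1", "s2", "s3"])
    before = {p: ring.node_for(p) for p in range(1000)}
    ring.remove_node("s2")
    after = {p: ring.node_for(p) for p in range(1000)}
    moved = sum(before[p] != after[p] for p in range(1000))
    print(f"pages remapped after removing s2: {moved} / 1000")  # roughly a third
```

Only the pages that lived on the departing node move, which is exactly the property that makes storage elasticity tractable.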
And what, in addition, is quite complex is keeping
the state consistent, because what happens if a node leaves? Then it could be at
some point that there are two owners, two directories, so we need to swap all the state
atomically. Okay, cool. So have you explored implementing this in ScaleStore at the moment,
or is this still very much opportunity and you're just sort of thinking it through,
or do you have any sort of experiments you've run with this sort of stuff?
Not really. It's just a thought experiment for now.
No, yeah, cool. But I guess it'd be very interesting to see how it performs.
I guess it's on the roadmap, right?
Yeah, it's on the roadmap.
Cool. So the next opportunity is ACID transactions. Isolation guarantees are my
favorite thing, so tell us a little bit more about this. Is isolation going to be solved and
made simple in this approach? Yeah, tell us all about it.
2PC, we don't need 2PC anymore, right? So that simplifies things. Tell us all about it.
So why we don't need 2PC is basically that we own all the data, right?
If we touch it, it basically is transferred to our own local cache
and we are not shipping compute
or transactions back and forth between different nodes. So we don't need to
kind of vote on a consistent state because we are
basically responsible for all the data we touched.
So we can basically commit locally because we have seen all the data and
nobody else is involved.
That is nice.
But what is actually quite hard is isolation.
So concurrency control is
not a solved topic.
And unfortunately there's only little research from recent years, but a lot from like 30 years
ago, 40 years ago.
But the question is now, can we actually still reuse these things?
In the meantime, there's been a lot of new concurrency schemes, especially in the optimistic
concurrency world.
We have now Silo, which may be a good fit,
but there's so much going on in this world
and I'm actually not a big expert on it.
But what we have found is that
many of these existing strategies,
they use pessimistic locking
and they kind of coupled the locking scheme
with the directory server.
Basically, if you access a page,
you also get the lock for it.
And the question is then, is it too coarse-grained? Because it's page-based,
do we need something more fine-grained? And there's actually a lot of opportunity to do
research. The next opportunity for
research you outlined in the paper, and that is cloud
infrastructure and services. What do we
mean by this? What are the challenges here and the opportunities?
Yeah.
So because nowadays the cloud age or era basically started, right?
We want to think about how can we bring this thing in the cloud?
And so far we have assumed that the compute and storage layers
basically are under our control.
But that's not really feasible often.
Or actually, it's feasible, but it's not cost efficient, because AWS S3, for instance, is much
cheaper. It also offers built-in replication, which would maybe be nice, but on the other hand the
latencies are much higher. So the question is now how to incorporate these deeper memory hierarchies
into such a system, because it's not quite obvious how we do this, right?
Because in the cloud, the SSD is not really,
if it's basically the instance local SSD,
it can be gone, right?
If something really bad happens and the VM is lost,
it could be that we lose the data.
Because SSD is not like a physical thing
where we can go to and we kind of get it out of
our system.
If something is really bad, it could be gone, right?
If we don't use EBS or something like where it's guaranteed that it's back up, it's just
a local instance, then it could be gone.
But on the other hand, S3 latencies are much, much higher compared to like NVMe SSDs,
which are around 80 microseconds.
So the question is now how to balance these things.
And that's actually not trivial.
Because where do we evict it, right?
Where do we evict it?
I mean, if we have like memory, we need to evict SSD,
but then from SSD, we need to go to S3.
Should we like write it to both?
Yeah, so I don't really have a good suggestion here.
And another thing is multi-cloud support.
Okay.
I think multi-cloud support is another thing which is kind of tricky,
because with ScaleStore, I mean, it was a research system.
We assumed basically we have RDMA at our fingertips.
That's not really the case in the cloud, right?
There are actually the three big vendors,
like Microsoft, Google and Amazon.
There's actually only one,
Microsoft, which even offers RDMA.
And that's only for like super expensive HPC instances.
And like an alternative to RDMA is maybe EFA,
but the latency is still 10 times higher compared to RDMA.
So I'm not really sure if it's twice as good as TCP/IP over Ethernet,
but it's still not as comparable to RDMA as we would like.
And then they have different stacks, right?
So you cannot really easily swap your cloud offering
because you need to really adapt source code and stuff.
So the only unifying thing you have is basically TCP/IP.
And that's slower again.
So the question is now,
how do you support these different hardware stacks?
Yeah, that's a really interesting point.
I was having a similar sort of conversation,
with someone asking me about sort of how widely accessible,
how widely available, that's the word I'm looking for,
RDMA is in the cloud offerings.
And I was like, I think some vendors have it and some don't.
But yeah, only Microsoft do it, only for HPC.
That's interesting.
Because I mean, it's kind of a question.
Actually only 5% of the available instance types at Microsoft have RDMA.
I think we have this calculation somewhere.
It's unfortunately very little.
And the reason is that RDMA is quite hard to virtualize.
Right.
So typically these HPC instances, they are basically bare metal instances.
That makes sense.
So, I mean, I guess the kind of question that jumps out at me from that is, where do we go as hardware keeps evolving over time?
I mean, obviously, will this design and these challenges obviously will change as the hardware changes.
But is there anything necessarily on the horizon in the short to medium term that's going to maybe revolutionize things,
or sort of move the goalposts a little bit?
Or do you think that this is probably going to be the best way to design a system for the foreseeable future?
I mean, it kind of feels like the shared nothing approach is kind of dead, but is pretty provably not the right way to go.
Do you think that will always be the case
or do you think there's a way that that might change?
That's hard to answer, actually.
It's a very...
Yeah, it's a...
It's a horrible question.
I apologize.
But I mean, I think the shared nothing architecture
still has a lot of merits.
But given the cloud development, right,
where everything is disaggregated anyway,
it's not really clear if it's like a real traditional shared nothing.
Maybe it's basically a partitioned
writer, more or less, because we still
need to go over the network
to write to the external storage,
which needs to be disaggregated
because of elasticity reasons
and stuff. I think
to get the fastest database,
I think single node is still
the one which provides the best latencies.
And the question is now if you get
much faster SSDs,
then it would be super interesting
because then you can store a lot of data
on a single node
while having more or less fast access.
But until that
happens,
and since networks actually evolve also quite fast,
we can bridge this kind of latency cliff.
We don't really want to go to SSD because, actually,
it has much better latency compared to disk,
but it's not as good as RDMA, for instance.
And we still have the benefit to scale out a lot of reads
and we can handle a huge amount of workloads.
So I think for most workloads,
OLTP workloads, probably a single node is still enough.
But then for the big, big players, basically,
I think the current architecture has maybe its limitations
because they just have a single writer right
Yeah, I mean, you hit an interesting point there about, for most applications,
apart from the very, the minority, and I guess that's probably not going
to necessarily change too much, I wouldn't have thought. The vast majority of,
sorry, the way I'm thinking about it,
the vast majority of workloads are going to stay, like you said, within the realms of, yeah, it's
just run it on a single node, it's fine, right? Like you don't need the stuff the big boys do.
I think an interesting point about it is, is it like 99% of the use cases
or is it 99% of the revenue?
That's a good point.
Because often, I mean, the big players drive,
probably the big players drive the most revenue, right?
So do you want to, yeah, that's interesting.
Yeah, it's probably like,
that's a really good way of thinking about it.
Yeah, because I was just thinking in terms of like,
oh, well, all the workloads are the same, and it's fine. But if you're only getting, I don't know,
a few euros, a few pounds off those, and then you're getting millions or whatever it is off the big
boys, forget the little guys, we'll go and build the system for the big boys. Um, yeah.
Yeah, good business sense. Yeah. Great, anyway. So yeah, my next question is, as a software
developer or DBA or database architect or whatnot, how can I leverage the findings in
your work? And, more generally, what sort of impact do you think this work
can have?
I think as a database administrator, actually, it's quite nice,
because having the opportunity to scale up and down at your fingertips
without having user-defined partitioning is actually quite nice, right?
Because typically you need to specify partitioning,
maybe hand decide which kind of replication you need.
But in our system, because everyone can access every data,
it's actually quite nice, right?
You don't need to have user-defined partitioning
because every node can actually just process every transaction it gets.
It gets some data and that's fine.
Yeah, that's a nice point.
I mean, it's kind of a shift back towards, I guess,
that idealistic goal of
having sort of the highest level of abstraction possible. Because, I mean, a lot
of these distributed databases, the shared-nothing ones, essentially sort of
almost break the abstraction, in the sense that you have to know so much
about your workload and the system underneath to get the optimal partitioning, like you say, or
know the characteristics of the transactions or whatever, to sort of leverage
or get good performance, right? Whereas I guess this is more a step towards that sort of, yeah,
just run it and the magic will happen behind the scenes, right?
Yeah. Yeah, that's cool. So
hopefully having the least amount of work for users
and developers which use the system.
For sure, yeah.
That would be a really nice impact to have.
Cool.
So yeah, the next question I have is,
what's been the most, I mean, so I guess this work,
did this naturally fall out of the ScaleStore work? I mean, you can tell us a little bit more about the backstory of the work,
but what's been the most sort of interesting, maybe unexpected thing you've learned whilst working
on this paper?
I think it ties back to the taxonomy, because it was not really clear from the
beginning, right, that we need a new one. We tried to get along with the old one, but it actually got quite complicated even to communicate with each other.
So when I talked to Phil, I mean, we were not really on the same page often, because we had different imaginations of a shared-nothing database or a shared-disk database, or how it's actually currently done in the cloud and where the line is drawn for the different architectures.
So I think that was the most surprising thing that it was actually super hard to communicate
your idea or our ideas clearly.
And actually, the new taxonomy only made it in after the revision.
So basically because they also had troubles getting the line really drawn out.
And yeah, we thought about it afterwards,
that we maybe need to think about it differently.
Namely, the data access paths
are actually quite important for our goal, so to say.
So that was super, super surprising for me.
Yeah, that is really interesting.
I mean, I guess, let me be honest,
maybe it's a slight tangent, but it's kind of related.
But I know I always sort of struggle when trying to compare different,
just different distributed transaction protocols.
I've spent a lot of time looking into those,
and trying to get the same sort of baseline to compare against is almost sort of difficult as well. And I guess,
like you say, we all have our own view of what the systems would be like, and it's only
when you start communicating with somebody that you realize that you've both got two totally
different mental models of how the world's working. But yeah, no, that's a really interesting
experience you
had. But I guess hopefully now we've got the answer, we've got a nice clean
taxonomy for us all to learn and work on, hopefully. Okay, I guess we've kind of maybe
touched on this next question with answering that, but, I mean, from the initial idea to
the actual publication, what was the
sort of thing you tried along the way that failed? I'm guessing the initial taxonomy was one thing,
right? But yeah, is there anything else that you can share with us on that?
I think, because before we asked Phil Bernstein, I think one of the kind of failures was that we
overlooked a lot of very important work like 40 years ago.
We just didn't know that there was so much work in that direction because they were also labeled differently, right?
For instance, our shared caching system basically was labeled data sharing.
And nowadays, when you think about data sharing, right, you think about the data sharing offers of like Snowflake, for instance, which allow you to share data between different companies.
That's why it's actually quite hard to find this work.
But luckily, Phil had a lot of insight into all these super nice, high-quality research papers.
And that was actually really enriching to read those.
There were a lot of nice ideas.
And interestingly, the papers from back then,
they're so much different to nowadays papers.
Have you read such old paper at some point?
I've read one or two.
Like I've read a few of the ones, like Jim Gray's locking paper
and a few of his stuff.
And I occasionally read a few, but yeah, they're very different, right?
They have like a conceptual depth, right?
They are so deep into the conceptual, like so many pages are written about the conceptual things,
but the evaluation, if they even have one, often they don't actually have an evaluation, right?
Basically the contribution is the conceptual idea, which at one point is super nice, right?
Because those ideas are often very, very nice, like ARIES
or something. And then, I mean, the question is now, do we need to re-evaluate those on modern hardware?
Because sometimes I think some of those are maybe conceptually still sound and work, but it's still
quite challenging to implement them efficiently. And I think part of our current research is also
making things very efficient,
right? I mean, in-memory databases kind of started with that, like we have many new indexes
originating from this research, and I think it's nice to combine old ideas with new,
modern hardware.
Yeah, for sure. I mean, it's interesting what you say there about sort of finding,
like you said, the name might be different, right?
And then you find like 20 other papers that are related to it
and just so many good ideas.
I mean, there is that kind of, I guess, I've mentioned it a few times,
of like everything's already been tried in database.
All the ideas were done in the 70s and the 80s, right?
All the cool ideas have already been done.
Everyone's already done them, right? There's nothing new, they did it in System R and it did work, sort of thing. And it's so interesting to read them, right? I mean,
so many nice, cool ideas, and then, yeah, you're just impressed actually how much work
has been done in that field.
Yeah, it's wild. It really is wild. There's so much to learn, and
when you think you've learned it all there's always about 100 other papers to read.
So yeah, cool. So I guess you mentioned it, it's probably a nice segue into,
like, what's next on the research agenda? You mentioned that you're writing up. Are
you going to stick around in academia and continue with research?
Yeah, I will be staying,
actually, I will be staying in our group as a postdoc for one more year, to kind of see where
the journey will go. We'll probably work a bit more on ScaleStore and then let's see.
Nice. Is the long-term goal for ScaleStore, is it a commercial system, or is it still very much
a research prototype?
Very much a research prototype.
And it's not clear if there's an ambition to do it
as a startup or something.
But I mean,
it's nice to develop such a system.
You learn a lot, right?
You get new ideas
and I think that's a lot of fun.
And I think that's currently
still the main goal.
Yeah, nice.
That's great stuff.
So you're going to stick around and work on ScaleStore. That's awesome. So are there any
sort of specific topics you're going to be researching?
I think mainly the eviction,
concurrency control, and then recovery. For us, I think those are the big points which are still unclear.
Fantastic.
And I mean, my next question I normally ask is,
can you tell your listeners about your other research?
But I guess the ScaleStore project is sort of that big in itself,
that kind of all the research happens within that project, right?
I wish, right? Actually, we have
another paper at SIGMOD now accepted, like just recently accepted. It's about one-sided RDMA
and synchronization primitives, because that's actually something quite interesting, because
there are many papers which use one-sided RDMA nowadays, because it's also a very intriguing
primitive.
And for those of you who don't know one-sided RDMA,
it actually allows you to read and write remote memory almost as if it would be your own memory
with quite low latencies,
like in the range of single-digit microseconds, basically.
You can write four kilobytes into another node's memory super fast.
And interestingly, because
we're database people, we need to synchronize
our stuff, because otherwise you get inconsistencies.
And they provide
atomic primitives,
which work fine,
they synchronize and stuff,
but they are not as efficient.
Similar to in the multi-core CPU,
if you use atomics, C++
atomics, you have a lot of cache line bouncing.
There are different reasons in the network, but I mean, the outcome is still the same.
They don't scale very well.
And there are many reasons why that is the case, which is actually quite interesting.
And we outlined the optimization potential.
And then there are other systems which use optimistic concurrency control schemes,
such as FaRM, like a big system developed by Microsoft, which actually has
a nice working optimistic
concurrency control scheme for RDMA.
So that one is actually working
on x86 at least.
But there are other schemes
which are actually broken. So there are many
publications about optimistic
concurrency control or optimistic
simulation primitives with RDMA, which
actually don't work
that was our surprising
like finding
yeah they don't work
I'm trying to think, on the FaRM stuff there was one
paper I remember reading once, because it's like called
opacity?
Opacity.
A FaRM project, right?
There's a few of them. They've all got,
like, they're not very easy to remember, those projects, because they've
got like acronyms based around.
Exactly.
Yeah.
It's hard to remember their names, but yeah.
And that was, I think, in the transaction protocol, right?
But even on the like latching level, if you use one-sided RDMA in an optimistic fashion without atomics, it's actually quite hard to get the thing right.
Because it's not really specified how they behave.
And I think that's the main gist.
There's no specification around this.
So you need to refer to the underlying protocols, like look at PCIe and stuff, and see how the cache coherence protocol works on x86 and your target destination. So it's actually quite interesting.
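To make the pattern under discussion concrete, here is a highly simplified, single-machine sketch in Python of the version-validated optimistic read that many one-sided RDMA schemes build on. The names are hypothetical and the threading lock stands in for the writer's atomicity; with real one-sided RDMA the read itself may observe a torn, partially written buffer, which is exactly the subtle failure mode Tobias describes.

```python
# Simplified sketch of the optimistic, version-validated read pattern that many
# one-sided RDMA synchronization schemes build on. Single-machine model only;
# over RDMA the read may not be atomic with respect to concurrent writes, which
# is where broken schemes slip up. Names are hypothetical.

import threading

class VersionedRecord:
    def __init__(self, value):
        self.version = 0          # even = stable, odd = write in progress
        self.value = value
        self._lock = threading.Lock()

    def write(self, new_value):
        with self._lock:
            self.version += 1     # mark as being modified (odd)
            self.value = new_value
            self.version += 1     # mark as stable again (even)

    def optimistic_read(self):
        while True:
            v1 = self.version
            if v1 % 2 == 1:       # writer in progress, retry
                continue
            snapshot = self.value
            v2 = self.version
            if v1 == v2:          # nothing changed in between: snapshot is consistent
                return snapshot
            # else: a write raced with us, retry

if __name__ == "__main__":
    rec = VersionedRecord({"balance": 100})
    rec.write({"balance": 120})
    print(rec.optimistic_read())  # {'balance': 120}
```

The scheme only works if the two version reads and the value read are ordered and torn-read-free; proving that over a remote NIC and PCIe is the hard part.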
Yeah, yeah, it sounds it, for sure. Cool, so this next question is an
interesting one, so I'd like to get your thoughts on this. How do you go about generating ideas,
and then specifically selecting which ones to pursue, right? Because it's quite a
difficult task. Because you always do really nice, interesting work. I'd love to know how your process
behind that works.
I think, to be honest, nowadays
it actually comes quite naturally. Because when you build such a system like ScaleStore, you see a lot
of things which don't really work.
So those are for sure
interesting areas you can expand
your work and then basically
where you're interested most. The SIGMOD paper
synchronization thing basically came
because my first
paper, my first SIGMOD paper, used optimistic
synchronization and actually to my
shame it is broken.
Like my own synchronization scheme didn't really work.
Oh, no.
Because it's really subtle, right?
It's so hard to figure out why it's not working.
And the thing is, it's actually quite rare that you see these bugs.
So you really need to test for those.
So in a research setting, that's why also many papers use some broken
optimistic synchronization primitives.
Because it's actually quite hard to figure those things out.
There's a really minor chance that you hit these cases. But, I mean, still, nevertheless, they don't really synchronize, which defies the purpose of synchronization.
Exactly, right. I mean, it might happen one in a million times, but the one in a million times it happens, it's gonna
knacker your whole system, right? And, I mean, you hit on an interesting
point there, like testing and observability of these sorts of issues. I know when I've
worked on developing concurrency control protocols before and stuff, the testing is just
impossible to get right. I mean, it's really, really hard, because even if you prove your
algorithms correct or whatever, right, with a pen and paper, it's very challenging
to do the same with your actual implementation, right?
It's a challenge, big challenge for sure.
Cool.
Yeah, so on challenges,
what do you think is the biggest challenge
in our research area, in databases right now?
Yeah, that's a good question.
And one I actually also struggle a lot with
because I think there is not, I'm not sure if there's
a biggest challenge, but certainly there are important challenges.
And I think many of them are connected to the cloud because one of the things which
is maybe approaching, I'm actually not really sure if that's the case, but I mean, for sure
we have hardware accelerators, right?
I mean, they're currently being developed and also used at some point. And I don't know how we should unify all these different things, right? I mean, it's so hard to
even talk about distributed databases itself. But then if you need to navigate different operators
to different accelerators, that makes it actually much harder, right? You have different... It's not
really sure how this is done. There's no really good primitive for
now to actually instruct those things. And I think Gustavo Alonso said a nice thing, where he said,
basically, electrical engineers, they don't know about abstraction. And, I mean, that's why some of
the things which are super nice, they're actually quite hard to use. We're currently experimenting with a programmable switch which uses P4,
and it's so hard to get it right. I mean, there's so many things, right, so many accelerators or
broken network cards. It's just hard. It's not really clear how to use them in a,
like, in a real system, how to implement all these different things. Then
you have different vendors and stuff. I don't know.
Yeah, it feels like, I guess, the diversity in hardware,
yeah, maybe the heterogeneity, is a massive sort of challenge for sure. Yeah, and
then sort of navigating that is... Yeah.
Yeah, that's why I don't know how to do it actually, because
even if i have two different networks,
as I mentioned before, EFA and RDMA,
which actually sort of work very similar,
they have completely different non-trivial performance characteristics.
You need to specifically tune your system to that.
And I think that ties back to the next point,
is cost efficiency, right?
In the cloud, performance is basically cost efficiency.
More or less like
performance per dollar. If you pay for your network and you don't really utilize it,
it's not so good. Yeah, so that's another challenge, I think, where I think there's a lot of
potential to really optimize our current resources as best as possible.
That's a, yeah, that's a really
good point, like the economics of it start to play into it.
I actually remember the talk recently.
It might have been Victor who gave it.
Yeah, at DaMoN.
Yeah, at DaMoN at SIGMOD, right?
Yeah.
Yeah, I remember reading that.
I'm literally listening to it.
My background, my undergraduate degree was in economics. Me too. Was Victor's also, actually? No way, really? I mean, that's why I was like, oh wow,
that's wild. So yeah, no, I was kind of naturally drawn to that talk, and it's intriguing, right?
Yeah, yeah, for me as well. That's wild. He's a really big proponent of this, like, economic thinking and how to
commoditize different things in the cloud. No, that's really cool. Oh wow. Cool. Yeah, and now
time for the last word. What's the one key thing you want the listeners to take away
from this work, from your work, in this podcast today?
Yeah, I think the goal with our paper was
like a thought-provoking paper,
to basically shift the direction
from only shared-nothing research
to also different architectures.
And obviously, given our paper,
the takeaway needs to be: contribute to ScaleStore,
or at least look at different designs,
how we can advance the field of distributed databases.
Brilliant.
Let's end it there.
That's a great line to finish on.
Thanks so much again,
Toby.
It's great to have you
on again.
And hopefully this is
not the last time
you come on.
We can get you on again
at some point in the future.
And you can keep telling us
about ScaleStore
and all the cool work
you guys are doing.
And as always,
I'll put links to everything
in the show notes.
And we'll see you all
next time for some more
awesome computer science research.