Disseminate: The Computer Science Research Podcast - Michael Abebe | Proteus: Autonomous Adaptive Storage for Mixed Workloads | #7
Episode Date: July 18, 2022

Summary: Enterprises use distributed database systems to meet the demands of mixed or hybrid transaction/analytical processing (HTAP) workloads that contain both transactional (OLTP) and analytical (OLAP) requests. Distributed HTAP systems typically maintain a complete copy of data in row-oriented storage format that is well-suited for OLTP workloads and a second complete copy in column-oriented storage format optimized for OLAP workloads. Maintaining these data copies consumes significant storage space and system resources. Conversely, if a system stores data in a single format, OLTP or OLAP workload performance suffers. In this interview, Michael talks about Proteus, a distributed HTAP database system that adaptively and autonomously selects and changes its storage layout to optimize for mixed workloads. Proteus generates physical execution plans that utilize storage-aware operators for efficient transaction execution. For HTAP workloads, Proteus delivers superior performance while providing OLTP and OLAP performance on par with designs specialized for either type of workload.

Questions:
0:56: Can you start off by explaining what a mixed workload is?
1:58: What is the challenge database systems face in trying to support these mixed workloads?
3:23: How have previous database systems tried to support mixed workloads?
5:19: What are the design goals of Proteus?
7:23: Can you elaborate more on the architecture of Proteus and how it makes decisions?
8:46: Can you dig into how you predict the transaction latency? What is the mechanism behind this?
10:35: It feels to me that you are accumulating a lot of metadata; this must have some overhead. How does this impact performance?
12:08: It sounds like the Adaptive Storage Advisor is a centralized coordinator; what are the limitations of this design decision?
13:35: Are we in the context of a data center here, or can Proteus handle a geo-distributed deployment?
14:34: Changing the storage layout has some implicit cost; how does Proteus decide whether a storage layout change is good or bad?
16:57: How does Proteus predict what the transaction is going to be?
18:46: How did you evaluate Proteus?
20:20: If you had to summarize your work, what is the one key insight the listener can take away?
21:07: Is Proteus publicly available?
21:39: What are the next steps?
22:57: What is the most unexpected lesson you have learned whilst working on distributed database systems?
24:21: Do you think a single system catering for both workload types is better than two specialized engines?
26:10: What attracted you to work on this topic?

Links:
Paper: https://cs.uwaterloo.ca/~mtabebe/publications/abebeProteus2022SIGMOD.pdf
Presentation: https://www.youtube.com/watch?v=qbe29viYTas
Uni of Waterloo Data Systems Group: https://uwaterloo.ca/data-systems-group/

Contact:
Website: https://cs.uwaterloo.ca/~mtabebe/
Email: mtabebe@uwaterloo.ca
GitHub: @mtabebe
Transcript
Hello and welcome to Disseminate, the podcast bringing you the latest computer science research.
I'm your host, Jack Wardby. I'm delighted to say I'm joined today by Michael Abebe,
who will be talking about his ACM SIGMOD paper, Proteus: Autonomous Adaptive Storage for Mixed
Workloads. Michael is a PhD student in the data systems group at the University of Waterloo.
He's interested in
distributed systems, databases, and machine learning. In particular, he's interested in building
adaptive and distributed data systems. Michael, thanks for joining us on the show.
Happy to be here.
Can you start off by explaining to our listeners, what is a mixed workload?
Yeah, that's a great question. So in databases, we typically think of workloads
falling into two broad categories, transaction processing and analytical processing. So you can
think of transaction processing as something like what you do when you're placing orders on Amazon,
right? Adding things to your cart and then clicking purchase and all that order information
needs to be recorded
in the database.
Whereas analytical processing, you can think of
as maybe someone at Amazon who is doing stuff
like trying to analyze sales data over the past month,
maybe look for trends in order history and stuff like that.
So a mixed workload is when a database has both
of those two types of workloads
present in it. So exactly like that example I said, you'd have people submitting their orders
as well as analysts running queries to try to learn information.
So what is the challenge database systems face in trying to support these mixed workloads?
Yeah, so there are several challenges.
The first is that these workloads often have conflicting goals, right?
So you want your analytics to run quickly
and you want your transaction processing to support high throughput.
And so this leads to natural conflicts
in things like concurrency control and storage
layouts. So in this work, we were really focused on the conflict that arises from storage layouts.
So what people have found historically is that laying out your data in a row format,
so storing your entire row of data contiguously before moving on to storing the next one works really well for transactional workloads.
Whereas for analytical workloads, people have found that storing data in columns works really well. So you'd store, say, all your order IDs in a column contiguously, then store the users who placed those orders contiguously,
and then store the price of the order, right? And so these are conflicting because when you
have a workload that has both types of access patterns, it's not easy to know what layout of data you should choose.
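To make that trade-off concrete, here is a minimal sketch in Python (illustrative only, not Proteus code) of the same orders laid out row-wise and column-wise:

```python
# A minimal sketch (not Proteus code): the same three orders laid out
# row-wise and column-wise.
orders = [
    (1001, "alice", 29.99),   # (order_id, user, price)
    (1002, "bob",   14.50),
    (1003, "carol",  8.25),
]

# Row layout: each record is stored contiguously -- cheap to insert or
# fetch a whole order, which suits OLTP.
row_store = [field for record in orders for field in record]

# Column layout: each attribute is stored contiguously -- cheap to scan
# one attribute across all orders, which suits OLAP.
col_store = {
    "order_id": [o[0] for o in orders],
    "user":     [o[1] for o in orders],
    "price":    [o[2] for o in orders],
}

# An analytical aggregate only touches the "price" column here, instead
# of striding over every field of every row.
total_sales = sum(col_store["price"])
```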
How have previous database systems tried to support mixed workloads?
How have they tried to combine row stores and column stores? And what are the drawbacks
of their previous approaches? Yeah, so prior approaches typically fall into one of two
categories. So the simplest approach is to just pick a layout and suffer poor performance on one aspect of the workload.
So basically say, well, you know, we care more about transactional processing, so we'll keep it in row format, or we care more about analytical processing, so we'll keep it in columnar format. But the more traditional approach is to have two separate copies of your data, one entirely in a row format where you run your transactions,
and then one in a columnar format where you run your analytical queries.
And periodically you take your updates from the row format and then apply
them in bulk on the columnar format.
And so there's two key limitations with this approach.
First is the storage overhead of doing this, right?
You're storing every data item twice.
So you need to pay for your storage to actually maintain those versions,
as well as the cost of actually managing and applying those updates on both copies of data.
The second is that if you simply run your analytics
on your columnar data where you're only periodically applying your updates, then you're
going to be running analytics over data that's stale or missing. And so this means that
you're not going to get the freshest insights in your analytics. So if we go back to that example of
querying for order patterns over a month, you might be missing the last 15 minutes or last
hour or last day, depending on how frequently you apply those updates.
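For reference, a hedged sketch of that dual-copy design; the names and refresh mechanism below are invented for illustration:

```python
# Hedged sketch of the traditional dual-copy design (names invented for
# illustration). Writes land in the row copy and are only periodically
# applied in bulk to the column copy, so analytics can lag by up to one
# refresh interval, and every item is stored twice.
row_copy = []                               # row store: runs transactions
col_copy = {"order_id": [], "price": []}    # column store: runs analytics
pending = []                                # updates awaiting bulk apply

def insert_order(order_id, price):
    row_copy.append((order_id, price))      # OLTP write path
    pending.append((order_id, price))       # queued for the column copy

def bulk_apply():
    # Periodic job: push queued updates into the columnar copy.
    for order_id, price in pending:
        col_copy["order_id"].append(order_id)
        col_copy["price"].append(price)
    pending.clear()

insert_order(1, 10.0)
insert_order(2, 20.0)
# Until bulk_apply() runs, an analytical sum over col_copy["price"]
# misses both orders: the stale-analytics problem.
```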
So with these drawbacks of previous approaches in mind, tell us about Proteus.
What are the design goals of Proteus?
How do we go about getting good performance
for mixed workloads?
What's the magic sauce in Proteus that allows us to do this?
Yeah, so Proteus tries to do things adaptively.
And what this means is that it's making decisions
on the fly about how data should be stored
and in what format.
So what that means is that it has the choice
to store data in a row format or a column format,
or it can choose to replicate the data in both.
And importantly, this is not a global decision.
It's making this decision on a finer grain granularity
of what we call a data partition.
So subsets of the data.
And Proteus is defining what these partitions are
dynamically as well.
Right, and so what this allows us to do is, you know, store data
that's mostly used in analytical queries
in a columnar format,
and store data that's mostly used
in transactional processing,
say your most recent insertions in a row format.
And then the stuff in between, it's kind of keeping data in both formats and doing that
dynamically. And what this allows us to do is take advantage of both worlds. It allows us to use the latest query processing techniques and tricks that people have come up with
over data in columnar format. So things like applying compression and sorting that accelerates
those analytical queries without having the trade-off of not having fresh results in our queries, because we also merge our analytical queries
over those recently inserted rows as well.
So basically, Proteus is flexibly storing data,
but it's freeing the user from making that choice
because it's doing it itself.
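A hedged sketch of that per-partition choice; the Layout type and partition map below are invented for illustration, not Proteus's actual data structures:

```python
# Hedged sketch of per-partition layout choice: each partition
# independently carries its own format, so fresh, insert-heavy data can
# stay in rows while cold, scan-heavy data sits in columns -- or both.
from enum import Enum

class Layout(Enum):
    ROW = "row"
    COLUMN = "column"
    BOTH = "both"      # replicated in both formats

# Partitions of an orders table, keyed by (table, key range).
partition_layout = {
    ("orders", range(0, 10_000)):      Layout.COLUMN,  # old, analytics-heavy
    ("orders", range(10_000, 11_000)): Layout.BOTH,    # warm, mixed access
    ("orders", range(11_000, 12_000)): Layout.ROW,     # fresh inserts, OLTP
}

def layout_for(table, key):
    # Route an access to the layout its partition currently uses.
    for (t, keys), layout in partition_layout.items():
        if t == table and key in keys:
            return layout
    raise KeyError((table, key))
```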
Can you elaborate a little bit more
on the architecture of Proteus
and how it actually
makes these decisions? Yeah, so Proteus has a two-tier architecture. So one tier is the
data sites that are responsible for storing and managing the data. And so this means storing data
in the format that's prescribed to those data sites, executing the queries, managing any replication
and concurrency control, and then applying updates
to that data, either as transactions
or changing the storage formats or layouts.
Above those data sites, we have what we call
an adaptive storage advisor,
and this is really the brains of the system.
So the adaptive storage advisor
is going to receive the transactions,
decide how it's going to execute them
and where over those data sites.
And also it's going to make decisions
about how data should be stored
and when to make those changes.
And what's kind of interesting about our architecture
is that this adaptive storage advisor
also has some machine
learning in it to predict what the workload is going to look like and predict what the latency
of transactions are going to be under different storage layouts so that it can actually make these
decisions in an autonomous way.
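A hedged sketch of that two-tier split; the class and method names are invented, and the bodies are stubs:

```python
# Hedged sketch of the two-tier architecture described above; all class
# and method names are invented for illustration.
class DataSite:
    """Bottom tier: stores partitions in their prescribed formats and
    executes queries, replication, and concurrency control locally."""
    def __init__(self, site_id):
        self.site_id = site_id
        self.partitions = {}            # partition_id -> (layout, data)

    def execute(self, plan):
        ...                             # run the physical plan locally

class AdaptiveStorageAdvisor:
    """Top tier: the brains. Receives transactions, decides where and
    how they execute, and decides when storage layouts change."""
    def __init__(self, sites):
        self.sites = sites
        self.placement = {}             # partition_id -> site_id

    def route(self, txn, partition_id):
        site = self.sites[self.placement[partition_id]]
        return site.execute(txn)

    def maybe_change_layout(self, partition_id):
        # Uses learned predictions of the workload and of transaction
        # latency (discussed next) to change layouts autonomously.
        ...
```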
Awesome. So can you dig in a little more on how you actually predict the transaction latency? What's the mechanism behind that? You mentioned that there was some machine learning magic happening there. Can you elaborate on that, please?
Yeah. So Proteus decomposes transactions into key operators. So what this means is that
if you have a transaction, say, going back to our example of ordering something off Amazon,
Proteus is going to break that down into core building block operations, like making a network request, acquiring a transaction lock,
waiting for any updates that it needs to ensure it observes consistent state,
actually performing the insertion of the row, and then committing that data. And so we've defined each of these building block operators to have different cost parameters.
So, for example, when you're doing an insertion, Proteus cares about the size of the record being inserted. When you're acquiring a lock, Proteus cares about how contended that lock is going to be. And so Proteus keeps all
these statistics and then predicts these costs using a few different techniques. The first is
simple linear regression. The second is nonlinear regression.
And the third is shallow neural networks.
And so these different costs are all then combined by summing them, because they're over distinct aspects of the transaction, to give a latency prediction. And Proteus can then record these different observations. And as it's doing this,
it's remembering what the layout of the data was that was accessed in those different operators
and keeping a different model for each storage layout.
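A hedged sketch of that summed per-operator cost model; the operators, features, and coefficients below are invented, and only the linear-regression case is shown (the paper also uses nonlinear regression and shallow neural networks):

```python
# Hedged sketch of a summed per-operator cost model. Each (layout,
# operator) pair has its own learned predictor; a transaction's latency
# estimate is the sum of its operators' predictions, since they cover
# distinct parts of the work. Coefficients here are made up.

# (layout, operator) -> (weights, bias) of a fitted linear regression
models = {
    ("row", "lock_acquire"): ([0.500], 0.01),   # feature: lock contention
    ("row", "insert"):       ([0.002], 0.05),   # feature: record size
    ("row", "commit"):       ([0.001], 0.03),   # feature: write-set size
}

def predict_op(layout, op, features):
    weights, bias = models[(layout, op)]
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict_txn_latency(layout, ops):
    # ops: list of (operator_name, feature_vector) for one transaction
    return sum(predict_op(layout, op, feats) for op, feats in ops)

# e.g. inserting a 256-byte record under a mildly contended lock:
latency = predict_txn_latency(
    "row",
    [("lock_acquire", [0.2]), ("insert", [256]), ("commit", [1])],
)
```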
It feels to me that you're accumulating a lot of metadata there. There must be some overhead. How does that affect performance?
Yeah, that's a great question. So Proteus keeps some of these statistics on a per-partition basis, and they're heavily sampled.
And so this is one of the reasons why Proteus's models have a relatively small number of input
parameters, because we don't want to keep tons and tons of history and observations.
So Proteus keeps a reservoir sample, which means that it keeps a bounded-size set of observations for each of the statistics that it's using in training.
And so this reservoir is updated
as new observations come in.
So over time, you're always basing your predictions on the more recent history.
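As a reference, here is a textbook sketch of the bounded-size mechanism (generic reservoir sampling, Algorithm R, not Proteus's code; the uniform sample shown here is simpler than a variant that favours recent history):

```python
import random

# Textbook reservoir sample (Algorithm R): the reservoir holds at most
# k observations, and each new observation replaces a uniformly chosen
# old one with decreasing probability, so memory stays fixed no matter
# how many observations arrive.
class ReservoirSample:
    def __init__(self, k):
        self.k = k
        self.sample = []
        self.seen = 0

    def observe(self, obs):
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(obs)
        else:
            j = random.randrange(self.seen)
            if j < self.k:
                self.sample[j] = obs   # evict a uniformly chosen entry
```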
And the other thing that's important here is that
when we're doing predictions on the critical path,
we use the cheaper models, the linear regression models, whereas when we're off the critical path, that's when we start to use the more expensive models. One thing that we didn't explore in this work was offloading these models to another machine, but I think that's something that could be done to reduce the performance overheads.
One question I do have on the architecture with the adaptive storage advisor, it sounds as if it's a centralized sort of coordinator over the system.
Does that impose any restrictions, or what are the limitations of that?
Does that kind of constrain our transactions per second, our throughput?
Or does each data site have its
own adaptive storage advisor? So we went for a centralized adaptive storage advisor. And in some
experiments that didn't make it into the paper, we found that this was able to sustain
over a million transactions per second. So the bottleneck was not the adaptive storage advisor.
The bottleneck was actually doing the transactional work
at the data sites.
And the core challenge with the adaptive storage advisor
is keeping that metadata about where data is stored
and in what layout in sync.
But, you know, even though we're applying storage layout changes kind of continuously, the rate of change of those compared to the input rate of transactions is a couple orders of magnitude smaller.
So we didn't see it become a limitation, but again, I think you could imagine an architecture where this adaptive storage advisor was partitioned or sharded or replicated if it became a bottleneck.
So are we in the context of a single data center here? Can Proteus handle geo-replication? Does that change the equation a little bit?
Yeah, so this work is done in the context of a single data center
or at the very most, a single availability zone
where the wide area network latency
is relatively small.
I think that there would be some serious challenges
if you move to a geo-distributed,
you know, around the world setting,
because of the necessity to go to the centralized
adaptive storage advisor to make these decisions.
So that's an environment where you probably would want
a distributed advisor to make these decisions.
And that would also obviously change some of the
calculus of how you choose to execute transactions and what the overheads of replication would be.
Obviously, changing the storage layout has some implicit cost as well. How does Proteus decide whether that change is a good change, or is actually going to be a bad change where you're wasting work? How does it decide that?
Yeah, so this is a great question and I think really cuts to the heart of the work that we did here.
And so Proteus basically breaks this core decision down into three components.
The first and the simplest is,
what is the upfront cost to actually perform
that storage layout change?
And so just like how Proteus predicts
the latency of transactions,
it can break down the latency of performing a layout change
into similar operations that it can learn and predict.
And a lot of those operations are actually shared
with transactional work. And so we have lots of history that we can use in making those predictions.
But once you know the upfront cost of what you're actually going to pay to make the layout change,
then the other two components that Proteus uses to make its decisions are the expected benefit for transactions that are currently ongoing
and those that are predicted to arrive.
So Proteus basically computes the latency of transactions that would be affected by the layout change, and it compares the predicted latency under the current storage layout and under the proposed layout. So for example, if Proteus decides to change the
storage format from row to column, and the transactions that are accessing that partition
are OLAP queries, then the predicted latency of those queries will go down, because Proteus's learned cost models of operations like scans and joins are going to indicate that under the new layout those operations take less time. And so Proteus will be inclined to make those layout changes. And so I said that Proteus does that for ongoing transactions
and transactions that are predicted to arrive.
So Proteus basically treats these the same; they just have slightly different weights as to the confidence of what those transactions will actually be.
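Putting those three components together, a hedged sketch of the decision; the two predictor functions are stand-ins with invented constants, not Proteus's real API:

```python
# Hedged sketch of the three-part layout-change decision. The predictors
# below are stand-ins for the learned cost models, with made-up numbers.
def predict_change_cost(partition, current, proposed):
    return 5.0    # stand-in: learned upfront cost of the physical change

def predict_latency(txn, layout):
    return 2.0 if layout == "column" else 8.0   # stand-in per-txn latency

def should_change_layout(partition, current, proposed,
                         ongoing_txns, predicted_txns,
                         arrival_confidence=0.8):
    # 1) Upfront cost of physically performing the layout change.
    change_cost = predict_change_cost(partition, current, proposed)

    # 2) Benefit for transactions currently running on this partition.
    benefit_now = sum(
        predict_latency(t, current) - predict_latency(t, proposed)
        for t in ongoing_txns
    )

    # 3) Benefit for transactions forecast to arrive, down-weighted by
    #    the confidence that they will actually show up.
    benefit_future = arrival_confidence * sum(
        predict_latency(t, current) - predict_latency(t, proposed)
        for t in predicted_txns
    )

    return benefit_now + benefit_future > change_cost

# e.g. three running OLAP queries and five predicted ones easily justify
# a row -> column change under these invented numbers:
print(should_change_layout("orders_p1", "row", "column",
                           ongoing_txns=["q1", "q2", "q3"],
                           predicted_txns=["q4"] * 5))
```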
So how does Proteus predict what the transactions are going to be?
Yeah, so there are two components here.
The first is what the latency is going to be,
and the second is what the transactions
that are going to arrive are.
So predicting latency of transactions
uses those learned cost models
that we had talked about before
and combines the different operators together,
including changing the different models
if data is stored in
a different format.
The second component of predicting what transactions are going to arrive uses two different sets
of machine learning techniques here, but they're based on the assumption that workloads have
recurring patterns, particularly analytical workloads.
So for example, that query that we talked about at the beginning, you know, what is the sales information over a month, that might be a query that's run as an hourly job or every five minutes as part of refreshing a dashboard. So Proteus uses a technique known as SPAR, or sparse periodic autoregression, as well as another technique where it combines a recurrent neural network with a trend line.
And so the key observation here is that periodic behavior
often recurs on maybe a known or an unknown timeframe,
say hourly, daily, or monthly,
and is also subject
to local trends. So maybe over a month, this information, this query runs more frequently.
So by combining a periodic prediction with the trend prediction, Proteus can have some
confidence about what transactions or queries are likely to arrive in the system.
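A hedged sketch in the spirit of that periodic-plus-trend combination; the constants are hand-picked rather than fitted, and this is neither the paper's SPAR model nor its RNN:

```python
# Hedged sketch of combining a periodic signal with a local trend, in
# the spirit of SPAR (sparse periodic autoregression); illustrative only.
def predict_arrivals(history, period, k=3, trend_window=6):
    """Predict the next interval's query count from `history`,
    a list of per-interval counts."""
    # Periodic term: average the counts observed one, two, ... k
    # periods ago (the "sparse" lags).
    periodic_lags = [history[-p * period] for p in range(1, k + 1)
                     if p * period <= len(history)]
    periodic = sum(periodic_lags) / len(periodic_lags)

    # Trend term: how much recent intervals deviate from the same
    # intervals one period earlier.
    recent = history[-trend_window:]
    prior = history[-trend_window - period:-period]
    trend = sum(r - p for r, p in zip(recent, prior)) / trend_window

    return periodic + trend

# e.g. hourly query counts with a daily (24-slot) period:
counts = ([10, 12, 11, 50, 52, 49] * 20)[:120]
next_count = predict_arrivals(counts, period=24)
```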
So how did you evaluate Proteus? What benchmarks did you use?
Yeah, so the core of Proteus's evaluation uses the CH-benCHmark, and the CH-benCHmark combines two industry-standard benchmarks: TPC-C, which is a transactional workload, kind of similar to that example that I was using with ordering things on Amazon, and TPC-H, which is an analytical
workload and again, sort of similar to that example of a query where you're looking for
order sales history over a month. And so this is a common mixed workload that is used to evaluate
such systems. And we also did some experimentation with Twitter, where we combined some of the
Twitter core APIs, things like inserting tweets, looking for tweets from followers,
with some of the types of queries that you can run
over Twitter's Firehose API.
And importantly for Twitter,
this is not over real Twitter data,
it's over generated synthetic benchmark data.
And we compared Proteus to implementations
of a row-only store and a column-only store,
as well as two systems that
fully replicate data. One is another research project known as Janus, and the other is an
industrial system, TiDB.
If you had to summarize your work in one thing, what is the one key insight that a listener can take away and apply to their day-to-day life?
Wow, that's a tough question. I will say that the insight that I would take away is that by automatically adapting how and where data is stored, in this case row and column data,
and being selective about the storage, you can get excellent performance on mixed workloads
where you don't need to sacrifice either aspect of transactional or analytical query performance.
Fantastic. So are there plans to make Proteus publicly available? Are you going to spin
this out as a new database product or is it purely for research? We're hoping that we can make it
available going forward. Some aspects of it are still under review for another conference. So
hopefully that gets in and we can make everything publicly available.
No plans to commercialise it or anything like that.
So hoping that this is something that the research community can build on.
What else do you have planned?
So you obviously have this other publication under review at the moment.
What are the next steps?
Well, I'm wrapping up my PhD, so I'm not quite sure where exactly my research will go next,
but I think that this work in particular has a few potentially really nice extensions.
The first is something that we had talked about, investigating how this works in a
geo-distributed environment, you know, at a global scale, not just within a data center or an availability
zone.
And the second is, how does this approach change in a disaggregated environment where
storage and compute are decoupled?
Again, I think that kind of changes some of the calculus of how data is stored and where
it should be placed.
And I think one thing that would be really interesting
with that is what could a cloud provider do when they have control of both storage and the database
and how could they use these kind of techniques to automatically change the layouts and the formats
for a customer without them actually knowing it.
What was the most interesting or perhaps unexpected lesson you've learned over the course of your PhD
of working with distributed databases and specifically with HTAP database systems? Well, I think one thing that was really interesting is how much work has been done on OLAP and
OLTP performance independently.
So for example, there's this really nice book that's available on the design and implementation
of modern column stores, right?
And it has tons of algorithms and techniques that are used to speed up OLAP queries.
And I can't think of an equivalent book for OLTP off the top of my head, but many do exist.
But then how little work has actually been done on the challenge of what happens when
you have these mixed workloads, right? So not as much time has been spent on,
well, how do these techniques apply
when you're also dealing with updates
to your analytical data as well?
And so I think that this is really an exciting area
that hasn't been examined as much
and it poses a bunch of challenges
because for example,
in that book about column stores, so much of it assumes that data is read-only, which just isn't
true when you have mixed workloads. So trying to apply these techniques when you have updates was
a real challenge. Do you think the best system architecture is a system, a single system that can handle both workloads as opposed to specialized systems connected via some sort of ETL process?
Do you think that the one-size-almost-fits-all approach is better?
Yeah, I think it is better, because one of the things that I've come back to a lot recently is these quotes from Ted Codd, one of the original relational database people from the 1970s, where he says something along the lines of,
a relational database or a database in general should just allow users to tell you what to store, you know, what data to store, and what the query that it wants
to execute is, and that the system should then make those decisions of how to execute that query,
how to store the result by itself. And so these separate systems, what you have to do is then
you're kind of pushing that work back up to the developer, who then needs to make a decision.
And it's probably an incomplete decision of, well,
I got this query.
Where should I run it?
Or how do I know that when this query runs, it's going to see the results of this other update that I made, when they're in two disjoint systems? So I think systems should provide
a nice unified interface
that allows users to just worry
about what they want to store
and how they want to query it
and leave everything else
to the system under the covers.
And work like mine has shown
that it's possible to get good performance out of such approaches.
That's fascinating. One last question: what attracted you to work in this area?
I've been interested in distributed systems and the challenges that come from that for a long time, starting with some internships in undergrad
that just really exposed me
to these really challenging problems.
And, you know, almost every application at some layer requires a database to interact with, and managing and storing data at scale is one of the challenges that arises from distributed systems. So to me, distributed databases have always been a really
nice intersection of challenging problems. And, you know, I think HTAP in particular is just
another example of combining two things that are challenging. So I've always found the intersections of areas
to be the most interesting and challenging and unexplored.
So yeah, that's what led me to distributed databases
and HTAP in particular.
Brilliant.
Well, I hope this podcast and this interview
attracts more people to it as well.
So I think we'll end it there.
Thank you so much, Michael.
If you are interested in knowing more about Michael's work, I'll put links to his paper and all of the other relevant materials in the show notes. Thank you for being on the show, Michael, and we'll see you next time.
Thanks for having me.