Disseminate: The Computer Science Research Podcast - Michael Abebe | Proteus: Autonomous Adaptive Storage for Mixed Workloads | #7

Episode Date: July 18, 2022

Summary: Enterprises use distributed database systems to meet the demands of mixed or hybrid transaction/analytical processing (HTAP) workloads that contain both transactional (OLTP) and analytical (OLAP) requests. Distributed HTAP systems typically maintain a complete copy of data in row-oriented storage format that is well-suited for OLTP workloads and a second complete copy in column-oriented storage format optimised for OLAP workloads. Maintaining these data copies consumes significant storage space and system resources. Conversely, if a system stores data in a single format, OLTP or OLAP workload performance suffers.

In this interview, Michael talks about Proteus, a distributed HTAP database system that adaptively and autonomously selects and changes its storage layout to optimize for mixed workloads. Proteus generates physical execution plans that utilize storage-aware operators for efficient transaction execution. For HTAP workloads, Proteus delivers superior performance while providing OLTP and OLAP performance on par with designs specialized for either type of workload.

Questions:
0:56: Can you start off by explaining what a mixed workload is?
1:58: What is the challenge database systems face in trying to support these mixed workloads?
3:23: How have previous database systems tried to support mixed workloads?
5:19: What are the design goals of Proteus?
7:23: Can you elaborate more on the architecture of Proteus and how it makes decisions?
8:46: Can you dig into how you predict the transaction latency, what is the mechanism behind this?
10:35: It feels to me that you are accumulating a lot of metadata, this must have some overhead, how does this impact performance?
12:08: It sounds like the Adaptive Storage Advisor is a centralized coordinator, what are the limitations of this design choice?
13:35: Are we in the context of a data center here or can Proteus handle a geo-distributed deployment?
14:34: Changing the storage layout has some implicit cost, how does Proteus decide whether a storage layout change is good or bad?
16:57: How does Proteus predict what the transaction is going to be?
18:46: How did you evaluate Proteus?
20:20: If you had to summarize your work, what is the one key insight the listener can take away?
21:07: Is Proteus publicly available?
21:39: What are the next steps?
22:57: What is the most unexpected lesson you have learned whilst working on distributed database systems?
24:21: Do you think a single system catering for both workload types is better than two specialized engines?
26:10: What attracted you to work on this topic?

Links:
Paper: https://cs.uwaterloo.ca/~mtabebe/publications/abebeProteus2022SIGMOD.pdf
Presentation: https://www.youtube.com/watch?v=qbe29viYTas
Uni of Waterloo Data Systems Group: https://uwaterloo.ca/data-systems-group/

Contact:
Website: https://cs.uwaterloo.ca/~mtabebe/
Email: mtabebe@uwaterloo.ca
GitHub: @mtabebe

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you the latest computer science research. I'm your host, Jack Wardby. I'm delighted to say I'm joined today by Michael Abebe, who will be talking about his ACM SIGMOD paper, Proteus: Autonomous Adaptive Storage for Mixed Workloads. Michael is a PhD student in the data systems group at the University of Waterloo. He's interested in distributed systems, databases, and machine learning. In particular, he's interested in building adaptive and distributed data systems. Michael, thanks for joining us on the show. Happy to be here.
Starting point is 00:00:56 Can you start off by explaining to our listeners, what is a mixed workload? Yeah, that's a great question. So in databases, we typically think of workloads falling into two broad categories, transaction processing and analytical processing. So you can think of transaction processing as something like what you do when you're placing orders on Amazon, right? Adding things to your cart and then clicking purchase and all that order information needs to be recorded in the database. Whereas analytical processing, you can think of
Starting point is 00:01:29 as maybe someone at Amazon who is doing stuff like trying to analyze sales data over the past month, maybe look for trends in order history and stuff like that. So a mixed workload is when a database has both of those two types of workloads present in it. So exactly like that example I said, you'd have people submitting their orders as well as analysts running queries to try to learn information. So what is the challenge database systems face in trying to support these mixed workloads? Yeah, so there's several challenges.
Starting point is 00:02:05 The first is that these workloads often have conflicting goals, right? So you want your analytics to run quickly and you want your transaction processing to support high throughput. And so this leads to natural conflicts in things like concurrency control and storage layouts. So in this work, we were really focused on the conflict that arises from storage layouts. So what people have found historically is that laying out your data in a row format, so storing your entire row of data contiguously before moving on to storing the next one works really well for transactional workloads.
Starting point is 00:02:49 Whereas for analytical workloads, people have found that storing data in columns works really well. IDs in a column contiguously, and then store your users who placed those orders contiguously, and then store the price of the order, right? And so these are conflicting because when you have a workload that has both types of access patterns, it's not easy to know what layout of data that you should choose. How have previous database systems tried to support mixed workloads? How have they tried to combine row stores and column stores? And what are the drawbacks of their previous approaches? Yeah, so prior approaches typically fall into one of two categories. So the simplest approach is to just pick a layout and suffer poor performance on one aspect of the workload. So basically say, well, you know, we only care we care more about transactional processing, so we'll keep it in row format or we care more about analytical processing.
Starting point is 00:03:57 So keep it in columnar format. But the kind of more traditional approach is to have two separate copies of your data, one entirely in a row format where you run your transactions, and then one in a columnar format where you run your analytical queries. And periodically you take your updates from the row format and then apply them in bulk on the columnar format. And so there's two key limitations with this approach. First is the storage overhead of doing this, right? You're storing every data item twice. So you need to pay for your storage to actually maintain those versions,
Starting point is 00:04:35 as well as the cost of actually managing and applying those updates on both copies of data. The second is that if you simply run your analytics on your columnar data where you're only periodically applying your updates, then you're going to be running analytics over stale or data that's missing. And so this means that you're not going to get the freshest insights in your analytics. So if we go back to that example of querying for order patterns over a month, you might be missing the last 15 minutes or last hour or last day, depending on how frequently you apply those updates. So with these drawbacks of previous approaches in mind, tell us about Protus.
Starting point is 00:05:25 What are the design goals of Protus? How do we go about getting good performance for mixed workloads? What's the magic sauce in Protus that allows us to do this? Yeah, so Protus tries to do things adaptively. And what this means is that it's making decisions on the fly about how data should be stored and in what format.
Starting point is 00:05:46 So what that means is that it has the choice to store data in a row format or a column format, or it can choose to replicate the data in both. And importantly, this is not a global decision. It's making this decision on a finer grain granularity of what we call a data partition. So subsets of the data. And Protos is defining what these partitions are
Starting point is 00:06:08 dynamically as well. Right, and so what this allows us to do is that, you know, it allows us to store data that's mostly used in analytical queries in a columnar format, and store data that's mostly used in transactional processing, say your most recent insertions in a row format.
Starting point is 00:06:26 And then the stuff in between, it's kind of keeping data in both formats and doing that dynamically. And what this allows us to do is to kind of take advantage of both worlds. So it can allows us to be, use the latest query processing techniques and tricks that people have come up with over data in columnar format. So things like applying compression and sorting that accelerates those analytical queries without having the trade-off of not having fresh results in our queries, because we also merge our analytical queries over those recently inserted rows as well. So basically, Protos is flexibly storing data, but it's freeing the user from making that choice
Starting point is 00:07:19 because it's doing it itself. Can you elaborate a little bit more on the architecture of Protos and how it actually makes these decisions? Yeah, so Protos has a two-tier architecture. So the one tier is the data sites that are responsible for storing and managing the data. And so this means storing data in the format that's prescribed to those data sites, executing the queries, managing any replication and concurrency control, and then applying updates
Starting point is 00:07:52 to those data either as transactions or changing the storage formats or layouts. Above those data sites, we have what we call an adaptive storage advisor, and this is really the brains of the system. So the adaptive storage advisor. And this is really the brains of the system. So the adaptive storage advisor is going to receive the transactions, decide how it's going to execute them
Starting point is 00:08:10 and where over those data sites. And also it's going to make decisions about how data should be stored and when to make those changes. And what's kind of interesting about our architecture is that this adaptive storage advisor also has some machine learning in it to predict what the workload is going to look like and predict what the latency
Starting point is 00:08:33 of transactions are going to be under different storage layouts so that it can actually make these decisions in an autonomous way. Awesome. So can you dig into a little bit more or describe a little bit more about how you actually predict the transaction latency? What's the mechanism behind that? You mentioned that there was some machine learning magic happening there. Can you elaborate on that, please? Yeah. So Protus decomposes transactions into key operators. So what this means is that if you have a transaction, say, going back to our example of ordering something off Amazon, Protus is going to break that down into core building block operations, like making a network request, acquiring a transaction lock,
Starting point is 00:09:20 waiting for any updates that it needs to ensure it observes consistent state, actually performing the insertion of the row, and then committing that data. And so each of these building block operators, we've defined to have different cost parameters. So, for example, when you're doing an insertion, Protus cares about the size of that record insertion. When you're acquiring a lock, Protus cares about how contended that lock is going to be. And so Protus keeps all these statistics and then predicts these costs using a few different techniques. The first is simple linear regression. The second is nonlinear regression. And the third is shallow neural networks. And so these different costs are all then
Starting point is 00:10:12 combined by summing them, because they're over distinct aspects of the transaction, and to give a latency. And Protus can then record these different observations. And as it's doing this, it's remembering what the layout of the data was that was accessed in those different operators and keeping a different model for each storage layout. It feels to me that you're accumulating a lot of metadata there. There must be some overhead there. How does that affect performance? Yeah, that's a great question. So Protus keeps some of these statistics on a per partition basis, and they're heavily sampled.
Starting point is 00:10:51 And so this is one of the reasons why Protus models have a relatively small number of input parameters, because we don't want to keep tons and tons of history and observations. So Protos keeps a reservoir sample, which means that it's a bounded size of observations for each of these different observations that it's using in training. And so this reservoir is updated as new observations come in.
Starting point is 00:11:25 So over time, you're always keeping your predictions with based on the more recent history. And the other thing that's important here is that when we're doing predictions on the critical path, we focus, we use our predictions on the cheaper models we focus we use our predictions on uh the the cheaper models the linear regression models whereas when we're off the critical path that's when we start to use the more expensive models um one thing that we didn't explore in this work was kind of offloading these models to another machine um but i think that's something that could be done to improve or reduce the performance overheads.
Starting point is 00:12:07 One question I do have on the architecture with the adaptive storage advisor, it sounds as if it's a centralized sort of coordinator over the system. Does that impose any restrictions or what the limitations of that there? Does that kind of constrain our transactions per second, our throughput? Or does each data site have its own adaptive storage advisor? So we went for a centralized adaptive storage advisor. And in some experiments that didn't make it in the paper, but I do have, we found that this was able to sustain over a million transactions per second. So the bottleneck was not the adaptive storage advisor. The bottleneck was actually doing the transactional work
Starting point is 00:12:52 at the data sites. And the core challenge with the adaptive storage advisor is keeping that metadata about where data is stored and in what layout in sync. But, you know, even though we're applying storage layout changes kind of continuously, the rate of change of those compared to the input rate of transactions is a couple orders of magnitude smaller. So we didn't see it become a limitation but again i think you could imagine an architecture where this adaptive storage advisor was partitioned or sharded or replicated if it became a bottleneck yeah so are we are we in the context of a single data center here we are
Starting point is 00:13:39 is is can protus handle like geo replicationation? Does that change the equation a little bit? Yeah, so this work is done in the context of a single data center or at the very most, a single availability zone where the wide area network latency is relatively small. I think that there would be some serious challenges if you move to a geo-distributed, you know, around the world setting,
Starting point is 00:14:07 because of the necessity to go to the centralized adaptation advisor to make these decisions. So that's an environment where you probably would want a distributed advisor to make these decisions. But the, and that would also obviously change some of the calculus of how you choose to execute transactions and what the overheads of replication would be. Obviously, changing the storage layout has some implicit cost as well. How does Protos decide whether that change is a good change or it's actually going to be a bad change?
Starting point is 00:14:47 It's going to be you're going to be wasting work. How does it decide that? Yeah, so this is a great question and I think really cuts to the heart of the work that we did here. And so Protus basically breaks this core decision down into three components. The first and the simplest is, what is the upfront cost to actually perform that storage layout change? And so just like how Protus predicts the latency of transactions,
Starting point is 00:15:13 it can break down the latency of performing a layout change into similar operations that it can learn and predict. And a lot of those operations are actually shared with transactional work. And so we have lots of history that we can use in making those predictions. But once you know the upfront cost of what you're actually going to pay to make the layout change, then the other two components that Protos uses to make its decisions are the expected benefit for transactions that are currently ongoing and those that are predicted to arrive. So Protos basically computes the latency of transactions that would be affected by the
Starting point is 00:15:55 layout change, and it compares the predicted latency under the current storage layout and the layout that will be changed. So for example, if Protus decides to change the storage format from row to column, and the transactions that are accessing that partition are OLAP queries, then the predicted latency of those queries will go down because Protus's cost model and learned cost model of operations like scans and joins are going to indicate that under this new layout that those operations take less time. And so Protus will be inclined to make those layout changes. And so I said that Protus does that for ongoing transactions and transactions that are predicted to arrive.
Starting point is 00:16:51 So Protus basically treats these the same, they just have slightly different weights as to the confidence of what those transactions will actually be. So how does Protus predict what the transaction is going to be? Yeah, so there's two components here. The first is what the latency is going to be? Yeah, so there's two components here. The first is what the latency is going to be, and the second is what the transactions that are going to arrive are. So predicting latency of transactions uses those learned cost models
Starting point is 00:17:16 that we had talked about before and combines the different operators together, including changing the different models if data is stored in a different format. The second component of predicting what transactions are going to arrive uses two different sets of machine learning techniques here, but they're based on the assumption that workloads have recurring patterns, particularly analytical workloads.
Starting point is 00:17:42 So for example, that query that we talked about at the beginning of, you know, what is the sales information over a month, that might be a query that's run as an hourly job or every five minutes as part of refreshing a dashboard. technique known as SPAR or sparse periodic auto regression, as well as another technique where it combines a recurrent neural network with a trend line. And so the key observation here is that periodic behavior often recurs on maybe a known or an unknown timeframe, say hourly, daily, or monthly, as well as it's subject to local trends. So maybe over a month, this information, this query runs more frequently. So by combining a periodic prediction with the trend prediction, Protus can have some
Starting point is 00:18:40 confidence about what transactions or queries are likely to arrive in the system. So how did you evaluate Protus? What benchmarks did you use? Yeah, so the core of Protus's evaluation uses the CH benchmark, and the CH benchmark combines two industry standard benchmarks, TPCC, which is a transactional workload, kind of similar to that example that I was using with ordering things on Amazon, and TPCH, which is an analytical workload and again, sort of similar to that example of a query where you're looking for order sales history over a month. And so this is a common mixed workload that is used to evaluate such systems. And we also did some experimentation with Twitter, where we combined some of the
Starting point is 00:19:37 Twitter core APIs, things like inserting tweets, looking for tweets from followers, with some of the types of queries that you can run over Twitter's Firehose API. And importantly for Twitter, this is not over real Twitter data, it's over generated synthetic benchmark data. And we compared Protus to implementation of a row-only store and a column-only store,
Starting point is 00:20:04 as well as two systems that fully replicate data. One is another research project known as Janus, and the other is an industrial system, TIDB. If you had to summarize your work in one thing what is the one key insight that a listener can take away and apply to their day-to-day life wow that's a tough question uh i will say that the insight that i would take away is by automatically adapting how and where data is stored, in this case, row and column data, and being selective about the storage, you can get excellent performance on mixed workloads where you don't need to sacrifice either aspect of transactional or analytical query performance. Fantastic. So is there plans to make Protus publicly available? Are you going to spin
Starting point is 00:21:06 this out as a new database product or is it purely for research? We're hoping that we can make it available going forward. Some aspects of it are still under review for another conference. So hopefully that gets in and we can make everything publicly available. No plans to commercialise it or anything like that. So hoping that this is something that the research community can build on. What else do you have planned for? So you obviously have this other publication under review at the moment. What are the next steps?
Starting point is 00:21:39 Well, I'm wrapping up my PhD, so I'm not quite sure where exactly my research will go next, but I think that this work in particular has a few potentially really nice extensions. The first is something that we had talked about, investigating how this works in a geo-distributed environment, you know, global, at a global scale, not just, you know, within a data center or an availability zone. And the second is, how does this approach change in a disaggregated environment where storage and compute are decoupled? Again, I think that kind of changes some of the calculus of how data is stored and where
Starting point is 00:22:21 it should be placed. And I think one thing that would be really interesting with that is what could a cloud provider do when they have control of both storage and the database and how could they use these kind of techniques to automatically change the layouts and the formats for a customer without them actually knowing it. What was the most interesting or perhaps unexpected lesson you've learned over the course of your PhD of working with distributed databases and specifically with HTAP database systems? Well, I think one thing that was really interesting is how much work has been done on OLAP and OLTP performance independently.
Starting point is 00:23:12 So for example, there's this really nice book that's available on the design and implementation of modern column stores, right? And it has tons of algorithms and techniques that are used to speed up OLAP queries. And I can't think of an equivalent book for OLTP off the top of my head, but many do exist. But then how little work has actually been done on the challenge of what happens when you have these mixed workloads, right? So not as much time has been spent on, well, how do these techniques apply when you're also dealing with updates
Starting point is 00:23:53 to your analytical data as well? And so I think that this is really an exciting area that hasn't been examined as much and it poses a bunch of challenges because for example, in that book about column stories, so much of it assumes that data is read-only, which just isn't true when you have mixed workloads. So trying to apply these techniques when you have updates was a real challenge. Do you think the best system architecture is a system, a single system that can handle both workloads as opposed to specialized systems connected via some sort of ETL process?
Starting point is 00:24:34 Do you think that the one size almost fits all approaches is better? Yeah, I think it is better because one of the things that I've come back to a lot recently is these quotes from Ed Codd, one of the original relational database people from the 80s, where he says something along the lines of, a relational database or a database in general should just allow users to tell you what to store, you know, what data to store, and what the query that it wants to execute is, and that the system should then make those decisions of how to execute that query, how to store the result by itself. And so these separate systems, what you have to do is then you're kind of pushing that work back up to the developer, who then needs to make a decision. And it's probably an incomplete decision of, well,
Starting point is 00:25:31 I got this query. Where should I run it? Or how do I know that this query is run, and it's going to see the results of this other update that I made when they're in two disjoint systems. So I think systems should provide a nice unified interface that allows users to just worry about what they want to store
Starting point is 00:25:54 and how they want to query it and leave everything else to the system under the covers. And work like mine has shown that it's possible to get good performance out of such approaches. That's fascinating. And one last question. And what attracted you to work in this area? I've been interested in distributed systems and the challenges that come from from that for a long time, starting with some internships in undergrad that just really exposed me
Starting point is 00:26:28 to these really challenging problems. And, you know, databases is, you know, almost every application at some layer requires a database to interact with, and, you know, managing and storing data at scale is, you know, one of the challenges that arises from distributed systems. So to me, it's always the distributed databases is a really nice intersection of challenging problems. And, you know, I think HTAP in particular is just
Starting point is 00:26:58 another example of combining two things that are challenging. So I've always found the intersections of areas to be the most interesting and challenging and unexplored. So yeah, that's what led me to distributed databases and HTAP in particular. Brilliant. Well, I hope this podcast and this interview attracts more people to it as well. So I think we'll end it there.
Starting point is 00:27:23 Thank you so much, Michael. If you are interested in knowing more about Michael's work I'll put links to his paper in the show notes and all of the other relevant materials and thank you for being on the show Michael and we'll see you next time thanks for having me Thank you.
