Disseminate: The Computer Science Research Podcast - Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57
Episode Date: July 22, 2024

In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, highlighting the need for a more balanced strategy.

Matt introduces us to a novel strategy that combines the best of both worlds: the rapid scalability of cloud functions and the cost-effectiveness of virtual machines. This approach leverages fast but expensive cloud functions alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly.

Links:
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]
Matt's Homepage

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hi everyone, Jack here from Disseminate the Computer Science Research Podcast.
Welcome to another episode in our ongoing Cutting Edge series.
Today's focus will be analytical query workloads,
which some of you may know are quite hard to predict in terms of their resource demands,
which makes provisioning a bit of a challenge.
And to solve this problem for us, we've got Matt Perron with me.
Matt is a PhD student in the MIT Data Systems Group, and his primary research is on making analytical database systems easier to use. So welcome to the show, Matt.
Thank you. Happy to be here.
Great stuff. So in the tradition of the podcast, we always start off by getting you to tell the listener more about yourself and how you ended up interested in database research, and specifically why analytical databases as well, I guess.
Sure. Yeah. I think I have kind of a non-traditional computer science background.
I mean, none of the people in my family had done PhDs, so this was all kind of new to me.
So I started my career or I guess my education at Rochester Institute of Technology in
Rochester, New York. I would describe myself as a generalist, but with an interest in systems kinds of problems. And furthermore, I had an interest in Japan. So I ended
up studying abroad for a year. And then after that, I actually worked at SoftBank in Tokyo for a few years. And I got plopped into a distributed key value
store project. And I felt wildly underqualified on all of these problems. So after a couple of
years of working there, I decided to go back and get a master's degree with absolutely no intention
of doing a PhD or research or anything. And then at Carnegie Mellon University, I started this
master's degree.
And there I bumped into Andy Pavlo, who's a great database systems researcher and a super nice guy.
And I took his database systems class. He had lectures about these components that I had been using when I was working at SoftBank. And I just thought all of this was great, but it was kind of general database systems stuff at the time. Andy Pavlo convinced me to do a PhD, and I thought doing research was great. And I applied
around, I got into CMU of course, and MIT and a couple other places, but I ended up at MIT and
there, the kind of first projects I worked on were related to analytical database systems rather than
transaction processing or other things. So while I had my first taste of research at CMU, I think I really delved into analytics at MIT. And as you'll see, we worked on this kind of elastic pool technology, or serverless systems, which came a little bit later. And that's part of what we'll
talk about today. So I never really intended to do a PhD, but people kept convincing me to move forward with my education, and now that I'm at the end, it's been a lot of fun.
Yeah, that's an awesome story.
It's always great to see how people kind of end up where they do.
And I'm super jealous about you doing the study abroad
and getting to live in Japan for some time
because that's kind of my one regret when I look back at my own journey. I always wanted to live abroad at some point and I never did it, and I think doing it as part of your studies is a really nice way to do that. Obviously, I've had a few research visits here and there, but it's still something on my bucket list, and Japan's very close to the top of that list, so I'm very envious of you on that one, Matt.
Yeah, I figured at the time, if I don't do this now, I'm never going to do it, so I might as well give it a shot. And I had made good friends there when I studied abroad, so it was nice to return to the city.
Yeah, you've always got somewhere to crash when you go back over there then as well.
Definitely.
Cool. Right, let's get back
on topic then. So we've mentioned a few key terms today, so for the uninitiated listener, let's set some context for the chat. Can you start by telling us in some detail what analytical databases and workloads are, and what are elastic pools?
Sure. So let's start with analytical databases and workloads. Oftentimes, if you've taken an undergraduate database systems class, they tend to focus on transaction processing workloads, so things like bank workloads, where the amount of data that you touch with any individual query tends not to be that much, but you have a lot of
transactions that need to happen all at once. So sometimes in the millions of transactions for some systems,
whereas in analytics, the focus is on extracting value from large volumes of information. So you could imagine terabytes or petabytes of information, with different tables joined together, and then you output a small number of aggregates.
So there's some processing that goes on. You think about the number of queries being lower, but the work of each individual query may be quite large. Because individual queries can consume a lot of resources,
if you kind of mix a lot of these things together, you might expect, well, it smooths out as you
increase the number of queries. But it turns out that individual queries can consume tons of resources all at once. So if you add a lot of analytical queries together, it's not true that
you just end up with kind of a smoothly changing demand curve of resources. You often get huge
spikes as individual queries could touch petabytes of data, require dozens of cores of compute,
lots of memory to process these things in an efficient way.
So analytical workloads have that property that they can be repetitive in that sometimes they're
used for dashboards or regular reporting tools where the workload is very consistent, but
sometimes they're used directly by end users or through tools in ways that will spike the resource
demands of the workload very,
very quickly.
Cool, yeah. So with that then, tell us more about elastic pools, these sorts of resources that cloud providers have and give us access to. Where do they fit into this picture of provisioning resources?
Sure.
I want to draw kind of a contrast between what I'll call like a classic cloud or kind of like
the cloud from 10 years ago view of the world versus what I'll call
elastic pools. So typically, when you thought about the cloud, say a decade ago, you had the ability to provision additional hardware resources, virtual machines, disks, those kinds of things, and you could start them up and shut them down in, you know, an hour, a few minutes, something like that.
But the burden of that provisioning or
that decision was left to the user. You decide how many VMs need to be used. In contrast to that,
I want to describe a set of systems, which I'll call elastic pools of resources. So if you're
familiar with cloud object storage systems like Amazon S3, you don't decide how much storage space you need ahead of time. Just by interacting with the system, by doing gets and puts against the storage system, the system provides you with the hardware resources that you need without you ever having to decide ahead of time what you actually need. Similar to that, there are cloud function services,
which I'll also group in this kind of elastic pool category. So if you're not familiar with
these services,
essentially what they allow you to do is write some piece of code, upload that code to this Cloud Function service, and then you'll invoke the code through the service. And the cloud provider
takes care of all of this provisioning, deciding how many machines to run, exactly where to run.
They handle all of the provisioning and the assignment of tasks to that provisioned
hardware. And this isn't only available from the cloud providers. If you're a large enough service provider, someone like Snowflake, you could potentially run a set of virtual machines in a multi-tenant way such that you could hand them out to users on demand, rather than users having to choose how much they need ahead of time.
And this is true of Databricks, Amazon, Microsoft, et cetera.
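To make that concrete, here is a minimal sketch (my own illustration, not something from the episode) of what invoking a cloud function looks like with AWS Lambda via boto3. The function name and payload shape are hypothetical placeholders; the point is that you upload code once and the provider handles all provisioning whenever you invoke it.

```python
import json

import boto3  # AWS SDK for Python

# Minimal sketch of invoking a cloud function (AWS Lambda) on demand.
# "my-query-task" and the payload shape are hypothetical placeholders;
# the provider decides where, and on how many machines, this runs.
lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="my-query-task",      # code uploaded to the service ahead of time
    InvocationType="RequestResponse",  # block until the function returns
    Payload=json.dumps({"task_id": 42, "input": "s3://bucket/part-0001"}),
)

result = json.loads(response["Payload"].read())
```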
So given this picture, we know the challenges of analytic workloads, and I found it really interesting that you'd almost assume the law of large numbers applies, in the sense that it would become this nice smooth curve, but it's still very bursty at times, which is really fascinating. And then there's this infrastructure we have available to ourselves in terms of these elastic pools. You've published a paper at SIGMOD, hence why we're chatting, called Cackle. So can you give us the elevator pitch for Cackle, given this backdrop?
Sure.
And I just want to come back briefly to the description.
And I think with sufficient volume, even though individual queries can be large, you can smooth these things out, which is why these kind of multi-tenant elastic pools of resources
can still make sense or are still valuable.
But for individual users, if you have to decide how much hardware to provision for your workload, and it's not the biggest workload in the world, this can be really hard.
So the idea of Cackle is: given that the resource demands of individual users' or individual firms' workloads are spiky, difficult to predict, and change pretty rapidly over the day,
how can you get the benefits of elasticity?
That is to say, you never have to decide how many hardware resources you need ahead of time, and at minimal cost.
So it turns out that, and this is some work that we had done prior to this through a project
called Starling, where we just built an analytical database system on elastic pools.
And this works great for what I'll call ad hoc single user
analytical workloads where maybe there's someone sitting at their computer and they want to do
analysis of a moderately large data set and a terabyte, 10 terabytes of data. And they just
want to run a handful of queries every couple of hours or so. For that, you can just run on elastic pool resources, though not exactly in a naive way; you still have to make a bunch of optimizations that are described in that prior work on Starling.
It turns out that if you have anything except this kind of infrequent set of workloads,
or you have a more consistent workload, it can be very, very expensive to exclusively
rely on Elastic Pools of resources.
So the pitch of Cackle is that we can mix in lower cost virtual
machines. So you'll provision some virtual machines that are dedicated to a user. And these tend to be
relatively slow to start up. So say in the range of tens of seconds to minutes, but they tend to
be less expensive on a per unit time basis than an elastic pool is. And this is because there's
no magic in the world.
So you have to pay for this elasticity to be available; someone has to pay. Exactly what the cost gap between a virtual machine and an elastic pool is depends on a lot of factors: economics, how many users there are, the needs of the cloud provider at the time. But there is some kind of
cost difference there. And so how can you gain kind of the benefits of this, the illusion of
this completely elastic hardware while still maintaining kind of the low cost of virtual
machines or hardware resources that are dedicated to individual users? That was the kind of core
idea of Cackle. And it's actually very difficult to do this because all these components change.
Workloads change.
The costs of things can change.
Startup times of virtual machines can change over time.
Or even like end user preferences can change.
So the goal of Cackle was to find a way of gaining the illusion of rapid elasticity while still relying on lower cost resources, and then spilling out into the elastic pools of resources to fill the gaps when you don't have enough hardware resources available.
Well, there are a lot of variables here to try and tackle to do this effectively. So let's get into the details of Cackle and how you went about trying to solve this problem. Maybe start off with the key components of Cackle?
Sure. So if you're familiar with systems like Spark, or Databricks, or Presto,
in general, the way that you execute queries, especially these large analytics queries,
is a user submits some SQL that gets transformed eventually into an execution plan, which is
a directed acyclic graph of kind of stages of compute that are connected. And so you have a
number of these stages, which are connected in a graph, and each of those stages will have a number
of tasks. And in between those stages, I should say, you have to exchange data between compute resources. It's not embarrassingly parallel like some workloads are. You actually have to do this exchange.
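As a rough illustration of that structure (my own sketch, not Cackle's actual data structures), an execution plan can be modeled as a DAG of stages, each with many parallel tasks, with a data exchange wherever one stage consumes another's output:

```python
from dataclasses import dataclass, field

# Toy model of an analytical execution plan: a DAG of stages, each with many
# parallel tasks; data is exchanged (shuffled) between connected stages.
@dataclass
class Stage:
    name: str
    num_tasks: int
    upstream: list["Stage"] = field(default_factory=list)  # stages whose output this one consumes

scan_a = Stage("scan_lineitem", num_tasks=256)
scan_b = Stage("scan_orders", num_tasks=64)
join = Stage("hash_join", num_tasks=128, upstream=[scan_a, scan_b])  # requires shuffling both inputs
agg = Stage("aggregate", num_tasks=8, upstream=[join])               # another exchange, far fewer tasks
```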
So the execution part of Cackle is divided into two pieces. One is the compute side,
where you break up the work that you have to do into manageable tasks that you can assign
to hardware resources.
And then for the communication parts, we have kind of shuffling mechanisms. And that's also a
mix of both virtual machines and an elastic pool. So in this case, Amazon S3 is what we implemented
on. So the kind of core components are a set of VMs that do compute, a set of VMs that are
responsible for shuffling, and then elastic
pool alternatives for both of those. So in the actual implementation of Cackle, we use Amazon
Lambda as our source of elasticity. And for the shuffling, we use Amazon S3. But above that,
there's a controlling component. You can change the number of virtual machines over time. And
that's the core thing that Cackle does. It's trying to figure out: how many virtual machines do you need for compute?
How many virtual machines do you need for shuffling over time? And that's the kind of
core mechanism of Cackle, deciding what the split of virtual machines versus elastic pools is.
And I want to add just one thing there. If you're familiar with systems like Redshift or Snowflake or Microsoft Fabric, in general, when you submit queries and you've submitted too many, queries can back up into a queue and slow a lot of things down. Whereas in Cackle, we're making the assumption that any query that you submit needs to be run right now. You really care about the latency. So our goal is to never delay work, never delay the execution of your query. Instead, we're going to spill out into these elastic components rather than restrict ourselves to only using virtual machines.
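A heavily simplified sketch of that never-delay policy (my own, with hypothetical helpers standing in for the real VM pool and the Lambda integration): try a dedicated VM first, and if none is free, spill the task to the elastic pool rather than queueing it.

```python
# Sketch of the "never delay work" dispatch policy described above.
# run_on_vm and invoke_cloud_function are hypothetical stand-ins for
# the real VM workers and the AWS Lambda integration.

def run_on_vm(worker, task):
    return f"task {task} running on VM {worker}"

def invoke_cloud_function(task):
    return f"task {task} spilled to the elastic pool"

def dispatch(task, free_vm_workers):
    if free_vm_workers:
        # Prefer a dedicated VM: cheaper per unit of compute time.
        return run_on_vm(free_vm_workers.pop(), task)
    # No VM free right now: spill to the elastic pool (e.g. Lambda)
    # instead of queueing the task and hurting query latency.
    return invoke_cloud_function(task)

print(dispatch("q1-stage0-task3", ["vm-0"]))  # runs on the VM
print(dispatch("q1-stage0-task4", []))        # spills to the pool
```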
Nice. Cool. Yeah. So those systems have this admission control sort of scheme, where you put things in a queue. I'd like to get into this a little later in the podcast, but the question I'm probably going to ask you at some point, and I need to get it out of my head now, is: if you allowed a user to assign priorities to the queries that are coming in, to say, okay, this one's a lower priority, could that change this whole mechanism, as in, okay, I don't need to schedule so much now? That would be another variable to factor in as well, or is that really not something? I guess you're relaxing that constraint of "this needs to execute now", which maybe makes things easier or harder. I'm not really too sure.
So we don't cover this particular case of prioritization in Cackle, and we can come back to that a little bit later.
But especially in the kind of future work sections, I have some thoughts on interesting problems in that space. But the reason that we defer this prioritization is the idea that
even though hardware resources in the cloud are somewhat slow to start up, we're talking about
queries in general that may last tens of seconds to a few minutes. So if you have to delay their
execution until hardware resources become available, that can significantly impact query
latency. So if you're not very latency
sensitive, you can always wait for virtual machines to start up. So we don't cover those
cases because you could just cover them by waiting for these resources to start up. That's not
something we explicitly cover in Cackle, but I think it's an interesting direction for future work.
Cool. Sure. So you said that the key secret sauce of Cackle is deciding how many of these VMs and machines to provision for each of the different categories you've got: for compute, for shuffling, and for the elastic pool. So tell us about the different approaches you considered for solving this problem of allocating resources to these different
categories. Sure. So I want to say like upfront that, you know,
the first thing that you'd consider doing in these cases is just kind of predict what your workload
is going to be ahead of time and try to ensure that hardware resources are available. But I want
to come back to the kind of background that we talked about. And a lot of the queries that users
will submit are, you know, there's a human at the other end and they may consume a ton of resources.
And we actually observed this in real workloads that we gathered from
Microsoft, Alibaba, and Google.
Some of those are publicly available and some were just shared with us that the hardware
resource demands can spike two, three times, four times in the span of seconds.
So if these things are human generated, at the end of the day there's no predictive algorithm that's going to tell you, oh, the human's going to hit enter in the next five seconds. It's just not going to happen. So you can throw out the perfect predictor assumption right away; you can't predict exactly what your needs are going to be ahead of time. So that's the first thing
that was kind of dismissed. So we tried a bunch of different approaches, but we found kind of
surprisingly, that because in Cackle we never delay work, you can measure essentially what your hardware resource demands were looking back into the past. You can figure out exactly what your needs were. And we use that to inform what the best allocation
strategies are going to be moving forward. So what I mean by an allocation strategy is you'll look
back into this workload history, see what your hardware resource demands were, and then you're
going to choose from a number of extremely simplistic strategies
to find one that minimizes the cost. So in the case of Cackle, we're going to assume that the
elastic pool resources that we have available and the virtual machine resources that we have
available are going to be equal in performance, and we're only going to focus on cost. And we
try to design the system when we actually go to the implementation
such that that's as close to true as possible. So what we actually do in Cackle is we take
strategies of the form, look back some number of minutes or seconds into the past,
try to take the nth percentile, some percentile of that workload, and then a multiplier.
And the intuition there is that you don't want to only ask, what's the maximum I've used in the past? Sometimes you actually want to over-provision in these cases. So we just take whatever the minimum cost strategy was over that period, and that's the strategy we choose moving forward. There is a delay in starting new virtual machines, and there are some restrictions
on exactly how this operates.
But essentially, the mechanism is figure out what strategy worked in the past and choose that moving forward.
What's interesting about this is that even if you make a mistake, if you make a mistake and you didn't provision enough, this becomes a cost problem.
It's not a performance problem because you can always spill out into this elastic pool of resources.
We conceptualize it as something that's essentially infinitely elastic with a fixed cost.
The only difference is how much is each individual strategy going to cost.
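Here is a small sketch of that idea (my own reconstruction from the description, with made-up cost numbers): replay the recent demand history under each simple (lookback, percentile, multiplier) strategy, charge un-provisioned demand to the more expensive elastic pool, and keep the cheapest strategy.

```python
import itertools

# Reconstruction of the strategy-selection idea described above; the cost
# constants and strategy grid are made up for illustration.
VM_COST, ELASTIC_COST = 1.0, 3.0  # per core-second; the elastic pool costs more

def strategy_cost(history, lookback, pct, mult):
    """Cost of a (lookback, percentile, multiplier) strategy over a demand trace."""
    total = 0.0
    for t in range(lookback, len(history)):
        window = sorted(history[t - lookback:t])
        provisioned = mult * window[int(pct / 100 * (len(window) - 1))]
        total += provisioned * VM_COST  # pay for provisioned VMs regardless
        total += max(0, history[t] - provisioned) * ELASTIC_COST  # spill to the pool
    return total

# Bag of simple strategies: (lookback seconds, percentile, multiplier).
strategies = list(itertools.product([60, 600], [50, 95, 100], [1.0, 1.5, 2.0]))

def choose_strategy(history):
    return min(strategies, key=lambda s: strategy_cost(history, *s))

demand = [8, 9, 10, 40, 12] * 200  # toy per-second demand trace, in cores
print(choose_strategy(demand))
```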
So what I've described is kind of the simplistic version of what we do.
In fact, we use some kind of fancy randomized algorithm, this multiplicative weights algorithm,
which I probably don't want to talk about too much today. But essentially what it's doing is trying to find minimal cost strategies over this workload history. And that way we can minimize the cost of the workload going forward.
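For completeness, here is a generic multiplicative-weights sketch (the textbook algorithm, not necessarily Cackle's exact formulation), reusing the strategies and strategy_cost from the sketch above: each strategy keeps a weight that decays with the cost it would have incurred on recent history, and the next strategy is sampled in proportion to the weights.

```python
import random

# Generic multiplicative-weights update over the bag of strategies; a textbook
# sketch, not necessarily Cackle's exact rule. Assumes `strategies` and
# `strategy_cost` from the previous sketch.
ETA = 0.1  # learning rate
weights = {s: 1.0 for s in strategies}

def update_and_pick(recent_history):
    costs = {s: strategy_cost(recent_history, *s) for s in strategies}
    worst = max(costs.values()) or 1.0
    for s in strategies:
        weights[s] *= (1 - ETA) ** (costs[s] / worst)  # penalize costly strategies
    # Sample a strategy in proportion to its weight.
    r = random.uniform(0, sum(weights.values()))
    acc = 0.0
    for s, w in weights.items():
        acc += w
        if acc >= r:
            return s
    return s
```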
Yeah, there are a few variables that jump out as you were talking. How do you decide how far to look back, for example?
So basically, we just have a bag of strategies. Some strategies will look back one minute, some strategies will look back 10 minutes. And the intuition behind
this is that virtual machines can take variable amounts of time to start up. So if you assume that
virtual machines are going to start instantaneously, then you probably never have to use an elastic
pool ever because they'll be available as soon as you request them. But the reality is they usually
take a minute, or they could take five minutes, and exactly what that time is will impact how far back into the past you probably want to look. Like if it takes a day to get new virtual
machines, then you probably want to decide based on looking back a week or so.
And the intuition behind Cackle is that no matter which of these factors change,
no matter how your workload changes or the cost change or virtual machine startup time change or
cost models even, or like minimum billing times, these kinds of things,
Cackle will choose one of these very simplistic strategies that minimizes the cost over time.
And I think potentially you can improve by adding to your bag more sophisticated strategies, predictive strategies, etc.
But what was surprising about Cackle is that even with this kind of extremely simple bag of strategies, you can get pretty close to optimal.
And I'm sure we'll come back to that in the results; not optimal, but, let's say, close to what an oracle could do.
Yeah, that's really cool.
I mean, obviously it depends on how dynamic your workload is, because you're looking back a couple of minutes, depending on the strategy, and you've got this work coming in. How often is it the case that the one you've chosen is immediately rendered not the optimal choice, essentially because the workloads just change so fast? And is there a limit to how fast your workload can change? Because you're always looking back to make a judgment about the future, right, and in the future you can get these black swan events. How does that factor into it as well? Or is it just that most of the time it's close enough to optimal that, over the long term, it works out quite well?
So we tried with a number of different workloads.
We make an assumption that the workload is more or less consistent, but you could imagine a situation where you're wildly under-provisioning for the workload that you have. There's some big event that changes things and suddenly you've under-provisioned. That's going to start costing a lot of money because you're going to
spend money on these elastic pool resources that you would have been better served by having a
virtual machine. So quickly, a strategy that matches your new workload is going to become
the least expensive among those strategies. And
you'll switch to provisioning for this new world. That being said, in this paper, we didn't
investigate kind of these rapid workload shifts that much. I think there's probably some work
that needs to be done to ensure that for all of these kinds of changing situations, you improve.
But what was surprising is that for a wide range of workloads, even these very simplistic strategies worked.
Yeah, cool. There's beauty in simplicity, Matt.
So yeah, cool. So let's talk about evaluation in some more detail then. So first of all,
how did you go about evaluating all these different strategies and everything? What
was your experimental setup? Sure. So we really broke the evaluation up into two separate parts. I mean, the kind of
conditions that exist in real cloud environments don't change that rapidly. The cost of, say, a spot instance virtual machine on AWS could change over the course of a month or so, or even a couple of days. But you don't have control over
these things. So the first thing that we did is we built an analytical model. It's just something
basically we wrote in Python that tries to model all of the important components of the system.
And we took kind of an off-the-shelf set of queries, TPC-H queries, and also some TPC-DS
queries. So these are standard analytical database workload benchmark
queries. And then we changed, say, the arrival rates of these queries. So we wanted to model
what real workloads look like. So real workloads have kind of these unpredictable spikes, as we
described, as well as kind of regular components where every single day or maybe every single hour,
certain queries will arrive. So we tried to vary the workloads as much as possible, changing the period of these peaks,
changing how much randomness, changing how many queries were in there. And then we also changed
the environment in which we assumed that these things were executing. So changing the startup
time of virtual machines, the cost of virtual machines relative to using an elastic pool, and all sorts of these factors. I don't think all of those experiments made it into the paper. But in general, we tried to change as many things about the workload and the environment as possible to ensure that this Cackle strategy was robust. You know, you can't cover every single situation, and of course there are always edge cases, and there are always adversarial cases where you're not going to do the best. But in general, we try to make sure we cover as wide a range of reasonable scenarios as possible.
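As a toy illustration of that kind of workload generation (my own sketch, not the paper's actual model): a regular periodic arrival rate plus occasional random spikes, which you can then vary in period, amplitude, and randomness.

```python
import math
import random

# Toy spiky-workload generator in the spirit described above (not the
# paper's actual model): a regular daily component plus random spikes.
def arrival_rate(t_seconds, period=86_400, base=5.0, spike_prob=0.001):
    diurnal = base * (1 + math.sin(2 * math.pi * t_seconds / period))
    spike = random.choice([10.0, 20.0]) if random.random() < spike_prob else 0.0
    return diurnal + spike  # queries per second at time t

rates = [arrival_rate(t) for t in range(3600)]  # one simulated hour
```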
Cool. Yeah. And what were the results from the analytical model? Did they give you good hope that this was the right thing to be doing before then going and actually implementing it for real? I guess it was a litmus test in that sense, the analytical model.
But yeah, what were the results from it?
So the results essentially concern the set of strategies that we compared against. We compared against some relatively simplistic strategies, like fixed provisioning, which is probably what you would do if you're deciding how many virtual machines to run: you just fix a number of virtual machines and continue that over time. We also tried simple predictive models, like a linear fit to the last few minutes, to try to show how predictiveness doesn't work particularly well. And what I want to convey is that if you take
into account the important components, if you take into account workload shifts, the cost of these resources, which is one thing that is not often considered, like what is the
cost differential of doing these things, then you probably will end up at a reasonably good strategy.
I think there are improvements to Cackle that could be made that would push us a little bit cheaper. But overall, compared with reasonable baselines, we found that Cackle was the best performing among them. And furthermore, we also added an oracle, which has perfect knowledge of the workload into the future and can allocate hardware resources with that future knowledge when demand spikes. And Cackle was pretty close to this oracle in a wide range of these scenarios. So that's kind
of why we stopped the optimization there. We didn't want to kind of polish too hard.
So did you compare against real systems, Matt? How did Cackle fare against what's out there at
the moment? Yeah. So it turns out that we're not the first people to notice that analytical
workloads change over time. And we compared against the main commercial options in the space, people like Snowflake, Databricks, and Redshift. And essentially what all of those systems did at the time of our evaluation was
they'll kind of wait until queries back up into a queue and then they'll provision
you additional clusters of compute in addition to that. So you kind of get this kind of stepwise
increase and then decrease in those systems. But the issue with that is because they wait until
queries queue, the latency during those periods of queries can spike really rapidly. So what we
find when we compare against these real systems is that, well,
these commercial systems have the property that they're either really expensive because they'll
be over-provisioning at times when they don't need to be, or when workload spikes, they tend to
have very high latency, whereas Cackle is able to maintain really consistent performance and cost
across a range of these scenarios,
especially as you increase your workload. So it doesn't suffer the problem of missing these
demand spikes because you can fill in these gaps. And furthermore, it doesn't keep a bunch of
resources around when you don't actually need them. It tries to make a kind of reasonable
provisioning decision.
How likely do you think it is that Cackle could be integrated into one of these existing commercial systems?
You know, all of the big providers in this space, people like Snowflake, Microsoft, Amazon, Databricks, they have hardware resources that are sitting around because they
want to provide better experiences for their end users. So like, if you go to Snowflake today,
and you hit start on a cluster, that cluster will be available to you immediately, more or less. And it's not because virtual machines start up instantly for Snowflake where they don't for you; it's that they keep a warm set of instances running so that they can hand them to you when demand spikes. So for Cackle, it's a question of how you go about using these multi-tenant resources versus the fixed, dedicated end-user resources: how do you decide how much is dedicated to an individual user versus this kind of multi-tenant elastic compute that's sitting on the side? So
each of them actually has these elastic pools of compute, but it's a question of how
they are actually used to execute workloads. Yeah, so as long as you can assign dollars to these things, I think there are... you know, you probably don't want to just plop a research system into a commercial system and hope for the best. I have no doubt that there are additional problems that need solving, but I do think that this is a solution that might be adapted into some of these commercial systems.
I mean, back yourself, let's get it out
there, Matt. I'm sure it'll be just fine. Cool. Yeah. So we can talk about the implementation
a little bit more there. How easy was it to implement this in terms of implementation effort?
Was it quite an arduous task to sort of get this working from scratch?
So we did have the benefit of the prior work. We published a paper in 2020 called Starling. And Starling, again, was an analytical query engine that was built exclusively on elastic resources, so Amazon Lambda and Amazon S3. And the goal was to never provision anything in that system, to just leave it to these elastic pools.
And so we had the benefit of the execution engine kind of already being around.
But the core implementation parts that were difficult were that you have to interact with these cloud systems, you have to actually start up virtual machines. We had to build this dedicated set of shuffling resources to significantly reduce costs, so there was implementation work in this shuffling layer. The execution engine was largely unchanged from Starling, although we did add a few features. And then the real core of the work is the controlling component, the primary component that decides how many VMs to start, keeps track of which ones are alive, and assigns tasks to individual workers, either on the virtual machine side or on the elastic pool side, et cetera.
How long did it take you? Obviously, you had this existing code base you could build
on top of and refactor a little bit to get here. But yeah, how long was the implementation,
out of interest, Matt? I mean, I think that once it was clear exactly what we needed to do,
the implementation went reasonably fast, like a few months. But after the Starling paper, we went and tried several different things. And there are a lot of things that are
interesting in elasticity. And we eventually chose one problem. And that's the thing that we pushed
out. So once it was clear exactly what the solution was, it was relatively easy to get that going.
But the exploration of solutions took a really long time. So I think what's helpful is to build analytical models of everything you know about the world, so you can try out different solutions, even very simple solutions, to see how they perform. Because if you jump straight to the implementation, it can take a lot longer.
Yeah, you can spend a long time polishing a brick, right? That's the thing: if you haven't had that initial validation of, okay, this looks plausible, then you don't want to spend two years building something that's going to turn out to perform terribly, which you could have figured out straight away by putting together an analytical model. Cool. So I don't know if we actually went into future directions; I think we kind of put the brakes on and changed tack. So can
I always like to know the naming of things and why they're named what they are.
So yeah, what is Cackle and why was that the name that was chosen?
So I have to go back to the Starling paper, which I think Starling is probably my favorite
name for a system.
The idea, if people aren't familiar, is that starlings are these little birds and they fly in these giant murmurations that form these beautiful flowing shapes. So you can make these big things happen with lots of these little components. And that was the idea of Starling.
Cackle was a placeholder name that I couldn't change after submission.
No way, really?
But I chose Cackle for a very good reason, which is that I love hard K sounds. I just think the hard consonant sounds are compelling; there's something about them that's very fun. So Cackle has two hard consonant sounds. That's really the only reason.
Okay, that's cool. When you were saying that, talking about Starling, I thought maybe the collective noun for a flock of starlings is a cackle. But no.
Yeah, but it's just a placeholder name, so we can pretend that's the truth. Really, I wanted to choose a more elegant name, but I just like the hard consonants. That's why it became the placeholder name.
Cool. And yeah, so future research directions then, Matt. Where do you folks go next with Cackle?
So I should say that, for me, there's one more work in the pipeline, which is not directly related to these prior works, but I'm finishing up my PhD; I'm actually defending a week from recording. And
then afterwards, I'm headed to Microsoft Research. So I don't know what the journey holds for me
there. I might be working on similar problems. I have no idea. But in terms of future interesting
directions in this kind of elasticity space, I think both Starling and Cackle make the assumption
that whenever you read base table data,
that you're going to pull it from cloud object storage, which to be fair is what a lot of these
systems do. At the end of the day, all of this data is stored in cloud object storage,
whether it's visible to the end user or not. Systems like Redshift, internally,
they're storing things in cloud object storage because then you can easily do elasticity. But what's missing from some of this
work is revisiting caching or revisiting buffer pool management in the context of these elastic
systems. So I mean, everyone knows that caching is efficient. And if you have repeated queries
on the same data, it's probably going to be efficient to keep copies of that data on these
instances. Interestingly,
sometimes these things are compute bound and not IO bound. So in some cases, it doesn't end up
mattering that much, but in a lot of cases, it really does. And it's difficult to balance elasticity with caching, because these resources may disappear underneath you. So figuring out how to do that
well and to meet end user preferences is pretty important. So sometimes users really care about
being low cost. Sometimes they really care about being performant. And what's really interesting
in these elastic scenarios is it exposes the cost of these things in a way that I think
provisioned systems typically do not. If you provision X number of virtual machines, you have a fixed set of memory to work with, and you want to make the best use of that memory all the time. But if you're not executing anything and you have to pay for that memory to be available to you, then exactly what the value of keeping it around is will depend on what the end user actually wants.
So how do you manage those kinds of questions, end user preferences, performance, and cost
in these elastic scenarios is a direction that I think is very interesting. And these commercial
systems are working towards trying to do some of these things, but I'm not sure they're fully
there yet. So I think research, academic research can do a lot in this direction.
Yeah.
First thing, good luck for next week.
I'm sure you're going to absolutely be fine.
You're going to ace it.
It'll be great, I'm sure.
And congratulations on the position at Microsoft Research as well.
That'll be fantastic.
And you never know.
You could be working on analytical databases there as well.
And you might be able to do some really cool stuff on factoring in end user preferences and things like that as well. That's fantastic. Cool. So I guess let's talk about the impact of Cackle. Has there been any impact in the short term that you've seen, in terms of interactions with industry? And, maybe also thinking longer term, what sort of impact would you like Cackle to have if we revisit this paper in 10 years' time?
So, for example, I mean, I think what I would like to happen is for these
systems to simply be more elastic for their end users while trying to meet their needs. One core idea of Cackle is that the user is probably pretty bad at estimating what their hardware resource requirements are.
And Cackle is very workload-driven in the sense that whatever their hardware resource requirements
are, you'll try to provision for them. And I think the current way that systems like Amazon
Redshift and Databricks and Snowflake and the like, Microsoft and Google, they all have their
own things. But figuring out how to make elasticity much easier to use, and removing the sense that you have this fixed set of resources that are consistently available to you and you can almost never spill out beyond them.
Like occasionally you throw in a large query that suddenly needs a lot more hardware resources
and you know, it may not go as fast as you actually want.
So figuring out how to integrate
this elasticity, given that we're in the cloud, is something that I think these systems could
take away. As far as what developers or data engineers could leverage in the findings of my
research, I think that for people who have used these systems before, people have complaints.
They're hard to use at the end of the day, and the more you complain to the providers about the difficulty of using these systems, I think the easier things are going to be moving forward. So this work is not something that's probably easily usable by an end user, but the more people understand what is possible in these systems, the more they can push the providers to make some of these adjustments. And I do think, to be fair, the providers of these systems understand
that things could be better, but they're slow-moving behemoths. And the more they're pushed,
the more they're likely to try to make these systems easier for end users to use.
So if you have problems with provisioning your analytical database, start complaining. That's my message.
Yeah, you hit the nail on the head there with that usability angle in systems. It's such an underappreciated quality, especially maybe sometimes from the academic perspective, where the focus is more on: let's make this go faster, let's get my throughput a bit higher, my latency a little bit lower. That's all good and well, but that dimension of usability is really, really important. People who can crack that whilst also getting the performance there, and the low cost in this case, that's the secret combination to success, I think.
Cool. So for the next question, now you're coming towards the end of your PhD, Matt, maybe we can do a little bit of a reflection on this project, and maybe on your PhD as a whole as well. What's been the most interesting lesson you've learned while working on Cackle, and then the same question over the journey of your PhD?
Yeah, I think what I've learned through this whole Cackle project is the value of
de-risking things early and making sure you understand all of the relevant components. After the Starling paper, as I said, there were 10 different directions we could go in, like cost versus performance, or delaying work. We tried a lot of these things.
And we also tried to implement some of these ideas directly in a system.
And it turned out that just getting that system up and running on the elastic pool was slow enough that it defeated all of the benefits of doing so. So that was kind of a wasted several months. So, you know, de-risk things early, cut off a piece that you can actually digest rather than trying to bite off more than you can chew, and talk to smart people; you've got to talk to those people
early.
And then as far as the PhD goes: a lot of people believe things are impossible, and they may have good reasons for thinking this is not the right way of going about things.
So I don't want to throw too much shade on other research groups.
But there was a kind of popular paper in a research conference that suggested that using
cloud function services or these functions as a service, things like Amazon Lambda, was
not a promising direction going forward.
And at a high level, I agree with the sentiment
that you probably don't want to rely on these third-party providers
to do all of this provisioning for you.
But I think the lesson they missed by ignoring these was: how can you gain the benefits of elasticity, and can we start exploring these things as sources of elasticity that we can use? And there are still interesting research things that you can do, even if that's not the right solution at the end of the day.
So we started this project, the Starling project, this kind of whole serverless research direction on the assumption that it was a terrible idea.
And in some respects, I don't think you should drop Starling into your middle-sized enterprise and start using it. But there are research ideas that came out of that that are valuable. And yeah, those are lessons learned: listen to the smart people, but smart people haven't thought about every problem in the world. So sometimes, if you think something's interesting, just push a little bit on it and see if there's anything there.
Yeah, I think that's really nice advice. I like what
you're saying about de-risking, because I have this thing, still to this day: I find it very hard to stay focused on one task, like there's always a new shiny thing, right? I can go do this, I can go do that. But I like the fact that you had that awareness early on, like, okay, we need to make things tractable, bite off exactly what you can chew, and don't have eyes bigger than your stomach. So I really, really like that.
it's very easy to get distracted by alternative interesting problems or to try to think you're
going to solve everything in one paper. But a lot of times,
you know, you just can't do that, and it's hard to make progress in those cases. So de-risk what you can push forward. And despite Cackle not solving every single problem in the world, I think there are valuable research lessons there. And it was a hard-fought lesson, I will say.
Yeah. Cool.
So my next question is normally around the origin story of the paper, and maybe we've covered this already, but it'd be nice to go back to the original proposal for this line of work. How did that evolve originally? Was it because you saw this paper from this other group and thought, hang on a minute, we can take that assumption but do something cool? How did the paper, and this line of research, actually evolve in the first place?
Yeah, so for Starling it's pretty
straightforward. We had seen some research in the community, not really in the database community, but in systems and networking kinds of places, which would employ cloud function services, specifically Amazon Lambda, for embarrassingly parallel, or pretty close to embarrassingly parallel, work where you can break things up into chunks.
And so it's very great for these kind of bursty scenarios.
But for analytical databases, you have more sophisticated communication patterns.
So it wasn't totally clear that that was a good direction.
So I was intrigued by this elasticity of functions as a service.
And the kind of crazy idea was,
could we build an analytical database
on top of these things?
And the first thing that we did was try to run the simplest queries that we could, to find exactly where the problem spots were. So within, I don't know, a few weeks, it was clear that there was something there. We originally assumed it was a terrible idea. But again, there's the de-risking early; I should have learned that lesson that day, but it took longer than that. In any case, that's really the origin story. We thought it was a terrible idea, and it turns out that by polishing and getting the engineering right, there are benefits to
pursuing this as a strategy. And again, if there's people at Snowflake, Databricks,
or Amazon listening to this, I don't suggest that you start building on Amazon Lambda exactly. But
the lesson of what the benefits of elasticity are to end users was the main thing there.
Yeah.
I mean, I think this is kind of an extension of that work, which broadens it out to a wider range of workloads.
The Cackle paper backstory was that we knew at the end of Starling that for ad hoc analytics, Starling was great.
Like, it kind of fits this niche perfectly where you're just an end user.
You don't want to care about the resources ahead of time.
You just want to submit queries and get results
and you don't want to care about what the underlying hardware is.
I think there's been progress made since then.
But at the end of that paper,
we knew that for a wider range of workloads,
it was going to be cost prohibitive to do this.
So that's really the origin of this Cackle work, where we started thinking about delaying work, the Pareto frontier of cost versus latency of queries, and looking at what real query workloads look like. That's where Cackle came from. Exactly which problem in that space we chose to bite off took a while, but we wanted to focus on something in this elasticity, workload management, and hardware provisioning space.
Nice. It's always great to hear how these papers come about, because I just arrive at this beautiful end where they're published in nice conference proceedings, and it's always nice to see the path that led there in the first place. I guess tangentially, tangentially, can't speak today,
tangentially related, I'm going to go for closely related
because I can't seem to say that word today,
is the idea of being creative and generating ideas
and then once you've generated those ideas,
not going off in 10 different directions, right?
So like actually selecting the one to work on
and do some de-risking. How do you approach that process of idea generation and then selecting projects, Matt? Do you have a systematic way of doing it, or is it more ad hoc?
I mean, I think the number of opportunities you have to choose projects is somewhat limited. If you dedicate yourself to a project, it's going to take months, minimum.
So the number of chances you actually have to do this
is somewhat limited.
So the best advice I can give is,
if you can, surround yourself with people
who are a lot smarter than you
or know things about the space that you may not
and that are open to new ideas.
So if an idea pops into your head,
you should express it.
And maybe early in your PhD or early in your career, you may not know that this idea is
good or bad.
And having people who can honestly and openly evaluate that idea that are around you is
valuable.
Or at least like, oh, people did this 10 years ago.
It didn't work for X, Y, Z reasons.
And then you can say, well, did something change in the last 10 years? Or do those lessons still hold? Personally, I think looking at the frontier, at what's missing in the commercial space, like what are people wanting to do that they can't do now, is helpful.
So I like looking at kind of new technologies. I like looking at things that
people are doing that are kind of a pain to do today, but maybe there's not like great technical
reasons that they can't do it. That's just like people haven't worked on the engineering of this
problem space enough. So that's the best thing I can think of. Like surround yourself with ideas,
try to come up with a handful of your own, and then have a few people around you who are
decent at evaluating those ideas and gather opinions from a lot of them, but are also open to being wrong about stuff sometimes.
Acknowledging when, you know, you don't know something, that's the most valuable thing
I've found in people. Some people have very strong but ill-informed opinions, or they can't express why they think an idea is bad. You want to surround your ideas with open-minded, creative people. And you know, this is a time-tested tradition: if you find yourself to be the smartest person in the room, you've got to find a new room. So I've tried to do that in my career, and I hope that I can keep doing so moving forward.
Yeah, I like that as well, this idea of being surrounded by smart people, because especially early on in your career, it takes a lot of time to develop that intuition of being able to evaluate whether an idea is good or bad. You aren't born with that; it's something you accumulate through experience, through conversation, through discussion, through trying and failing. So yeah, definitely, having people around you can help you make those decisions and hone that skill of being able to say, yeah, this is good, this is bad. And sure, if you're always the smartest person in the room, you're in the wrong room, right? It's a great quote.
Yeah, and furthermore, aside from the technical stuff, you know, I'm at the end of my PhD now: you've got to find places where you want to be with the people too. Technical prowess is one thing, being around smart people is one thing, but if those smart people are not very kind, then it's not going to be a very fun time. So if some of you listening are early-career PhDs, or you're deciding where to do your PhD, I can't emphasize enough how being around nice, understanding people, but people who will still push you, is very valuable. I've been very blessed in my career to be surrounded by those people, so try to find those rooms. That's my best advice.
Yeah, that's absolutely great advice. Lovely. Cool, so yeah, we're at the time for the last word. I need some theme music for this last word, like a ba-dum-tss. Cool, but anyway.
So yeah, what's the one thing you want the listener
to take away from this podcast today?
That is a hard thing to answer.
I think I want people to be dissatisfied.
You know, it's not as though you're going to come up
with the best solution tomorrow,
but be dissatisfied with the world and try to find interesting, fun solutions, and even simple solutions. The best solution out there is not the most complex one; it's the one that everyone will adopt.
Yeah, love that. A great message to end on. Matt, thank you so much for coming on the pod today. It's been a great chat, and I'm sure the listener will have loved it as well. And
we'll drop links to all the things we've chatted about, all the papers and whatnot in the show notes as well.
And yeah, we'll see you all next time
for some more awesome computer science research. Bye.