Disseminate: The Computer Science Research Podcast - Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57
Episode Date: July 22, 2024

In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, highlighting the need for a more balanced strategy.

Matt introduces us to a novel strategy that combines the best of both worlds: the rapid scalability of cloud functions and the cost-effectiveness of virtual machines. This approach leverages fast but expensive cloud functions alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly.

Links:
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]
Matt's Homepage

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hi everyone, Jack here from Disseminate the Computer Science Research Podcast.
Welcome to another episode in our ongoing Cutting Edge series.
Today's focus will be analytical query workloads,
which some of you may know are quite hard to predict in terms of their resource demands,
which makes provisioning a bit of a challenge.
And to solve this problem for us, we've got Matt Perron with me.
Matt is a PhD student in the MIT Data Systems Group, and his primary research is on making analytical database systems easier to use. So welcome to the show, Matt.
Thank you. Happy to be here.
Great stuff. So in the tradition of the podcast, we always start off by getting you to tell the listener more about yourself and how you ended up interested in database research, and specifically why analytical databases as well, I guess.
Sure. Yeah. I think I have kind of a non-traditional computer science background.
I mean, none of the people in my family had done PhDs, so this was all kind of new to me.
So I started my career or I guess my education at Rochester Institute of Technology in
Rochester, New York. I would describe myself as a generalist, but with an interest in systems kinds of problems. And furthermore, I had an interest in Japan. So I ended
up studying abroad for a year. And then after that, I actually worked at SoftBank in Tokyo for a few years. And I got plopped into a distributed key value
store project. And I felt wildly underqualified on all of these problems. So after a couple of
years of working there, I decided to go back and get a master's degree with absolutely no intention
of doing a PhD or research or anything. And then at Carnegie Mellon University, I started this
master's degree.
And there I bumped into Andy Pavlo, who's a great database systems researcher and a super nice guy.
And I took his database systems class. He had lectures about these components that I had been using when I was working at SoftBank. And I just thought all of this was great, but it was kind of general database systems stuff at the time. Andy Pavlo convinced me to do a PhD, and I thought doing research was great. And I applied
around, I got into CMU of course, and MIT and a couple other places, but I ended up at MIT and
there, the kind of first projects I worked on were related to analytical database systems rather than
transaction processing or other things. So while I had my first taste of research at CMU, I think I really delved into analytics at MIT. And as you'll see, we worked on this kind of elastic pool technology, or serverless systems, which came a little bit later. And that's part of what we'll
talk about today. So I never really intended to do a PhD, but people kept convincing me to move forward with my education, and now that I'm at the end, it's been a lot of fun.
Yeah, that's an awesome story.
It's always great to see how people kind of end up where they do.
And I'm super jealous about you doing the study abroad
and getting to live in Japan for some time
because that's kind of my one regret when I look back at my own journey. I always wanted to live abroad at some point and I never did it, and I think doing it as part of your studies is a really nice way to do that. Obviously, I've had a few research visits here and there, but it's still something on my bucket list, and Japan's very close to the top of that list, so I'm very envious of you on that one, Matt.
Yeah, I figured at the time, if I don't do this now, I'm never going to do it, so I might as well give it a shot. And I had made good friends there when I studied abroad, so it was nice to return to the city.
Yeah, you've always got somewhere to crash when you go back over there then as well.
Definitely.
Cool. Right, let's get back
on topic then. So we've mentioned a few key terms today, so for the uninitiated listener, let's set some context for the chat. Can you start by telling us in some detail what analytical databases and workloads are, and what are elastic pools?
Sure. So let's start with analytical databases and workloads. Oftentimes, if you've taken an undergraduate database systems class, they tend to focus on transaction processing workloads, so things like bank workloads, where the amount of data that you touch with any individual query tends not to be that much, but you have a lot of
transactions that need to happen all at once. So sometimes in the millions of transactions for some systems,
whereas in analytics, the focus is on extracting value from large volumes of information. So you could imagine terabytes or petabytes of information, with different tables joined together, and then you output a small number of aggregates.
So there's some processing that goes on. You think about the number of queries being lower, but the work of each individual query may be quite large. Because individual queries can consume a lot of resources,
if you kind of mix a lot of these things together, you might expect, well, it smooths out as you
increase the number of queries. But it turns out that individual queries can consume tons of resources all at once. So if you add a lot of analytical queries together, it's not true that
you just end up with kind of a smoothly changing demand curve of resources. You often get huge
spikes as individual queries could touch petabytes of data, require dozens of cores of compute,
lots of memory to process these things in an efficient way.
So analytical workloads have that property that they can be repetitive in that sometimes they're
used for dashboards or regular reporting tools where the workload is very consistent, but
sometimes they're used directly by end users or through tools in ways that will spike the resource
demands of the workload very,
very quickly.
Cool, yeah. So with that then, tell us more about elastic pools, these sorts of resources that cloud providers have and give us access to. Where do they fit into this picture of provisioning resources?
Sure.
I want to draw kind of a contrast between what I'll call like a classic cloud or kind of like
the cloud from 10 years ago view of the world versus what I'll call
elastic pools. So typically, when you thought about the cloud, say a decade ago, you had the ability to provision additional hardware resources, virtual machines, disks, those kinds of things, and you could start them up and shut them down in, you know, an hour, a few minutes, something like that.
But the burden of that provisioning or
that decision was left to the user. You decide how many VMs need to be used. In contrast to that,
I want to describe a set of systems, which I'll call elastic pools of resources. So if you're
familiar with cloud object storage systems like Amazon S3, you don't decide how much storage space you need ahead of time. Just by interacting with the system, by doing gets and puts against the storage system, the system provides you with the hardware resources that you need without you ever having to decide ahead of time what you actually need. Similar to that, there are cloud function services,
which I'll also group in this kind of elastic pool category. So if you're not familiar with
these services,
essentially what they allow you to do is write some piece of code, upload that code to this Cloud Function service, and then you'll invoke the code through the service. And the cloud provider
takes care of all of this provisioning, deciding how many machines to run, exactly where to run.
They handle all of the provisioning and the assignment of tasks to that provisioned
hardware. And this isn't only available from the cloud providers. If you're a large enough service provider, someone like Snowflake, you could potentially run a set of virtual machines in a multi-tenant way such that you could hand them out to users on demand, rather than users having to choose how much they need ahead of time.
And this is true of Databricks, Amazon, Microsoft, et cetera.
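To make that concrete, here is a minimal sketch (my own illustration, not something from the episode) of what invoking a cloud function looks like with AWS Lambda via boto3. The function name and payload shape are hypothetical placeholders; the point is that you upload code once and the provider handles all provisioning whenever you invoke it.

```python
import json

import boto3  # AWS SDK for Python

# Minimal sketch of invoking a cloud function (AWS Lambda) on demand.
# "my-query-task" and the payload shape are hypothetical placeholders;
# the provider decides where, and on how many machines, this runs.
lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="my-query-task",      # code uploaded to the service ahead of time
    InvocationType="RequestResponse",  # block until the function returns
    Payload=json.dumps({"task_id": 42, "input": "s3://bucket/part-0001"}),
)

result = json.loads(response["Payload"].read())
```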
So given this picture, we know the challenges of analytic workloads, and I found it really interesting that you'd almost assume the law of large numbers applies, in the sense that it would become this nice smooth curve, but it's still very bursty at times, which is really fascinating. And then there's this infrastructure we have available to ourselves in terms of these elastic pools. You've published a paper at SIGMOD, hence why we're chatting, called Cackle. So can you give us the elevator pitch for Cackle, given this backdrop?
Sure.
And I just want to come back briefly to the description.
And I think with sufficient volume, even though individual queries can be large, you can smooth these things out, which is why these kind of multi-tenant elastic pools of resources
can still make sense or are still valuable.
But for individual users, if you have to decide how much hardware to provision for your workload, and it's not the biggest workload in the world, this can be really hard.
So the idea of Cackle is: given that the resource demands of individual users' or individual firms' workloads are spiky, difficult to predict, and change pretty rapidly over the day,
how can you get the benefits of elasticity?
That is to say, you never have to decide how many hardware resources you need ahead of time, and at minimal cost.
So it turns out that, and this is some work that we had done prior to this through a project
called Starling, where we just built an analytical database system on elastic pools.
And this works great for what I'll call ad hoc single user
analytical workloads where maybe there's someone sitting at their computer and they want to do
analysis of a moderately large data set and a terabyte, 10 terabytes of data. And they just
want to run a handful of queries every couple of hours or so. For that, you can just run on elastic pool resources, though not exactly in a naive way; you still have to make a bunch of optimizations that are described in that prior work on Starling.
It turns out that if you have anything except this kind of infrequent set of workloads,
or you have a more consistent workload, it can be very, very expensive to exclusively
rely on Elastic Pools of resources.
So the pitch of Cackle is that we can mix in lower cost virtual
machines. So you'll provision some virtual machines that are dedicated to a user. And these tend to be
relatively slow to start up. So say in the range of tens of seconds to minutes, but they tend to
be less expensive on a per unit time basis than an elastic pool is. And this is because there's
no magic in the world.
So you have to pay for this elasticity to be available; someone has to pay. Exactly what the cost gap between a virtual machine and an elastic pool is depends on a lot of factors: economics, how many users there are, the needs of the cloud provider at the time. But there is some kind of
cost difference there. And so how can you gain kind of the benefits of this, the illusion of
this completely elastic hardware while still maintaining kind of the low cost of virtual
machines or hardware resources that are dedicated to individual users? That was the kind of core
idea of Cackle. And it's actually very difficult to do this because all these components change.
Workloads change.
The costs of things can change.
Startup times of virtual machines can change over time.
Or even like end user preferences can change.
So the goal of Cackle was to find a way of gaining the illusion of rapid elasticity while still relying on lower cost resources, and then spilling out into the elastic pools of resources to fill the gaps when you don't have enough hardware resources available.
Well, there are a lot of variables here to try and tackle to do this effectively. So let's get into the details of Cackle and how you went about trying to solve this problem. Maybe start off with the key components of Cackle?
Sure. So if you're familiar with systems like Spark, or Databricks, or Presto,
in general, the way that you execute queries, especially these large analytics queries,
is a user submits some SQL that gets transformed eventually into an execution plan, which is
a directed acyclic graph of kind of stages of compute that are connected. And so you have a
number of these stages, which are connected in a graph, and each of those stages will have a number
of tasks. And in between those stages, I should say, you have to exchange data between compute resources. It's not embarrassingly parallel like some workloads are. You actually have to do this exchange.
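As a rough illustration of that structure (my own sketch, not Cackle's actual data structures), an execution plan can be modeled as a DAG of stages, each with many parallel tasks, with a data exchange wherever one stage consumes another's output:

```python
from dataclasses import dataclass, field

# Toy model of an analytical execution plan: a DAG of stages, each with many
# parallel tasks; data is exchanged (shuffled) between connected stages.
@dataclass
class Stage:
    name: str
    num_tasks: int
    upstream: list["Stage"] = field(default_factory=list)  # stages whose output this one consumes

scan_a = Stage("scan_lineitem", num_tasks=256)
scan_b = Stage("scan_orders", num_tasks=64)
join = Stage("hash_join", num_tasks=128, upstream=[scan_a, scan_b])  # requires shuffling both inputs
agg = Stage("aggregate", num_tasks=8, upstream=[join])               # another exchange, far fewer tasks
```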
So the execution part of Cackle is divided into two pieces. One is the compute side,
where you break up the work that you have to do into manageable tasks that you can assign
to hardware resources.
And then for the communication parts, we have kind of shuffling mechanisms. And that's also a
mix of both virtual machines and an elastic pool. So in this case, Amazon S3 is what we implemented
on. So the kind of core components are a set of VMs that do compute, a set of VMs that are
responsible for shuffling, and then elastic
pool alternatives for both of those. So in the actual implementation of Cackle, we use Amazon
Lambda as our source of elasticity. And for the shuffling, we use Amazon S3. But above that,
there's a controlling component. You can change the number of virtual machines over time. And
that's the core thing that Cackle does. It's trying to figure out: how many virtual machines do you need for compute?
How many virtual machines do you need for shuffling over time? And that's the kind of
core mechanism of Cackle, deciding what the split of virtual machines versus elastic pools is.
And I want to add just one thing there. If you're familiar with systems like Redshift or Snowflake or Microsoft Fabric, in general, when you submit queries and you've submitted too many, queries can back up into a queue and slow a lot of things down. Whereas in Cackle, we're making the assumption that any query that you submit needs to be run right now. You really care about the latency. So our goal is to never delay work, never delay the execution of your query. Instead, we're going to spill out into these elastic components rather than restrict ourselves to only using virtual machines.
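A heavily simplified sketch of that never-delay policy (my own, with hypothetical helpers standing in for the real VM pool and the Lambda integration): try a dedicated VM first, and if none is free, spill the task to the elastic pool rather than queueing it.

```python
# Sketch of the "never delay work" dispatch policy described above.
# run_on_vm and invoke_cloud_function are hypothetical stand-ins for
# the real VM workers and the AWS Lambda integration.

def run_on_vm(worker, task):
    return f"task {task} running on VM {worker}"

def invoke_cloud_function(task):
    return f"task {task} spilled to the elastic pool"

def dispatch(task, free_vm_workers):
    if free_vm_workers:
        # Prefer a dedicated VM: cheaper per unit of compute time.
        return run_on_vm(free_vm_workers.pop(), task)
    # No VM free right now: spill to the elastic pool (e.g. Lambda)
    # instead of queueing the task and hurting query latency.
    return invoke_cloud_function(task)

print(dispatch("q1-stage0-task3", ["vm-0"]))  # runs on the VM
print(dispatch("q1-stage0-task4", []))        # spills to the pool
```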
Nice. Cool. Yeah. So those systems have this admission control sort of scheme, where you put things in a queue. I'd like to get into this a little later in the podcast, but the question I'm probably going to ask you at some point, and I need to get it out of my head now, is: if you allowed a user to assign priorities to the queries that are coming in, to say, okay, this one's a lower priority, could that change this whole mechanism, as in, okay, I don't need to schedule so much now? That would be another variable to factor in as well, or is that really not something? I guess you're relaxing that constraint of "this needs to execute now", which maybe makes things easier or harder. I'm not really too sure.
So we don't cover this particular case of prioritization in Cackle, and we can come back to that a little bit later.
But especially in the kind of future work sections, I have some thoughts on interesting problems in that space. But the reason that we defer this prioritization is the idea that
even though hardware resources in the cloud are somewhat slow to start up, we're talking about
queries in general that may last tens of seconds to a few minutes. So if you have to delay their
execution until hardware resources become available, that can significantly impact query
latency. So if you're not very latency
sensitive, you can always wait for virtual machines to start up. So we don't cover those
cases because you could just cover them by waiting for these resources to start up. That's not
something we explicitly cover in Cackle, but I think it's an interesting direction for future work.
Cool. Sure. So you said that the key secret sauce of Cackle is deciding how many of these VMs and machines to provision for each of the different categories you've got: for compute, for shuffling, and for the elastic pool. So tell us about the different approaches you considered for solving this problem of allocating resources to these different
categories. Sure. So I want to say like upfront that, you know,
the first thing that you'd consider doing in these cases is just kind of predict what your workload
is going to be ahead of time and try to ensure that hardware resources are available. But I want
to come back to the kind of background that we talked about. And a lot of the queries that users
will submit are, you know, there's a human at the other end and they may consume a ton of resources.
And we actually observed this in real workloads that we gathered from
Microsoft, Alibaba, and Google.
Some of those are publicly available and some were just shared with us that the hardware
resource demands can spike two, three times, four times in the span of seconds.
So if these things are human generated, at the end of the day there's no predictive algorithm that's going to tell you, oh, the human's going to hit enter in the next five seconds. It's just not going to happen. So you can throw out the perfect predictor assumption right away; you can't predict exactly what your needs are going to be ahead of time. So that's the first thing
that was kind of dismissed. So we tried a bunch of different approaches, but we found kind of
surprisingly, that because in Cackle we never delay work, you can measure essentially what your hardware resource demands were looking back into the past. You can figure out exactly what your needs were. And we use that to inform what the best allocation
strategies are going to be moving forward. So what I mean by an allocation strategy is you'll look
back into this workload history, see what your hardware resource demands were, and then you're
going to choose from a number of extremely simplistic strategies
to find one that minimizes the cost. So in the case of Cackle, we're going to assume that the
elastic pool resources that we have available and the virtual machine resources that we have
available are going to be equal in performance, and we're only going to focus on cost. And we
try to design the system when we actually go to the implementation
such that that's as close to true as possible. So what we actually do in Cackle is we take
strategies of the form, look back some number of minutes or seconds into the past,
try to take the nth percentile, some percentile of that workload, and then a multiplier.
And the intuition there is that you don't want to only ask, what's the maximum I've used in the past? Sometimes you actually want to over-provision in these cases. So we just take whatever the minimum cost strategy was over that period, and that's the strategy we choose moving forward. There is a delay in starting new virtual machines, and there are some restrictions
on exactly how this operates.
But essentially, the mechanism is figure out what strategy worked in the past and choose that moving forward.
What's interesting about this is that even if you make a mistake, if you make a mistake and you didn't provision enough, this becomes a cost problem.
It's not a performance problem because you can always spill out into this elastic pool of resources.
We conceptualize it as something that's essentially infinitely elastic with a fixed cost.
The only difference is how much is each individual strategy going to cost.
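Here is a small sketch of that idea (my own reconstruction from the description, with made-up cost numbers): replay the recent demand history under each simple (lookback, percentile, multiplier) strategy, charge un-provisioned demand to the more expensive elastic pool, and keep the cheapest strategy.

```python
import itertools

# Reconstruction of the strategy-selection idea described above; the cost
# constants and strategy grid are made up for illustration.
VM_COST, ELASTIC_COST = 1.0, 3.0  # per core-second; the elastic pool costs more

def strategy_cost(history, lookback, pct, mult):
    """Cost of a (lookback, percentile, multiplier) strategy over a demand trace."""
    total = 0.0
    for t in range(lookback, len(history)):
        window = sorted(history[t - lookback:t])
        provisioned = mult * window[int(pct / 100 * (len(window) - 1))]
        total += provisioned * VM_COST  # pay for provisioned VMs regardless
        total += max(0, history[t] - provisioned) * ELASTIC_COST  # spill to the pool
    return total

# Bag of simple strategies: (lookback seconds, percentile, multiplier).
strategies = list(itertools.product([60, 600], [50, 95, 100], [1.0, 1.5, 2.0]))

def choose_strategy(history):
    return min(strategies, key=lambda s: strategy_cost(history, *s))

demand = [8, 9, 10, 40, 12] * 200  # toy per-second demand trace, in cores
print(choose_strategy(demand))
```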
So what I've described is kind of the simplistic version of what we do.
In fact, we use some kind of fancy randomized algorithm, this multiplicative weights algorithm,
which I probably don't want to talk about too much today. But essentially what it's doing is trying to find minimal cost strategies over this workload history. And that way we can minimize the cost of the workload going forward.
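For completeness, here is a generic multiplicative-weights sketch (the textbook algorithm, not necessarily Cackle's exact formulation), reusing the strategies and strategy_cost from the sketch above: each strategy keeps a weight that decays with the cost it would have incurred on recent history, and the next strategy is sampled in proportion to the weights.

```python
import random

# Generic multiplicative-weights update over the bag of strategies; a textbook
# sketch, not necessarily Cackle's exact rule. Assumes `strategies` and
# `strategy_cost` from the previous sketch.
ETA = 0.1  # learning rate
weights = {s: 1.0 for s in strategies}

def update_and_pick(recent_history):
    costs = {s: strategy_cost(recent_history, *s) for s in strategies}
    worst = max(costs.values()) or 1.0
    for s in strategies:
        weights[s] *= (1 - ETA) ** (costs[s] / worst)  # penalize costly strategies
    # Sample a strategy in proportion to its weight.
    r = random.uniform(0, sum(weights.values()))
    acc = 0.0
    for s, w in weights.items():
        acc += w
        if acc >= r:
            return s
    return s
```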
Yeah, there are a few variables that jump out as you were talking. How do you decide how far to look back, for example?
So basically, we just have a bag of strategies. Some strategies will look back one minute, some strategies will look back 10 minutes. And the intuition behind
this is that virtual machines can take variable amounts of time to start up. So if you assume that
virtual machines are going to start instantaneously, then you probably never have to use an elastic
pool ever because they'll be available as soon as you request them. But the reality is they usually
take a minute, or they could take five minutes, and exactly what that time is will impact how far back into the past you probably want to look. Like if it takes a day to get new virtual
machines, then you probably want to decide based on looking back a week or so.
And the intuition behind Cackle is that no matter which of these factors change,
no matter how your workload changes or the cost change or virtual machine startup time change or
cost models even, or like minimum billing times, these kinds of things,
Cackle will choose one of these very simplistic strategies that minimizes the cost over time.
And I think potentially you can improve by adding to your bag more sophisticated strategies, predictive strategies, etc.
But what was surprising about Cackle is that even with this kind of extremely simple bag of strategies, you can get pretty close to optimal.
And I'm sure we'll come back to that in the results; not optimal, but, let's say, close to what an oracle could do.
Yeah, that's really cool.
I mean, obviously it depends on how dynamic your workload is, because you're looking back a couple of minutes, depending on the strategy, and you've got this work coming in. How often is it the case that the one you've chosen is immediately rendered not the optimal choice, essentially because the workloads just change so fast? And is there a limit to how fast your workload can change? Because you're always looking back to make a judgment about the future, right, and in the future you can get these black swan events. How does that factor into it as well? Or is it just that most of the time it's close enough to optimal that, over the long term, it works out quite well?
So we tried with a number of different workloads.
We make an assumption that the workload is more or less consistent, but you could imagine a situation where you're wildly under-provisioning for the workload that you have. There's some big event that changes things and suddenly you've under-provisioned. That's going to start costing a lot of money because you're going to
spend money on these elastic pool resources that you would have been better served by having a
virtual machine. So quickly, a strategy that matches your new workload is going to become
the least expensive among those strategies. And
you'll switch to provisioning for this new world. That being said, in this paper, we didn't
investigate kind of these rapid workload shifts that much. I think there's probably some work
that needs to be done to ensure that for all of these kinds of changing situations, you improve.
But what was surprising is that for a wide range of workloads, even these very simplistic strategies worked.
Yeah, cool. There's beauty in simplicity, Matt.
So yeah, cool. So let's talk about evaluation in some more detail then. So first of all,
how did you go about evaluating all these different strategies and everything? What
was your experimental setup? Sure. So we really broke the evaluation up into two separate parts. I mean, the kind of
conditions that exist in real cloud environments don't change that rapidly. The cost of, say, a spot instance virtual machine on AWS could change over the course of a month or so, or even a couple of days. But you don't have control over
these things. So the first thing that we did is we built an analytical model. It's just something
basically we wrote in Python that tries to model all of the important components of the system.
And we took kind of an off-the-shelf set of queries, TPC-H queries, and also some TPC-DS
queries. So these are standard analytical database workload benchmark
queries. And then we changed, say, the arrival rates of these queries. So we wanted to model
what real workloads look like. So real workloads have kind of these unpredictable spikes, as we
described, as well as kind of regular components where every single day or maybe every single hour,
certain queries will arrive. So we tried to vary the workloads as much as possible, changing the period of these peaks,
changing how much randomness, changing how many queries were in there. And then we also changed
the environment in which we assumed that these things were executing. So changing the startup
time of virtual machines, the cost of virtual machines relative to using an elastic pool, and all sorts of these factors. I don't think all of those experiments made it into the paper. But in general, we tried to change as many things about the workload and the environment as possible to ensure that this Cackle strategy was robust. You know, you can't cover every single situation, and of course there are always edge cases, and there are always adversarial cases where you're not going to do the best. But in general, we try to make sure we cover as wide a range of reasonable scenarios as possible.
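As a toy illustration of that kind of workload generation (my own sketch, not the paper's actual model): a regular periodic arrival rate plus occasional random spikes, which you can then vary in period, amplitude, and randomness.

```python
import math
import random

# Toy spiky-workload generator in the spirit described above (not the
# paper's actual model): a regular daily component plus random spikes.
def arrival_rate(t_seconds, period=86_400, base=5.0, spike_prob=0.001):
    diurnal = base * (1 + math.sin(2 * math.pi * t_seconds / period))
    spike = random.choice([10.0, 20.0]) if random.random() < spike_prob else 0.0
    return diurnal + spike  # queries per second at time t

rates = [arrival_rate(t) for t in range(3600)]  # one simulated hour
```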
Cool. Yeah. And what were the results from the analytical model? Did they give you good hope that this was the right thing to be doing before then going and actually implementing it for real? I guess it was a litmus test in that sense, the analytical model.
But yeah, what were the results from it?
So the results essentially concern the set of strategies that we compared against. We compared against some relatively simplistic strategies, like fixed provisioning, which is probably what you would do if you're deciding how many virtual machines to run: you just fix a number of virtual machines and continue that over time. We also tried simple predictive models, like a linear fit to the last few minutes, to try to show how predictiveness doesn't work particularly well. And what I want to convey is that if you take
into account the important components, if you take into account workload shifts, the cost of these resources, which is one thing that is not often considered, like what is the
cost differential of doing these things, then you probably will end up at a reasonably good strategy.
I think there are improvements to Cackle that could be made that would push us a little bit cheaper. But overall, compared with reasonable baselines, we found that Cackle was the best performing among them. And furthermore, we also added an oracle, which has perfect knowledge of the workload into the future and can allocate hardware resources with that future knowledge when demand spikes. And Cackle was pretty close to this oracle in a wide range of these scenarios. So that's kind
of why we stopped the optimization there. We didn't want to kind of polish too hard.
So did you compare against real systems, Matt? How did Cackle fare against what's out there at
the moment? Yeah. So it turns out that we're not the first people to notice that analytical
workloads change over time. And we compared against the main commercial options in the space, people like Snowflake, Databricks, and Redshift. And essentially what all of those systems did at the time of our evaluation was
they'll kind of wait until queries back up into a queue and then they'll provision
you additional clusters of compute in addition to that. So you kind of get this kind of stepwise
increase and then decrease in those systems. But the issue with that is because they wait until
queries queue, the latency during those periods of queries can spike really rapidly. So what we
find when we compare against these real systems is that, well,
these commercial systems have the property that they're either really expensive because they'll
be over-provisioning at times when they don't need to be, or when workload spikes, they tend to
have very high latency, whereas Cackle is able to maintain really consistent performance and cost
across a range of these scenarios,
especially as you increase your workload. So it doesn't suffer the problem of missing these
demand spikes because you can fill in these gaps. And furthermore, it doesn't keep a bunch of
resources around when you don't actually need them. It tries to make a kind of reasonable
provisioning decision.
How likely do you think it is that Cackle could be integrated into one of these existing commercial systems?
You know, all of the big providers in this space, people like Snowflake, Microsoft, Amazon, Databricks, they have hardware resources that are sitting around because they
want to provide better experiences for their end users. So like, if you go to Snowflake today,
and you hit start on a cluster, that cluster will be available to you immediately, more or less. And it's not because virtual machines start up instantly for Snowflake where they don't for you; it's that they keep a warm set of instances running so that they can hand them to you when demand spikes. So for Cackle, it's a question of how you go about using these multi-tenant resources versus the fixed, dedicated end-user resources: how do you decide how much is dedicated to an individual user versus this kind of multi-tenant elastic compute that's sitting on the side? So
each of them actually has these elastic pools of compute, but it's a question of how
they are actually used to execute workloads. Yeah, so as long as you can assign dollars to these things, I think there are... you know, you probably don't want to just plop a research system into a commercial system and hope for the best. I have no doubt that there are additional problems that need solving, but I do think that this is a solution that might be adapted into some of these commercial systems.
I mean, back yourself, let's get it out
there, Matt. I'm sure it'll be just fine. Cool. Yeah. So we can talk about the implementation
a little bit more there. How easy was it to implement this in terms of implementation effort?
Was it quite an arduous task to sort of get this working from scratch?
So we did have the benefit of the prior work. We published a paper in 2020 called Starling. And Starling, again, was an analytical query engine that was built exclusively on elastic resources, so Amazon Lambda and Amazon S3. And the goal was to never provision anything in that system, to just leave it to these elastic pools.
And so we had the benefit of the execution engine kind of already being around.
But the core implementation parts that were difficult were that you have to interact with these cloud systems, you have to actually start up virtual machines. We had to build this dedicated set of shuffling resources to significantly reduce costs, so there was implementation work in this shuffling layer. The execution engine was largely unchanged from Starling, although we did add a few features. And then the real core of the work is the controlling component, the primary component that decides how many VMs to start, keeps track of which ones are alive, and assigns tasks to individual workers, either on the virtual machine side or on the elastic pool side, et cetera.
How long did it take you? Obviously, you had this existing code base you could build
on top of and refactor a little bit to get here. But yeah, how long was the implementation,
out of interest, Matt? I mean, I think that once it was clear exactly what we needed to do,
the implementation went reasonably fast, like a few months. But after the Starling paper, we went and tried several different things. And there are a lot of things that are
interesting in elasticity. And we eventually chose one problem. And that's the thing that we pushed
out. So once it was clear exactly what the solution was, it was relatively easy to get that going.
But the exploration of solutions took a really long time. So I think what's helpful is to build analytical models of everything you know about the world, so you can try out different solutions, even very simple solutions, to see how they perform. Because if you jump straight to the implementation, it can take a lot longer.
Yeah, you can spend a long time polishing a brick, right? That's the thing: if you haven't had that initial validation of, okay, this looks plausible, then you don't want to spend two years building something that's going to turn out to perform terribly, which you could have figured out straight away by putting together an analytical model. Cool. So I don't know if we actually went into future directions; I think we kind of put the brakes on and changed tack. So can
I always like to know the naming of things and why they're named what they are.
So yeah, what is Cackle and why was that the name that was chosen?
So I have to go back to the Starling paper, which I think Starling is probably my favorite
name for a system.
The idea, if people aren't familiar, is that starlings are these little birds and they fly in these giant murmurations that form these beautiful flowing shapes. So you can make these big things happen with lots of these little components. And that was the idea of Starling.
Cackle was a placeholder name that I couldn't change after submission.
No way, really?
But I chose Cackle for a very good reason, which is that I love hard K sounds. I just think the hard consonant sounds are compelling; there's something about them that's very fun. So Cackle has two hard consonant sounds. That's really the only reason.
Okay, that's cool. When you were saying that, talking about Starling, I thought maybe the collective noun for a flock of starlings is a cackle. But no.
Yeah, but it's just a placeholder name, so we can pretend that's the truth. Really, I wanted to choose a more elegant name, but I just like the hard consonants. That's why it became the placeholder name.
Cool. And yeah, so future research directions then, Matt. Where do you folks go next with Cackle?
So I should say that, for me, there's one more work in the pipeline, which is not directly related to these prior works, but I'm finishing up my PhD; I'm actually defending a week from recording. And
then afterwards, I'm headed to Microsoft Research. So I don't know what the journey holds for me
there. I might be working on similar problems. I have no idea. But in terms of future interesting
directions in this kind of elasticity space, I think both Starling and Cackle make the assumption
that whenever you read base table data,
that you're going to pull it from cloud object storage, which to be fair is what a lot of these
systems do. At the end of the day, all of this data is stored in cloud object storage,
whether it's visible to the end user or not. Systems like Redshift, internally,
they're storing things in cloud object storage because then you can easily do elasticity. But what's missing from some of this
work is revisiting caching or revisiting buffer pool management in the context of these elastic
systems. So I mean, everyone knows that caching is efficient. And if you have repeated queries
on the same data, it's probably going to be efficient to keep copies of that data on these
instances. Interestingly,
sometimes these things are compute bound and not IO bound. So in some cases, it doesn't end up
mattering that much, but in a lot of cases, it really does. And it's difficult to balance elasticity with caching, because these resources may disappear underneath you. So figuring out how to do that
well and to meet end user preferences is pretty important. So sometimes users really care about
being low cost. Sometimes they really care about being performant. And what's really interesting
in these elastic scenarios is it exposes the cost of these things in a way that I think
provisioned systems typically do not. If you provision X number of virtual machines, you have a fixed set of memory to work with, and you want to make the best use of that memory all the time. But if you're not executing anything and you have to pay for that memory to be available to you, then exactly what the value of keeping it around is will depend on what the end user actually wants.
So how do you manage those kinds of questions, end user preferences, performance, and cost
in these elastic scenarios is a direction that I think is very interesting. And these commercial
systems are working towards trying to do some of these things, but I'm not sure they're fully
there yet. So I think research, academic research can do a lot in this direction.
Yeah.
First thing, good luck for next week.
I'm sure you're going to absolutely be fine.
You're going to ace it.
It'll be great, I'm sure.
And congratulations on the position at Microsoft Research as well.
That'll be fantastic.
And you never know.
You could be working on analytical databases there as well.
And you might be able to do some really cool stuff on factoring in end user preferences and things like that as well. That's fantastic. Cool. So I guess let's talk about the impact of Cackle. Has there been any impact in the short term that you've seen, in terms of interactions with industry? And, maybe also thinking longer term, what sort of impact would you like Cackle to have if we revisit this paper in 10 years' time?
So, for example, I mean, I think what I would like to happen is for these
systems to simply be more elastic for their end users while trying to meet their needs. One core idea of Cackle is that the user is probably pretty bad at estimating what their hardware resource requirements are.
And Cackle is very workload-driven in the sense that whatever their hardware resource requirements
are, you'll try to provision for them. And I think the current way that systems like Amazon
Redshift and Databricks and Snowflake and the like, Microsoft and Google, they all have their
own things. But figuring out how to make elasticity much easier to use, and removing the sense that you have this fixed set of resources that are consistently available to you and you can almost never spill out beyond them.
Like occasionally you throw in a large query that suddenly needs a lot more hardware resources
and you know, it may not go as fast as you actually want.
So figuring out how to integrate
this elasticity, given that we're in the cloud, is something that I think these systems could
take away. As far as what developers or data engineers could leverage in the findings of my
research, I think that for people who have used these systems before, people have complaints.
They're hard to use at the end of the day, and the more you complain to the providers about the difficulty of using these systems, I think the easier things are going to be moving forward. So this work is not something that's probably easily usable by an end user, but the more people understand what is possible in these systems, the more they can push the providers to make some of these adjustments. And I do think, to be fair, the providers of these systems understand
that things could be better, but they're slow-moving behemoths. And the more they're pushed,
the more they're likely to try to make these systems easier for end users to use.
So if you have problems with provisioning your analytical database, start complaining. That's my message.
Yeah, you hit the nail on the head there with that usability angle in systems. It's such an underappreciated quality, especially maybe sometimes from the academic perspective, where the focus is more on: let's make this go faster, let's get my throughput a bit higher, my latency a little bit lower. That's all good and well, but that dimension of usability is really, really important. People who can crack that whilst also getting the performance there, and the low cost in this case, that's the secret combination to success, I think.
Cool. So for the next question, now you're coming towards the end of your PhD, Matt, maybe we can do a little bit of a reflection on this project, and maybe on your PhD as a whole as well. What's been the most interesting lesson you've learned while working on Cackle, and then the same question over the journey of your PhD?
Yeah, I think what I've learned through this whole Cackle project is the value of
de-risking things early and making sure you understand all of the relevant components. After the Starling paper, as I said, there were 10 different directions we could go in, like cost versus performance, or delaying work. We tried a lot of these things.
And we also tried to implement some of these ideas directly in a system.
And it turned out that just getting that system up and running on the elastic pool was slow enough that it defeated all of the benefits of doing so. So that was kind of a wasted several months. So, you know, de-risk things early, cut off a piece that you can actually digest rather than trying to bite off more than you can chew, and talk to smart people; you've got to talk to those people
early.
And then as far as the PhD goes: a lot of people believe things are impossible, and they may have good reasons for thinking this is not the right way of going about things.
So I don't want to throw too much shade on other research groups.
But there was a kind of popular paper in a research conference that suggested that using
cloud function services or these functions as a service, things like Amazon Lambda, was
not a promising direction going forward.
And at a high level, I agree with the sentiment
that you probably don't want to rely on these third-party providers
to do all of this provisioning for you.
But I think the lesson they missed by ignoring these was: how can you gain the benefits of elasticity, and can we start exploring these things as sources of elasticity that we can use? And there are still interesting research things that you can do, even if that's not the right solution at the end of the day.
So we started this project, the Starling project, this kind of whole serverless research direction on the assumption that it was a terrible idea.
And in some respects, I don't think you should drop Starling into your middle-sized enterprise and start using it. But there are research ideas that came out of that that are valuable. And yeah, those are lessons learned: listen to the smart people, but smart people haven't thought about every problem in the world. So sometimes, if you think something's interesting, just push a little bit on it and see if there's anything there.
Yeah, I think that's really nice advice. I like what
you're saying about de-risking, because I have this thing, still to this day: I find it very hard to stay focused on one task, like there's always a new shiny thing, right? I can go do this, I can go do that. But I like the fact that you had that awareness early on, like, okay, we need to make things tractable, bite off exactly what you can chew, and don't have eyes bigger than your stomach. So I really, really like that.
it's very easy to get distracted by alternative interesting problems or to try to think you're
going to solve everything in one paper. But a lot of times,
you know, you just can't do that, and it's hard to make progress in those cases. So de-risk what you can push forward. And despite Cackle not solving every single problem in the world, I think there are valuable research lessons there. And it was a hard-fought lesson, I will say.
Yeah. Cool.
So my next question is normally around the origin story of the paper, and maybe we've covered this already, but it'd be nice to go back to the original proposal for this line of work. How did that evolve originally? Was it because you saw this paper from this other group and thought, hang on a minute, we can take that assumption but do something cool? How did the paper, and this line of research, actually evolve in the first place?
Yeah, so for Starling it's pretty
straightforward. We had seen some research in the community, not really in the database community, but in systems and networking kinds of places, which would employ cloud function services, specifically Amazon Lambda, for embarrassingly parallel, or pretty close to embarrassingly parallel, work where you can break things up into chunks.
And so it's very great for these kind of bursty scenarios.
But for analytical databases, you have more sophisticated communication patterns.
So it wasn't totally clear that that was a good direction.
So I was intrigued by this elasticity of functions as a service.
And the kind of crazy idea was,
could we build an analytical database
on top of these things?
And the first thing that we did was try to run the simplest queries that we could, to find exactly where the problem spots were. So within, I don't know, a few weeks, it was clear that there was something there. We originally assumed it was a terrible idea. But again, there's the de-risking early; I should have learned that lesson that day, but it took longer than that. In any case, that's really the origin story. We thought it was a terrible idea, and it turns out that by polishing and getting the engineering right, there are benefits to
pursuing this as a strategy. And again, if there's people at Snowflake, Databricks,
or Amazon listening to this, I don't suggest that you start building on Amazon Lambda exactly. But
the lesson of what the benefits of elasticity are to end users was the main thing there.
Yeah.
I mean, I think this is kind of an extension of that work, which broadens it out to a wider range of workloads.
The Cackle paper backstory was that we knew at the end of Starling that for ad hoc analytics, Starling was great.
Like, it kind of fits this niche perfectly where you're just an end user.
You don't want to care about the resources ahead of time.
You just want to submit queries and get results
and you don't want to care about what the underlying hardware is.
I think there's been progress made since then.
But at the end of that paper,
we knew that for a wider range of workloads,
it was going to be cost prohibitive to do this.
So that's really the origin of this Cackle work, where we started thinking about delaying work, the Pareto frontier of cost versus latency of queries, and looking at what real query workloads look like. That's where Cackle came from. Exactly which problem in that space we chose to bite off took a while, but we wanted to focus on something in this elasticity, workload management, and hardware provisioning space.
Nice. It's always great to hear how these papers come about, because I just arrive at this beautiful end where they're published in nice conference proceedings, and it's always nice to see the path that led there in the first place. I guess tangentially, tangentially, can't speak today,
tangentially related, I'm going to go for closely related
because I can't seem to say that word today,
is the idea of being creative and generating ideas
and then once you've generated those ideas,
not going off in 10 different directions, right?
So like actually selecting the one to work on
and do some de-risking. How do you approach that process of idea generation and then selecting projects, Matt? Do you have a systematic way of doing it, or is it more ad hoc?
I mean, I think the number of opportunities you have to choose projects is somewhat limited. If you dedicate yourself to a project, it's going to take months, minimum.
So the number of chances you actually have to do this
is somewhat limited.
So the best advice I can give is,
if you can, surround yourself with people
who are a lot smarter than you
or know things about the space that you may not
and that are open to new ideas.
So if an idea pops into your head,
you should express it.
And maybe early in your PhD or early in your career, you may not know that this idea is
good or bad.
And having people who can honestly and openly evaluate that idea that are around you is
valuable.
Or at least like, oh, people did this 10 years ago.
It didn't work for X, Y, Z reasons.
And then you can say, well, did something change in the last 10 years? Or do those lessons still hold? Personally, I think looking at the frontier, at what's missing in the commercial space, like what are people wanting to do that they can't do now, is helpful.
So I like looking at kind of new technologies. I like looking at things that
people are doing that are kind of a pain to do today, but maybe there's not like great technical
reasons that they can't do it. That's just like people haven't worked on the engineering of this
problem space enough. So that's the best thing I can think of. Like surround yourself with ideas,
try to come up with a handful of your own, and then have a few people around you who are
decent at evaluating those ideas and gather opinions from a lot of them, but are also open to being wrong about stuff sometimes.
Acknowledging when, you know, you don't know something, that's the most valuable thing
I've found in people. Some people have very strong but ill-informed opinions, or they can't express why they think an idea is bad. You want to surround your ideas with open-minded, creative people. And you know, this is a time-tested tradition: if you find yourself to be the smartest person in the room, you've got to find a new room. So I've tried to do that in my career, and I hope that I can keep doing so moving forward.
Yeah, I like that as well, this idea of being surrounded by smart people, because especially early on in your career, it takes a lot of time to develop that intuition of being able to evaluate whether an idea is good or bad. You aren't born with that; it's something you accumulate through experience, through conversation, through discussion, through trying and failing. So yeah, definitely, having people around you can help you make those decisions and hone that skill of being able to say, yeah, this is good, this is bad. And sure, if you're always the smartest person in the room, you're in the wrong room, right? It's a great quote.
Yeah, and furthermore, aside from the technical stuff, you know, I'm at the end of my PhD now: you've got to find places where you want to be with the people too. Technical prowess is one thing, being around smart people is one thing, but if those smart people are not very kind, then it's not going to be a very fun time. So if some of you listening are early-career PhDs, or you're deciding where to do your PhD, I can't emphasize enough how being around nice, understanding people, but people who will still push you, is very valuable. I've been very blessed in my career to be surrounded by those people, so try to find those rooms. That's my best advice.
Yeah, that's absolutely great advice. Lovely. Cool, so yeah, we're at the time for the last word. I need some theme music for this last word, like a ba-dum-tss. Cool, but anyway.
So yeah, what's the one thing you want the listener
to take away from this podcast today?
That is a hard thing to answer.
I think I want people to be dissatisfied.
You know, it's not as though you're going to come up
with the best solution tomorrow,
but be dissatisfied with the world and try to find interesting, fun solutions, and even simple solutions. The best solution out there is not the most complex one; it's the one that everyone will adopt.
Yeah, love that. A great message to end on. Matt, thank you so much for coming on the pod today. It's been a great chat, and I'm sure the listener will have loved it as well. And
we'll drop links to all the things we've chatted about, all the papers and whatnot in the show notes as well.
And yeah, we'll see you all next time
for some more awesome computer science research. Bye.