Disseminate: The Computer Science Research Podcast - Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57
Episode Date: July 22, 2024

In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, highlighting the need for a more balanced strategy.

Matt introduces us to a novel strategy that combines the best of both worlds: the rapid scalability of cloud functions and the cost-effectiveness of virtual machines. This innovative approach leverages the fast but expensive cloud functions alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly.

Links:
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]
Matt's Homepage
 Transcript
    
                                         Hi everyone, Jack here from Disseminate the Computer Science Research Podcast.
                                         
                                         Welcome to another episode in our ongoing Cutting Edge series.
                                         
                                         Today's focus will be analytical query workloads,
                                         
                                         which some of you may know are quite hard to predict in terms of their resource demands,
                                         
                                         which makes provisioning a bit of a challenge.
                                         
                                         And to solve this problem for us, we've got Matt Perron with me.
                                         
Matt is a PhD student in the MIT Data Systems Group, and his primary research is making analytical database systems easier to
                                         
                                         use. So welcome to the show, Matt. Thank you. Happy to be here.
                                         
    
Great stuff then. So, in the tradition of the podcast, we always start off by getting you to tell the listener more about yourself, how you ended up interested in database research, and specifically why analytical databases as well, I guess.

Sure. Yeah. I think I have kind of a non-traditional computer science background.
                                         
                                         I mean, none of the people in my family had done PhDs, so this was all kind of new to me.
                                         
                                         So I started my career or I guess my education at Rochester Institute of Technology in
                                         
Rochester, New York. And I would describe myself as a generalist, but with an interest in systems kinds of problems. And furthermore, I had an interest in Japan. So I ended
                                         
    
                                         up studying abroad for a year. And then after that, I actually worked at SoftBank in Tokyo for a few years. And I got plopped into a distributed key value
                                         
                                         store project. And I felt wildly underqualified on all of these problems. So after a couple of
                                         
                                         years of working there, I decided to go back and get a master's degree with absolutely no intention
                                         
                                         of doing a PhD or research or anything. And then at Carnegie Mellon University, I started this
                                         
                                         master's degree.
                                         
                                         And there I bumped into Andy Pavlo, who's a great database systems researcher and a super nice guy.
                                         
And I took his database systems class. He had a lecture series about these components that I had been using when I was working at SoftBank. And I just thought all of this was great, but it was kind of general database systems stuff at the time. Andy Pavlo convinced me to do a PhD, and I thought, doing research is great. And I applied
                                         
    
                                         around, I got into CMU of course, and MIT and a couple other places, but I ended up at MIT and
                                         
                                         there, the kind of first projects I worked on were related to analytical database systems rather than
                                         
                                         transaction processing or other things. So while I had like kind of my first taste in research at CMU, I think I really kind
                                         
of delved into analytics at MIT. And as you'll see, we worked on this kind of elastic
                                         
                                         pool technology or serverless systems that came a little bit later. And that's part of what we'll
                                         
                                         talk about today. So I never really intended to do a PhD,
                                         
but people kept convincing me to move forward with my education.
Now that I'm at the end, it's been a lot of fun.
                                         
                                         Yeah, that's an awesome story.
                                         
                                         It's always great to see how people kind of end up where they do.
                                         
                                         And I'm super jealous about you doing the study abroad
                                         
                                         and getting to live in Japan for some time
                                         
because that's kind of my one regret when I look back at my own journey. I always wanted to live abroad at some point, and I never did it. I think doing it as part of studying is a really nice way to do that. Obviously I've had a few research visits here and there, but yeah, it's still something on my bucket list. And Japan's also very close to the top of my bucket list, so I'm very envious of you on that one, Matt.

Yeah, I figured at the time, if I don't do this now, I'm never going to do it, so I might as well give it a shot. And I had made good friends there when I studied abroad, so it was nice to return to the city.

Yeah, you've always got somewhere to crash when you go back over there then as well.

Definitely.

Cool. Right, let's get back
                                         
on topic then. So we've mentioned a few key terms today, so for the uninitiated listener, let's set some context for the chat. Can you start by telling us in some detail what analytical databases and workloads are, and what elastic pools are?

Sure. So let's start with analytical databases and workloads. Oftentimes, if you've taken an undergraduate database systems class, they tend to focus on transaction processing workloads, so kind of like bank workloads, where the amount of data that you touch with any individual query tends to not be that much, but you have a lot of
                                         
                                         transactions that need to happen all at once. So sometimes in the millions of transactions for some systems,
                                         
                                         whereas in analytics,
                                         
the focus is on extracting value from large volumes of information.
So you could imagine terabytes or petabytes of information, with different tables joined together.
And then you output a small number of aggregates.
                                         
                                         So there's some processing that goes on. So you
                                         
                                         think about the number of queries being lower, but that the work of each individual query
                                         
                                         may be quite large. Because individual queries can consume a lot of resources,
                                         
                                         if you kind of mix a lot of these things together, you might expect, well, it smooths out as you
                                         
    
                                         increase the number of queries. But it turns out that individual queries can consume tons of resources all at once. So if you add a lot of analytical queries together, it's not true that
                                         
                                         you just end up with kind of a smoothly changing demand curve of resources. You often get huge
                                         
                                         spikes as individual queries could touch petabytes of data, require dozens of cores of compute,
                                         
                                         lots of memory to process these things in an efficient way.
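
To make the point about spikes concrete, here is a minimal sketch with entirely made-up numbers. Each hypothetical query has a start time, a duration, and a core count; adding them up gives a demand curve whose peak sits far above its average. The query list and the one-second granularity are assumptions for illustration, not figures from the episode or the paper.

```python
# Hypothetical workload: (start_second, duration_seconds, cores) per query.
QUERIES = [
    (0, 5, 4),
    (2, 3, 64),
    (10, 20, 8),
    (12, 2, 128),
    (40, 10, 4),
]


def demand_curve(queries, horizon):
    """Cores demanded in each one-second slot, summed over running queries."""
    curve = [0] * horizon
    for start, duration, cores in queries:
        for t in range(start, min(start + duration, horizon)):
            curve[t] += cores
    return curve


curve = demand_curve(QUERIES, horizon=60)
peak = max(curve)
mean = sum(curve) / len(curve)
print(f"peak={peak} cores, mean={mean:.1f} cores, peak/mean={peak / mean:.1f}x")
```

Provisioning for the peak leaves most of the hardware idle most of the time, while provisioning for the mean leaves the large queries starved.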
                                         
                                         So analytical workloads have that property that they can be repetitive in that sometimes they're
                                         
                                         used for dashboards or regular reporting tools where the workload is very consistent, but
                                         
                                         sometimes they're used directly by end users or through tools in ways that will spike the resource
                                         
                                         demands of the workload very,
                                         
    
very quickly.

Cool. Yeah. So with that, tell us more about elastic pools, these resources that cloud providers have and give us access to. Where do they fit into this picture of provisioning resources?

Sure. So I want to draw a contrast between what I'll call the classic cloud, or kind of like
                                         
                                         the cloud from 10 years ago view of the world versus what I'll call
                                         
elastic pools. So typically, when you started thinking about the cloud, say a decade ago, you were given the ability to provision additional hardware resources, virtual machines, disks, those kinds of things, and you could start them up and shut them down in an hour, a few minutes, something like that.
                                         
                                         But the burden of that provisioning or
                                         
                                         that decision was left to the user. You decide how many VMs need to be used. In contrast to that,
                                         
                                         I want to describe a set of systems, which I'll call elastic pools of resources. So if you're
                                         
                                         familiar with cloud object storage systems like Amazon S3, you don't decide how much storage space
                                         
you need ahead of time.
Just by interacting with the system, by doing gets and puts on the storage system,
                                         
                                         the system provides you with the hardware resources that you need without you ever having
                                         
    
                                         to decide ahead of time what you actually need. Similar to that, there are cloud function services,
                                         
                                         which I'll also group in this kind of elastic pool category. So if you're not familiar with
                                         
                                         these services,
                                         
                                         essentially what they allow you to do is write some piece of code, upload that code to this Cloud Function service, and then you'll invoke the code through the service. And the cloud provider
                                         
                                         takes care of all of this provisioning, deciding how many machines to run, exactly where to run.
                                         
                                         They handle all of the provisioning and the assignment of tasks to that provisioned
                                         
                                         hardware. I mean, this is not only available here. Like if you're a large enough service provider,
                                         
                                         if you're someone like Snowflake, you could potentially run a set of virtual machines in
                                         
    
                                         a multi-tenant way such that you could hand them out to users on demand rather than users having
                                         
                                         to choose how much they need to do ahead of time.
                                         
                                         And this is true of Databricks, Amazon, Microsoft, et cetera.
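
As a rough illustration of the "write some code, upload it, invoke it" model described above, here is a sketch of a function in the style of an AWS Lambda Python handler, which takes an `event` payload and a `context` object. The payload shape (`values`) and the response body are made up for this example, and the call at the bottom is only a local stand-in for a real invocation.

```python
import json


def handler(event, context):
    """Entry point in the style of an AWS Lambda Python handler.

    `event` carries the request payload; `context` carries runtime metadata
    (unused here). The payload shape is hypothetical.
    """
    values = event.get("values", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(values), "sum": sum(values)}),
    }


# Local stand-in for an invocation. In the cloud, the provider decides
# where, and on how many machines, this code actually runs.
response = handler({"values": [3, 5, 8]}, None)
print(response)
```

Nothing in the code says how many machines to use, which is the point: that decision belongs to the provider, not the user.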
                                         
So given this picture, we know the challenges of analytical workloads, which I found really interesting: you almost assume the law of large numbers, in the sense that it would become this nice smooth curve, but it's still very bursty at times, which is really fascinating. And then there's the infrastructure we have available to us in terms of these elastic pools. You've published a paper at SIGMOD, hence why we're chatting, called Cackle. So can you give us the elevator pitch for Cackle, given this backdrop?
                                         
                                         Sure.
                                         
                                         And I just want to come back briefly to the description.
                                         
                                         And I think with sufficient volume, even though individual queries can be large, you can smooth these things out, which is why these kind of multi-tenant elastic pools of resources
                                         
                                         can still make sense or are still valuable.
                                         
                                         But for individual users, if you have to decide how much hardware to provision for your workload, and it's not the biggest workload in the world, this can be really hard.
                                         
So the idea of Cackle is: given that the resource demands of individual users' or individual firms' workloads are spiky, difficult to predict, and change pretty rapidly over the day, how can you get the benefits of elasticity, that is to say, never having to decide how many hardware resources you need ahead of time, at a minimal cost?
                                         
                                         So it turns out that, and this is some work that we had done prior to this through a project
                                         
                                         called Starling, where we just built an analytical database system on elastic pools.
                                         
                                         And this works great for what I'll call ad hoc single user
                                         
                                         analytical workloads where maybe there's someone sitting at their computer and they want to do
                                         
    
analysis of a moderately large data set, say a terabyte or 10 terabytes of data. And they just
                                         
want to run a handful of queries every couple of hours or so. For that, you can just run on elastic pool resources, though not exactly in a naive way: you still have to make a bunch of optimizations that are described in that prior work, Starling.
But it turns out that if you have anything except this kind of infrequent set of workloads,
                                         
                                         or you have a more consistent workload, it can be very, very expensive to exclusively
                                         
                                         rely on Elastic Pools of resources.
                                         
So the pitch of Cackle is that we can mix in lower cost virtual
                                         
                                         machines. So you'll provision some virtual machines that are dedicated to a user. And these tend to be
                                         
    
                                         relatively slow to start up. So say in the range of tens of seconds to minutes, but they tend to
                                         
                                         be less expensive on a per unit time basis than an elastic pool is. And this is because there's
                                         
                                         no magic in the world.
                                         
So you have to pay for this elasticity to be available. Someone has to pay. Exactly what the cost gap between a virtual machine and an elastic pool is depends on a lot of factors: economics, how many users there are, and the needs of the cloud provider at the time. But there is some kind of
                                         
                                         cost difference there. And so how can you gain kind of the benefits of this, the illusion of
                                         
    
                                         this completely elastic hardware while still maintaining kind of the low cost of virtual
                                         
                                         machines or hardware resources that are dedicated to individual users? That was the kind of core
                                         
idea of Cackle. And it's actually very difficult to do this because all these components change.
                                         
                                         Workloads change.
                                         
                                         The costs of things can change.
                                         
                                         Startup times of virtual machines can change over time.
                                         
                                         Or even like end user preferences can change.
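
The cost trade-off described above can be sketched with a toy model. This is not Cackle's cost model: the demand trace, the prices, and the 5x price gap between the elastic pool and virtual machines are all assumptions chosen for illustration. Provisioned VM slots are paid for every second whether used or not, and any demand above them spills to the more expensive pool.

```python
def workload_cost(demand, vm_slots, vm_price, pool_price):
    """Cost of serving `demand` (task slots needed in each second).

    vm_slots   -- provisioned VM slots, billed every second, used or idle
    vm_price   -- hypothetical price per VM slot-second
    pool_price -- hypothetical (higher) price per elastic-pool slot-second
    """
    vm_cost = vm_slots * vm_price * len(demand)
    spilled = sum(max(0, d - vm_slots) for d in demand)
    return vm_cost + spilled * pool_price


# Mostly quiet demand with one large spike (made-up numbers).
demand = [2, 2, 3, 40, 2, 2, 3, 2, 2, 2]
costs = {
    slots: workload_cost(demand, slots, vm_price=1.0, pool_price=5.0)
    for slots in (0, 3, 40)
}
for slots, cost in costs.items():
    print(f"{slots:>2} VM slots -> cost {cost}")
```

In this toy trace, running everything on the pool and provisioning VMs for the peak are both more expensive than a small VM base that spills during the spike.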
                                         
So the goal of Cackle was to find a way of gaining the illusion of rapid elasticity while still relying on lower cost resources, and then spilling out into the elastic pools of resources to fill the gaps when you don't have enough hardware resources available.

Well, that's a lot of variables to try and tackle to do this effectively. So let's get into the details of Cackle and how you went about trying to solve this problem. Maybe start off with the key components of Cackle.
                                         
Sure. So if you're familiar with systems like Spark or Databricks Spark or Presto,
                                         
                                         in general, the way that you execute queries, especially these large analytics queries,
                                         
                                         is a user submits some SQL that gets transformed eventually into an execution plan, which is
                                         
                                         a directed acyclic graph of kind of stages of compute that are connected. And so you have a
                                         
                                         number of these stages, which are connected in a graph, and each of those stages will have a number
                                         
    
of tasks. And in between those stages, I should say, you have to exchange
data between compute resources. It's not
                                         
                                         embarrassingly parallel like some workloads are. You actually have to do this exchange.
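
The plan structure just described, stages connected in a directed acyclic graph with a number of tasks per stage, can be sketched as follows. The stage names and task counts are hypothetical, and the ordering uses Python's standard `graphlib` module (Python 3.9+) rather than anything specific to Cackle.

```python
from graphlib import TopologicalSorter

# Hypothetical plan for a two-table join followed by an aggregate.
# Each stage maps to (number of parallel tasks, upstream stages it reads from).
PLAN = {
    "scan_orders": (8, []),
    "scan_customers": (2, []),
    "join": (4, ["scan_orders", "scan_customers"]),
    "aggregate": (1, ["join"]),
}


def stage_order(plan):
    """Return stages in an order where every stage follows its inputs."""
    graph = {stage: set(inputs) for stage, (_, inputs) in plan.items()}
    return list(TopologicalSorter(graph).static_order())


order = stage_order(PLAN)
for stage in order:
    tasks, inputs = PLAN[stage]
    source = f"shuffled from {inputs}" if inputs else "read from base table"
    print(f"{stage}: {tasks} tasks, input {source}")
```

Every edge between stages is a point where data has to be exchanged, which is why the shuffle layer needs its own resources.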
                                         
So the execution part of Cackle is divided into two pieces. One is the compute side,
                                         
                                         where you break up the work that you have to do into manageable tasks that you can assign
                                         
                                         to hardware resources.
                                         
                                         And then for the communication parts, we have kind of shuffling mechanisms. And that's also a
                                         
                                         mix of both virtual machines and an elastic pool. So in this case, Amazon S3 is what we implemented
                                         
    
                                         on. So the kind of core components are a set of VMs that do compute, a set of VMs that are
                                         
                                         responsible for shuffling, and then elastic
                                         
pool alternatives for both of those. So in the actual implementation of Cackle, we use Amazon
                                         
                                         Lambda as our source of elasticity. And for the shuffling, we use Amazon S3. But above that,
                                         
                                         there's a controlling component. You can change the number of virtual machines over time. And
                                         
that's the kind of core thing that Cackle does. It's trying to figure out how many virtual machines do you need for compute?
                                         
                                         How many virtual machines do you need for shuffling over time? And that's the kind of
                                         
core mechanism of Cackle, deciding what the split of virtual machines versus elastic pools is.
                                         
    
And I want to add just one thing there. If you're familiar with systems like Redshift or Snowflake or Microsoft Fabric, in general, when you submit queries and you've submitted too many, queries can back up into a queue and slow a lot
of things down. Whereas in Cackle, we're making the assumption that any query that you submit
                                         
                                         needs to be run right now. You really care about the latency. So our goal is to never
                                         
                                         delay work, never delay the execution of your query. Instead, we're going to spill out into
                                         
these elastic components rather than restrict ourselves to only using virtual machines.
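
The "never delay, spill instead" behavior can be sketched as a tiny dispatcher. This is an illustration under simple assumptions, not Cackle's scheduler: the `Dispatcher` class, its slot counter, and the placement labels are all hypothetical. A task goes to a free VM slot if one exists and otherwise goes straight to the elastic pool, so nothing waits in a queue.

```python
class Dispatcher:
    """Place each task immediately: on a free VM slot if there is one,
    otherwise on the elastic pool. No task is ever queued."""

    def __init__(self, vm_slots):
        self.free_vm_slots = vm_slots

    def submit(self, task_id):
        if self.free_vm_slots > 0:
            self.free_vm_slots -= 1
            return (task_id, "vm")
        return (task_id, "elastic_pool")

    def finish_vm_task(self):
        """Call when a task running on a VM completes, freeing its slot."""
        self.free_vm_slots += 1


dispatcher = Dispatcher(vm_slots=2)
placements = [dispatcher.submit(task_id) for task_id in range(4)]
print(placements)
```

An admission-control system would hold the third and fourth tasks until a slot freed up; here they run right away at the higher elastic-pool price.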
                                         
Nice. Cool. Yeah. So those systems have this admission control sort of scheme, where you put things in a queue. I'd like to get into this a little later in the podcast, but there's a question I need to get out of my head now: if you allowed a user to assign priorities to the queries that are coming in and say, okay, this one's a lower priority, could that then change this whole mechanism of, okay, I don't need to schedule so much now?
                                         
                                         And that would be another variable to factor in as well, or is that really not something?
                                         
                                         I guess you're relaxing that constraint of this needs to execute now, which maybe makes things easier or harder.
                                         
                                         I'm not really too sure.
                                         
                                         So we don't cover this particular case of prioritization in Cackle, and we can come back to that a little bit later.
                                         
                                         But especially in the kind of future work sections, I have some thoughts on interesting problems in that space. But the reason that we defer this prioritization is the idea that
                                         
                                         even though hardware resources in the cloud are somewhat slow to start up, we're talking about
                                         
    
                                         queries in general that may last tens of seconds to a few minutes. So if you have to delay their
                                         
                                         execution until hardware resources become available, that can significantly impact query
                                         
                                         latency. So if you're not very latency
                                         
                                         sensitive, you can always wait for virtual machines to start up. So we don't cover those
                                         
                                         cases because you could just cover them by waiting for these resources to start up. That's not
                                         
                                         something we explicitly cover in Cackle, but I think it's an interesting direction for future work.

                                         Cool. Sure. So you said then that the key secret sauce of Cackle
                                         
                                         is deciding how many of these VMs and machines to provision for each of these different categories
                                         
    
                                         you've got for compute, shuffling, and for the elastic pool. So yeah, tell us about the different
                                         
                                         approaches you considered for solving this problem of allocation of resources then to these different
                                         
                                         categories. Sure. So I want to say like upfront that, you know,
                                         
                                         the first thing that you'd consider doing in these cases is just kind of predict what your workload
                                         
                                         is going to be ahead of time and try to ensure that hardware resources are available. But I want
                                         
                                         to come back to the kind of background that we talked about. And a lot of the queries that users
                                         
                                         will submit are, you know, there's a human at the other end and they may consume a ton of resources.
                                         
                                         So you might, and this, we actually observed this in real workloads that we gathered from
                                         
    
                                         Microsoft, Alibaba, and Google.
                                         
                                         Some of those are publicly available and some were just shared with us that the hardware
                                         
                                         resource demands can spike two, three times, four times in the span of seconds.
                                         
                                         So if these things are human generated at the end of the day, there's no predictive
                                         
                                         algorithm that's going to tell you, oh, there's going to be a, the human's going to hit enter in the next five seconds. Like, it's just not going to happen. So, you can kind of throw out the perfect predictor assumption right away. You can never predict what your needs are going to be. I mean, you can't predict exactly what your needs are going to be ahead of time. So that's the first thing
                                         
                                         that was kind of dismissed. So we tried a bunch of different approaches, but we found kind of
                                         
                                         surprisingly that because in Cackle we never delay work, you can measure essentially what your

                                         hardware resource demands were looking back into the past. You can figure
                                         
    
                                         out exactly what your needs were. And we use that to kind of inform what the best allocation
                                         
                                         strategies are going to be moving forward. So what I mean by an allocation strategy is you'll look
                                         
                                         back into this workload history, see what your hardware resource demands were, and then you're
                                         
                                         going to choose from a number of extremely simplistic strategies
                                         
                                         to find one that minimizes the cost. So in the case of Cackle, we're going to assume that the
                                         
                                         elastic pool resources that we have available and the virtual machine resources that we have
                                         
                                         available are going to be equal in performance, and we're only going to focus on cost. And we
                                         
                                         try to design the system when we actually go to the implementation
                                         
    
                                         such that that's as close to true as possible. So what we actually do in Cackle is we take
                                         
                                         strategies of the form, look back some number of minutes or seconds into the past,
                                         
                                         try to take the nth percentile, some percentile of that workload, and then a multiplier.
                                         
                                         And so the intuition there is that sometimes you want to over-provision. You don't want to only look like, what's the maximum
                                         
                                         I've used in the past? Sometimes you actually want to over-provision in these cases. So we find,
                                         
                                         and we just take whatever the minimum cost thing was over that period, and that's the strategy we
                                         
                                         choose moving forward. There is a delay in starting new virtual machines, and there's some restrictions
                                         
                                         on exactly how this operates.
                                         
    
                                         But essentially, the mechanism is figure out what strategy worked in the past and choose that moving forward.
                                         
                                         What's interesting about this is that even if you make a mistake, if you make a mistake and you didn't provision enough, this becomes a cost problem.
                                         
                                         It's not a performance problem because you can always spill out into this elastic pool of resources.
                                         
                                         We conceptualize it as something that's essentially infinitely elastic with a fixed cost.
                                         
                                         The only difference is how much is each individual strategy going to cost.
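Editor's note: the selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Cackle implementation. The names `Strategy`, `replay_cost`, and `best_strategy`, the per-step prices, and the fixed `startup_delay` are all assumptions made for the example, and details the episode mentions elsewhere (minimum billing times, variable VM startup, the shuffle layer) are left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical strategy: look back, take a percentile, apply a multiplier."""
    lookback: int      # number of past time steps to inspect
    percentile: float  # 0..100, percentile of demand within the window
    multiplier: float  # over- or under-provisioning factor

    def target_vms(self, history):
        window = sorted(history[-self.lookback:])
        if not window:
            return 0
        # Nearest-rank percentile of the observed demand.
        rank = max(1, math.ceil(self.percentile / 100 * len(window)))
        return math.ceil(window[rank - 1] * self.multiplier)


def replay_cost(strategy, demand, vm_price, elastic_price, startup_delay):
    """Cost of having followed `strategy` over a known demand history.

    Demand is never delayed: whatever the VMs cannot absorb spills into
    the elastic pool, so a bad choice shows up as cost, not latency.
    """
    cost = 0.0
    for t in range(len(demand)):
        # VMs usable at step t were requested `startup_delay` steps earlier,
        # using only the history visible at that time.
        decided_at = t - startup_delay
        vms = strategy.target_vms(demand[:decided_at]) if decided_at > 0 else 0
        spill = max(0, demand[t] - vms)
        cost += vms * vm_price + spill * elastic_price
    return cost


def best_strategy(strategies, demand, **env):
    """Pick the strategy that would have been cheapest over the history."""
    return min(strategies, key=lambda s: replay_cost(s, demand, **env))


if __name__ == "__main__":
    bag = [Strategy(lb, p, m)
           for lb in (5, 30) for p in (50, 90, 100) for m in (0.5, 1.0, 1.5)]
    history = [10, 12, 9, 40, 11, 10, 13, 38, 12, 10] * 5
    print(best_strategy(bag, history, vm_price=1.0, elastic_price=5.0,
                        startup_delay=2))
```

Because spilled demand is charged at the elastic price rather than queued, every strategy in the bag delivers the same latency in this model and differs only in its bill, which is the property that lets cost alone drive the choice.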
                                         
                                         So what I've described is kind of the simplistic version of what we do.
                                         
                                         In fact, we use some kind of fancy randomized algorithm, this multiplicative weights algorithm,
                                         
                                         which I probably don't want to talk about too much today. But essentially what it's doing is trying to find minimal cost
                                         
    
                                         strategies over this workload history. And that way we can minimize the cost of workload going
                                         
                                         forward. Yeah. There's a few kind of variables as you were talking that jump out. How do you
                                         
                                         decide how far to look back, for example? So basically we just have a bag of strategies.
                                         
                                         And so some strategies
                                         
                                         will look back one minute, some strategies will look back 10 minutes. And the intuition behind
                                         
                                         this is that virtual machines can take variable amounts of time to start up. So if you assume that
                                         
                                         virtual machines are going to start instantaneously, then you probably never have to use an elastic
                                         
                                         pool ever because they'll be available as soon as you request them. But the reality is they usually
                                         
    
                                         take a minute or they could take five minutes and they're like, exactly what that time is will impact
                                         
                                         how far back into the past you probably want to look. Like if it takes a day to get new virtual
                                         
                                         machines, then you probably want to decide based on looking back a week or so.
                                         
                                         And the kind of intuition behind Cackle is that no matter which of these factors change,

                                         no matter how your workload changes, or the costs change, or virtual machine startup times change, or

                                         cost models even, or minimum billing times, these kinds of things,

                                         Cackle will choose one of these very simplistic strategies that minimizes the cost over time.
                                         
                                         And I think potentially you can improve by adding to your bag more sophisticated strategies, predictive strategies, etc.
                                         
    
                                         But what was surprising about Cackle is that even with this kind of extremely simple bag of strategies, you can get pretty close to optimal.
                                         
                                         And I'm sure we'll come back to that in the results, but not optimal, but let's say what an oracle could do.
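Editor's note: Matt mentions that the real system uses a randomized multiplicative weights algorithm rather than a plain "pick the cheapest" rule, but the episode does not describe the exact variant. The sketch below is the generic textbook form of multiplicative weights, shown only to give a feel for the idea. The function names, the learning rate `eta`, and the cost normalisation are assumptions, not details from the paper.

```python
import random


def multiplicative_weights_step(weights, costs, eta=0.5):
    """Shrink each strategy's weight according to its normalised cost.

    `costs[i]` is what strategy i would have cost over the latest interval.
    Costs are scaled into [0, 1] by the worst cost, so with 0 < eta <= 1
    every weight stays non-negative.
    """
    worst = max(costs)
    if worst == 0:
        return list(weights)
    return [w * (1 - eta * c / worst) for w, c in zip(weights, costs)]


def sample_strategy(weights, rng=random):
    """Pick a strategy index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    weights = [1.0, 1.0, 1.0]
    # Hypothetical per-interval costs for three strategies in the bag.
    for interval_costs in ([3.0, 5.0, 9.0], [2.0, 6.0, 8.0], [4.0, 4.0, 9.0]):
        weights = multiplicative_weights_step(weights, interval_costs)
    print(weights, sample_strategy(weights))
```

Strategies that keep turning out expensive lose weight quickly and are rarely sampled, while a strategy that becomes cheap after a workload shift can regain probability over a few intervals.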
                                         
                                         Yeah, that's really cool.
                                         
                                         I mean, so obviously maybe this,
                                         
                                         it depends how fast you, how dynamic your workload is,
                                         
                                         because I guess you're looking back a couple of minutes,
                                         
                                         depending on the various strategy,
                                         
                                         and you've got this work coming in.
                                         
    
                                         How often is it the case that the one you've chosen
                                         
                                         is immediately sort of rendered kind of
                                         
                                         not the optimal choice essentially because the workloads just change so fast and you is there
                                         
                                         a limit to sort of how fast your workloads because you're always looking back to make
                                         
                                         a judgment of the future right and the future can you get these black swan events whatever right
                                         
                                         how does that factor into it as well like is it just or can or is it i guess just most of the
                                         
                                         time it is close enough to be
                                         
                                         an optimal over the long term, it works out quite well.
                                         
    
                                         So we tried with a number of different workloads.
                                         
                                         We make an assumption that the workload is more or less consistent, but

                                         you could imagine a situation where, you know, you're wildly under-provisioning for

                                         the workload that you have.

                                         There's some big event that changes and suddenly you've under-provisioned. That's going to start costing a lot of money because you're going to
                                         
                                         spend money on these elastic pool resources that you would have been better served by having a
                                         
                                         virtual machine. So quickly, a strategy that matches your new workload is going to become
                                         
                                         the least expensive among those strategies. And
                                         
    
                                         you'll switch to provisioning for this new world. That being said, in this paper, we didn't
                                         
                                         investigate kind of these rapid workload shifts that much. I think there's probably some work
                                         
                                         that needs to be done to ensure that for all of these kinds of changing situations, you improve.
                                         
                                         But what was surprising is that for a wide range of workloads, even these very simplistic strategies worked. Yeah, cool. It's beauty in simplicity, Matt.
                                         
                                         So yeah, cool. So let's talk about evaluation in some more detail then. So first of all,
                                         
                                         how did you go about evaluating all these different strategies and everything? What
                                         
                                         was your experimental setup? Sure. So we really broke the evaluation up into two separate parts. I mean, the kind of
                                         
                                         conditions that exist in real cloud environments don't change that rapidly. Like,

    
                                         the cost of, say, a spot instance virtual machine on AWS could change
                                         
                                         over the course of a month or so, or even a couple of days. But you don't have control over
                                         
                                         these things. So the first thing that we did is we built an analytical model. It's just something
                                         
                                         basically we wrote in Python that tries to model all of the important components of the system.
                                         
                                         And we took kind of an off-the-shelf set of queries, TPC-H queries, and also some TPC-DS
                                         
                                         queries. So these are standard analytical database workload benchmark
                                         
                                         queries. And then we changed, say, the arrival rates of these queries. So we wanted to model
                                         
                                         what real workloads look like. So real workloads have kind of these unpredictable spikes, as we
                                         
    
                                         described, as well as kind of regular components where every single day or maybe every single hour,
                                         
                                         certain queries will arrive. So we tried to vary the workloads as much as possible, changing the period of these peaks,
                                         
                                         changing how much randomness, changing how many queries were in there. And then we also changed
                                         
                                         the environment in which we assumed that these things were executing. So changing the startup
                                         
                                         time of virtual machines, the cost of virtual machines relative to using an elastic pool, and all sorts of these factors. I don't think all of those experiments made it to the paper. But in general, we tried to change as many things about the workload and the environment as possible to ensure that this Cackle strategy was robust. You know, you can't cover every single situation. And of course there are always edge cases and there's always kind of adversarial cases where
                                         
                                         you're not going to do the best. But in general, we try to make sure we cover as wide a range of
                                         
                                         reasonable scenarios as possible. Cool. Yeah. And what were the results from the analytical
                                         
                                         model? Did they kind of give you good hope that this is the right thing to be doing before
                                         
    
                                         then going and actually implementing it for real, right? I guess that was the, it was a litmus test
                                         
                                         in that sense, the analytical model.
                                         
                                         But yeah, what were the results from it?
                                         
                                         So the results essentially were that among the set of strategies that we compared against,
                                         
                                         so we compared against some, you know, kind of relatively simplistic strategies, like
                                         
                                         a fixed provisioning, which is probably what you would do if you're, if you're deciding
                                         
                                         how many virtual machines to run, you'll just kind of fix a number of virtual machines and
                                         
                                         continue that over time. We tried like kind of predictive,
                                         
    
                                         simple predictive models, just like kind of linear fit to the last few minutes to try to show how
                                         
                                         predictiveness doesn't work particularly well. And what I want to convey is that if you take
                                         
                                         into account the important components, if you take into account workload shifts, the cost of these resources, which is one thing that is not often considered, like what is the
                                         
                                         cost differential of doing these things, then you probably will end up at a reasonably good strategy.
                                         
                                         I think there are improvements to Cackle that could be made that would push us a little bit

                                         cheaper. But overall, compared with reasonable baselines, we found that Cackle was the best performing among those baselines. And furthermore, we also added an oracle. So an oracle which has

                                         perfect knowledge of the workload into the future and can allocate hardware resources

                                         with future knowledge when demand spikes. So Cackle was pretty close in a wide range of these scenarios to this oracle. So that's kind
                                         
    
                                         of why we stopped the optimization there. We didn't want to kind of polish too hard.
                                         
                                         So did you compare against real systems, Matt? How did Cackle fare against what's out there at
                                         
                                         the moment? Yeah. So it turns out that we're not the first people to notice that analytical
                                         
                                         workloads change over time. And the kind of the main commercial options in the space, people like,
                                         
                                         you know, Snowflake, Databricks, and Redshift, we, you know, we compared against these kinds
                                         
                                         of systems. And essentially what all of those systems did at the time of our evaluation was
                                         
                                         they'll kind of wait until queries back up into a queue and then they'll provision
                                         
                                         you additional clusters of compute in addition to that. So you kind of get this kind of stepwise
                                         
    
                                         increase and then decrease in those systems. But the issue with that is because they wait until
                                         
                                         queries queue, the latency during those periods of queries can spike really rapidly. So what we
                                         
                                         find when we compare against these real systems is that, well,
                                         
                                         these commercial systems have the property that they're either really expensive because they'll
                                         
                                         be over-provisioning at times when they don't need to be, or when workload spikes, they tend to
                                         
                                         have very high latency, whereas Cackle is able to maintain a really consistent performance and cost
                                         
                                         across a range of these scenarios,
                                         
                                         especially as you increase your workload. So it doesn't suffer the problem of missing these
                                         
    
                                         demand spikes because you can fill in these gaps. And furthermore, it doesn't keep a bunch of
                                         
                                         resources around when you don't actually need them. It tries to make a kind of reasonable
                                         
                                         provisioning decision. How likely do you think it is that Cackle could be integrated into one of these existing commercial systems?
                                         
                                         You know, all of these, the big providers in this space, people like Snowflake, Microsoft,
                                         
                                         Amazon, Databricks, you know, they have hardware resources that are sitting around because they
                                         
                                         want to provide better experiences for their end users. So like, if you go to Snowflake today,
                                         
                                         and you hit start a cluster, that cluster will be available to you immediately, more or less. And it's not because,
                                         
                                         you know, virtual machines start up instantly for Snowflake, where they don't for you, it's that
                                         
    
                                         they keep a warm set of instances running so that they can hand them to you when this demand spikes.
                                         
                                         So Cackle is a question of how you go about using these multi-tenant resources versus the kind

                                         of fixed, dedicated end-user resources. Like, how do you decide how much is dedicated to an
                                         
                                         individual user versus this kind of multi-tenant elastic compute that's sitting on the side? So
                                         
                                         each of them actually has these elastic pools of compute, but it's a question of how
                                         
                                         they are actually used to execute workloads. Yeah. So, as long as you can assign

                                         dollars to these things... I think, you know, you probably don't want to just plop a research system into a

    
                                         commercial system and kind of hope for the best. I have no doubt that there are additional problems

                                         that need solving, but I do think that this is a solution that might be adapted into some of these

                                         commercial systems. I mean, back yourself, let's get it out

                                         there in the wild. I'm sure it'll be just fine. Cool. Yeah. So we can talk about the implementation
                                         
                                         a little bit more there. How easy was it to implement this in terms of implementation effort?
                                         
                                         Was it quite an arduous task to sort of get this working from scratch?
                                         
                                         So we did have the benefit of, so we published this paper in 2020 called Starling.

                                         And so Starling, again, was an analytical query engine that was built exclusively on elastic resources.
                                         
    
                                         So Amazon Lambda and Amazon S3.
                                         
                                         And the goal was to never provision anything in that system, like just leave it to these elastic pools.
                                         
                                         And so we had the benefit of the execution engine kind of already being around.
                                         
                                         But the core implementation parts that were difficult were, you know, you have to interact

                                         with these cloud systems, you have to actually start up virtual machines. And we had to have this

                                         dedicated set of shuffling resources to significantly reduce costs, so there was an

                                         implementation effort in this kind of shuffling layer. The execution engine was largely unchanged from

                                         Starling, although we did add a few features. And then the real core of the work is the

                                         controlling component, the primary component that decides how many VMs to start,

                                         keeps track of which ones are alive, and assigns tasks to individual workers, either on the

                                         virtual machine side or on the elastic pool side, et cetera.

                                         How long did it take you? Obviously, you had this existing code base you could build
                                         
                                         on top of and refactor a little bit to get here. But yeah, how long was the implementation,
                                         
                                         out of interest, Matt?

                                         I mean, I think that once it was clear exactly what we needed to do,
                                         
                                         the implementation went reasonably fast, like a few months. But after the

                                         Starling paper, we went and tried several different things. And there are a lot of things that are
                                         
                                         interesting in elasticity. And we eventually chose one problem. And that's the thing that we pushed
                                         
    
                                         out. So once it was clear exactly what the solution was, it was relatively easy to get that going.
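To make the controlling component's core decision concrete (how many VMs to start versus how much work to send to the elastic pool), here is a minimal Python sketch of a cost model. It is an illustration under assumed numbers, not Cackle's actual algorithm: the function names, the per-step prices, and the demand trace are all hypothetical, and it ignores VM startup delay.

```python
from typing import Sequence


def hybrid_cost(demand: Sequence[int], provisioned_vms: int,
                vm_price: float, fn_price: float) -> float:
    """Total cost of serving a demand trace with a fixed VM pool plus overflow.

    demand[i] is the number of task slots needed in time step i (hypothetical unit).
    VMs are paid for in every step whether used or not; any demand above the
    VM count spills to cloud functions, which are paid per slot actually used.
    """
    vm_cost = provisioned_vms * vm_price * len(demand)
    overflow_slots = sum(max(0, d - provisioned_vms) for d in demand)
    return vm_cost + overflow_slots * fn_price


def best_vm_count(demand: Sequence[int], vm_price: float, fn_price: float) -> int:
    """Brute-force the VM count that minimizes hybrid_cost for this trace."""
    candidates = range(0, max(demand, default=0) + 1)
    return min(candidates, key=lambda n: hybrid_cost(demand, n, vm_price, fn_price))


if __name__ == "__main__":
    # Assumed trace: steady demand of 2 slots with one spike to 10.
    trace = [2, 2, 2, 10, 2, 2]
    # Assumed prices: a function slot costs 5x a VM slot per time step.
    n = best_vm_count(trace, vm_price=1.0, fn_price=5.0)
    print(n, hybrid_cost(trace, n, vm_price=1.0, fn_price=5.0))
```

With these assumed numbers, provisioning for the steady load and letting the spike spill to functions is cheaper than either provisioning for the peak or running everything on functions.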
                                         
                                         But the exploration of solutions took a really long time. So I think what's helpful is to build analytical models of everything you know about the world

                                         so you can try out different solutions, very simple solutions even, to see how they perform, because

                                         if you jump straight to the implementation, it can take a lot longer.

                                         Yeah, you can spend a long time

                                         polishing a brick, right? That's the thing. If you haven't had that initial sort of

                                         validation of, okay, this looks plausible, then you don't want to spend two years building something

                                         that's going to turn out to perform terribly, right? Which you could have figured out

                                         straight away by putting together an analytical model. Cool. Yeah, so I don't know if we

                                         actually went into future directions. I think we kind of put the brakes on and changed things, so can
                                         
                                         we maybe talk a little bit more about future directions? And also as well, the name Cackle.
                                         
                                         I always like to know the naming of things and why they're named what they are.
                                         
                                         So yeah, what is Cackle and why was that the name that was chosen?
                                         
                                         So I have to go back to the Starling paper. I think Starling is probably my favorite

                                         name for a system.

                                         The idea is, if people aren't familiar, starlings are

                                         these little birds, and they fly in these giant murmurations that

                                         form these kinds of beautiful flowing things.

                                         And so you can make these big things happen with lots of these little

                                         components. And that was the idea of Starling.
                                         
                                         Cackle was a placeholder name that I couldn't change after submission.
                                         
                                         No way, really?

                                         But I chose Cackle for a very good reason, which is that I

                                         love hard K sounds. I just think the hard consonant sounds are very compelling.

                                         There's something about this that's very fun. So Cackle has two hard consonant sounds.

                                         That's really the only reason.

                                         Okay, that's cool. I mean, because I thought maybe, when you

                                         were saying that, obviously talking about Starling, I thought maybe the collective noun for a

                                         flock of starlings is a cackle, maybe. But then no.

                                         Yeah, but it's just a placeholder name. We

                                         can pretend, yeah, we can pretend that that's the truth. But really, yeah, I wanted to

                                         choose a more elegant name, but I really just like the hard consonants. That's why it became the

                                         placeholder name.

                                         Cool. And yeah, so future research

                                         directions then, Matt. Where do you folks go next with Cackle?

                                         So I should say

                                         that for me, this kind of journey is, I mean, there's one more work in the pipeline

                                         which is not directly related to these prior works. But I should say that I'm finishing up

                                         my PhD. I'm actually defending a week from recording. And
                                         
                                         then afterwards, I'm headed to Microsoft Research. So I don't know what the journey holds for me
                                         
                                         there. I might be working on similar problems. I have no idea. But in terms of future interesting
                                         
                                         directions in this kind of elasticity space, I think both Starling and Cackle make the assumption
                                         
                                         that whenever you read base table data,
                                         
                                         that you're going to pull it from cloud object storage, which to be fair is what a lot of these
                                         
    
                                         systems do. At the end of the day, all of this data is stored in cloud object storage,
                                         
                                         whether it's visible to the end user or not. Systems like Redshift, internally,
                                         
                                         they're storing things in cloud object storage because then you can easily do elasticity. But what's missing from some of this
                                         
                                         work is revisiting caching or revisiting buffer pool management in the context of these elastic
                                         
                                         systems. So I mean, everyone knows that caching is efficient. And if you have repeated queries
                                         
                                         on the same data, it's probably going to be efficient to keep copies of that data on these
                                         
                                         instances. Interestingly,
                                         
                                         sometimes these things are compute bound and not IO bound. So in some cases, it doesn't end up
                                         
    
                                         mattering that much, but in a lot of cases, it really does. And

                                         it's difficult to balance elasticity with caching because, you know,
                                         
                                         these resources may disappear underneath you. So figuring out how to do that
                                         
                                         well and to meet end user preferences is pretty important. So sometimes users really care about
                                         
                                         being low cost. Sometimes they really care about being performant. And what's really interesting
                                         
                                         in these elastic scenarios is it exposes the cost of these things in a way that I think
                                         
                                         provisioned systems typically do not. If you provision X number of virtual machines,

                                         you have a fixed amount of memory to work with

                                         and you want to make the best use of that memory all the time.

                                         But if you're not executing anything
                                         
                                         and you have to pay for that memory to be available to you,
                                         
                                         then exactly what is the value of keeping that around
                                         
                                         is gonna depend on what the end user actually wants.
                                         
                                         So how you manage those kinds of questions, end user preferences, performance, and cost,
                                         
                                         in these elastic scenarios is a direction that I think is very interesting. And these commercial
                                         
    
                                         systems are working towards trying to do some of these things, but I'm not sure they're fully
                                         
                                         there yet. So I think research, academic research can do a lot in this direction.
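The trade-off described above, paying for idle memory to keep data cached versus re-reading it from cloud object storage later, can be sketched as a simple break-even check. This is a minimal illustration, not something from Starling or Cackle: the function name, every parameter, and the example prices are assumptions, and the `latency_weight` knob is a hypothetical stand-in for how much the end user values performance over cost.

```python
def keep_cache_warm(idle_seconds: float,
                    cached_gb: float,
                    mem_price_per_gb_second: float,
                    refetch_cost: float,
                    latency_weight: float = 0.0,
                    refetch_seconds: float = 0.0) -> bool:
    """Return True if holding the cache through an idle period is worthwhile.

    holding_cost: what we pay to keep cached_gb in memory while nothing runs.
    refetch_penalty: the monetary cost of re-reading the data from object
    storage, plus an optional user-specific price on the extra latency
    (latency_weight is in cost units per second of delay).
    """
    holding_cost = idle_seconds * cached_gb * mem_price_per_gb_second
    refetch_penalty = refetch_cost + latency_weight * refetch_seconds
    return holding_cost < refetch_penalty


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    print(keep_cache_warm(60, 10, 0.0001, 0.10))    # short idle gap
    print(keep_cache_warm(600, 10, 0.0001, 0.10))   # long idle gap, cost-focused user
    print(keep_cache_warm(600, 10, 0.0001, 0.10,
                          latency_weight=0.2, refetch_seconds=5))  # latency-focused user
```

The same idle gap can give a different answer for a cost-sensitive user and a latency-sensitive one, which is the sense in which the value of keeping memory around depends on what the end user wants.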
                                         
                                         Yeah.
                                         
                                         First thing, good luck for next week.
                                         
                                         I'm sure you're going to absolutely be fine.
                                         
                                         You're going to ace it.
                                         
                                         It'll be great, I'm sure.
                                         
                                         And congratulations on the position at Microsoft Research as well.
                                         
    
                                         That'll be fantastic.
                                         
                                         And you never know.
                                         
                                         You could be working on analytical databases there as well.
                                         
                                         And you might be able to do some really cool stuff on kind of factoring in
                                         
                                         end user preferences as well and things like that. That's fantastic. Yeah, cool. So I guess let's talk about the impact of Cackle, then.

                                         Has there been any sort of impact in the short term that you've seen with Cackle in terms of

                                         interactions with industry? And maybe also thinking longer term, what sort of

                                         impact would you like Cackle to have if we revisit this paper in 10 years' time, for example?

                                         I mean, I think what I would like to happen is for these

                                         systems to simply be more elastic to their end users while trying to meet their needs. So

                                         one core idea of Cackle is that the user is probably pretty bad at estimating what

                                         their hardware resource requirements are.

                                         And Cackle is very workload-driven in the sense that whatever their hardware resource requirements
                                         
                                         are, you'll try to provision for them. And I think the current way that systems like Amazon
                                         
                                         Redshift and Databricks and Snowflake and the like, Microsoft and Google, they all have their
                                         
                                         own things. But figuring out how to make elasticity much easier to use, and removing

                                         the sense that, you know, you have this fixed set of resources that are kind of consistently

                                         available to you and you can almost never spill out beyond them.
                                         
                                         Like occasionally you throw in a large query that suddenly needs a lot more hardware resources
                                         
                                         and you know, it may not go as fast as you actually want.
                                         
                                         So figuring out how to integrate
                                         
                                         this elasticity, given that we're in the cloud, is something that I think these systems could
                                         
                                         take away. As far as what developers or data engineers could leverage in the findings of my
                                         
                                         research, I think that for people who have used these systems before, people have complaints.
                                         
                                         They're hard to use at the end of the day, and the more you complain to

                                         the providers about the difficulty of using these systems, I think the easier things

                                         are going to be moving forward. So this work is not something that's probably easily usable by

                                         an end user, but the more people understand what is possible in these systems, the more they can push

                                         the providers to make some

                                         of these adjustments. And I do think, to be fair, the providers of these systems understand
                                         
                                         that things could be better, but they're slow-moving behemoths. And the more they're pushed,
                                         
                                         the more they're likely to try to make these systems easier for end users to use.
                                         
                                         So if you have problems with provisioning your analytical database, start

                                         complaining. That's my message.

                                         Yeah, you hit the nail on the head there with that sort of

                                         usability angle in systems. It's such an underappreciated quality, especially maybe

                                         sometimes from the academic perspective, where we focus more on: let's make this go faster, let's get

                                         my throughput a bit higher, latency a little bit lower. I mean, that's all well and good, but

                                         that dimension of usability is really, really important. People

                                         who can crack that whilst also getting the performance there as well, and the low

                                         cost in this case and whatnot, that's the secret combination to success,

                                         I think.

                                         But yeah, cool. So the next question: now you're coming towards the end

                                         of your PhD, Matt, we can maybe do a little bit of a reflection on this project, and maybe your

                                         PhD as a whole as well. What's been the most interesting lesson

                                         you've learned while working on Cackle? And then the same question over the journey of your
                                         
                                         PhD.

                                         Yeah, I mean, I think what I've learned through this whole Cackle project is the value of

                                         de-risking things early and making sure you understand all of the relevant

                                         components. So after the Starling paper, as I said, there were 10 different directions

                                         we could go in, like cost versus performance, delaying work. We tried a lot of these things.
                                         
    
                                         And we also tried to implement some of these ideas directly in a system.
                                         
                                         And it turned out that just getting that system up and running on the elastic pool was slow enough that it defeated all of the benefits of doing so.

                                         So that was kind of a wasted several months. So, you know, de-risking, cutting off a

                                         piece that you can actually digest rather than trying to bite off more than you can

                                         chew, de-risking things early, talking to smart people. You've got to talk to those people

                                         early.

                                         And then as far as the PhD, a lot of people believe things are impossible,

                                         and they may have good reasons for thinking,
                                         
    
                                         you know, this is not the right way of going about things.
                                         
                                         So I don't want to throw too much shade on other research groups.
                                         
                                         But there was a kind of popular paper in a research conference that suggested that using
                                         
                                         cloud function services or these functions as a service, things like Amazon Lambda, was
                                         
                                         not a promising direction going forward.
                                         
                                         And at a high level, I agree with the sentiment
                                         
                                         that you probably don't want to rely on these third-party providers
                                         
                                         to do all of this provisioning for you.
                                         
    
                                         But I think the lesson that they missed by ignoring these
                                         
                                         was how can you gain the benefits of elasticity
                                         
                                         and can we start exploring it as...
                                         
                                         These things are sources of elasticity that we can use.
                                         
                                         And there are still kind of interesting research things that you can do, even though that's not the right solution at the end of the day.
                                         
                                         So we started this project, the Starling project, this kind of whole serverless research direction on the assumption that it was a terrible idea.
                                         
                                         And in some respects, like, I don't think you should drop Starling into your, you know,
                                         
                                         middle-sized enterprise and start using it. But there are research ideas that come out of that,
                                         
    
                                         that are valuable. And yeah, those are lessons learned, like listen to the smart people, but
                                         
                                         don't, you know, smart people haven't thought about every problem in the world. So sometimes
                                         
                                         if you think something's interesting, just push a little bit on it and see if there's anything there.

                                         Yeah, I think that's really nice advice. I like what

                                         you're saying about de-risking, because I had this thing, I still do to this day: I find it very

                                         hard to stay focused on one task. It's the new shiny thing, right? So I can go do this, can go do

                                         that. But I like the fact that you had that sort of awareness

                                         early on: okay, we need to make things tractable, bite off exactly what you can chew,

                                         and don't have eyes bigger than your stomach. So, you know, I really, really like that.

                                         It's very easy to get distracted by alternative interesting problems or to try to think you're
                                         
                                         going to solve everything in one paper. But a lot of times,
                                         
                                         you know, you just can't do that. And it's, it's hard to make progress in these cases. So,
                                         
                                         you know, de-risk what you can, push forward. And, you know, despite Cackle not solving

                                         every single problem in the world, I think there are valuable research

                                         lessons there. And it was a hard-fought lesson, I will say.

                                         Yeah, cool.

                                         So my next question is normally around the origin story of the paper, and maybe

                                         we've covered this already, but it'd be nice to, I guess, go back to the original

                                         proposal for this line of work. How did that evolve

                                         originally? Was it because you saw this paper from this other group and were like, hang on a minute, we

                                         can take that assumption but do something cool? Or how did the paper and

                                         this line of research actually evolve in the first place?

                                         Yeah, so for Starling it's pretty

                                         straightforward. We had seen some, you know, research in the community, not really in the database community, but in systems and networking kind of places, which would employ cloud function services, specifically Amazon Lambda, for kind of embarrassingly parallel or pretty close to embarrassingly parallel work where you can break things up into chunks.

                                         And so it's great for these kind of bursty scenarios.
                                         
                                         But for analytical databases, you have more sophisticated communication patterns.
                                         
                                         So it wasn't totally clear that that was a good direction.
                                         
    
                                         So

                                         I was intrigued by this elasticity

                                         of functions as a service.
                                         
                                         And the kind of crazy idea was,
                                         
                                         could we build an analytical database
                                         
                                         on top of these things?
                                         
                                         And the first thing that we did was

                                         try to run the simplest

                                         queries that we could, to find exactly where the problem spots were. So within, I don't know,
                                         
                                         a few weeks, it was clear that there was something there. We originally assumed it was a terrible
                                         
                                         idea. But again, we had the de-risking early. I should have learned it that day. But it took
                                         
                                         longer than that. But in any case, that's really the origin story. We thought it was a terrible idea. And it turns out that by polishing and getting the engineering right, there are benefits to
                                         
                                         pursuing this as a strategy. And again, if there's people at Snowflake, Databricks,
                                         
                                         or Amazon listening to this, I don't suggest that you start building on Amazon Lambda exactly. But
                                         
                                         the lessons of how to gain... What are the benefits of elasticity to end users was the main thing there.
                                         
                                         Yeah.
                                         
    
                                         I mean, I think this is kind of an extension of that work, which broadens it out to a wider range of workloads.
                                         
                                         Like, the kind of Cackle paper backstory was we knew at the end of Starling that for ad hoc analytics, Starling was great.
                                         
                                         Like, it kind of fits this niche perfectly where you're just an end user.
                                         
                                         You don't want to care about the resources ahead of time.
                                         
                                         You just want to submit queries and get results
                                         
                                         and you don't want to care about how much,
                                         
                                         what the underlying hardware is.
                                         
                                         I think there's been progress made since then.
                                         
    
                                         But at the end of that paper,
                                         
                                         we knew that for a wider range of workloads,
                                         
                                         it was going to be cost prohibitive to do this.
                                         
                                         So that's really the origin of this Cackle work where we started thinking about delaying work, the Pareto frontier of cost
                                         
                                         versus latency of queries, looking at what real query workloads look like. That's kind of where
                                         
                                         Cackle came from. Exactly which problem in that space we chose to bite off took a while, but
                                         
                                         that's definitely... We wanted to focus on something in this elasticity workload
                                         
                                         management and hardware provisioning space.
                                         
                                         Nice. It's always great to hear how these things, these papers, come about, because I just arrive at it and it's at this beautiful end where it's there,
                                         
                                         published in nice conference proceedings. And yeah, it's always nice to see how the path that led to this came about in the first place.
                                         
                                         I guess tangentially, tangentially, can't speak today,
                                         
                                         tangentially related, I'm going to go for closely related
                                         
                                         because I can't seem to say that word today,
                                         
                                         is the idea of being creative and generating ideas
                                         
                                         and then once you've generated those ideas,
                                         
                                         not going off in 10 different directions, right?
                                         
    
                                         So like actually selecting the one to work on
                                         
                                         and do some de-risking. How do you approach that process of idea generation and then selecting projects, Matt? Do you have, like, a systematic way of doing it,
                                         
                                         or is it more ad hoc?
                                         
                                         I mean, I think, you know, the number of opportunities you have to choose projects is somewhat limited. Like, if you dedicate yourself to a project, it's going to take,
                                         
                                         you know, it's going to take months minimum.
                                         
                                         So the number of chances you actually have to do this
                                         
                                         is somewhat limited.
                                         
    
                                         So the best advice I can give is,
                                         
                                         if you can, surround yourself with people
                                         
                                         who are a lot smarter than you
                                         
                                         or know things about the space that you may not
                                         
                                         and that are open to new ideas.
                                         
                                         So if an idea pops into your head,
                                         
                                         you should express it.
                                         
                                         And maybe early in your PhD or early in your career, you may not know that this idea is
                                         
    
                                         good or bad.
                                         
                                         And having people who can honestly and openly evaluate that idea that are around you is
                                         
                                         valuable.
                                         
                                         Or at least like, oh, people did this 10 years ago.
                                         
                                         It didn't work for X, Y, Z reasons.
                                         
                                         And then you can say, well, did something change in the last 10 years? Or do those lessons still hold? So I
                                         
                                         think, personally, I think looking at kind of the frontier, what's missing in this
                                         
                                         kind of commercial space, like what are people wanting to do that they can't do now, is helpful.
                                         
    
                                         So I like looking at kind of new technologies. I like looking at things that
                                         
                                         people are doing that are kind of a pain to do today, but maybe there's not like great technical
                                         
                                         reasons that they can't do it. That's just like people haven't worked on the engineering of this
                                         
                                         problem space enough. So that's the best thing I can think of. Like surround yourself with ideas,
                                         
                                         try to come up with a handful of your own, and then have a few people around you who are
                                         
                                         decent at evaluating those ideas and gather opinions from a lot of them, but are also open to being wrong about stuff sometimes.
                                         
                                         Acknowledging when, you know, you don't know something, that's the most valuable thing
                                         
                                         I've found in people. Like, people have very strong but ill-informed opinions,
                                         
    
                                         or, like, they can't express why they think an idea is bad. Like, you want to surround
                                         
                                         yourself with, like, open-minded,
                                         
                                         creative people who can surround your ideas.
                                         
                                         And you know,
                                         
                                         this is like time tested tradition.
                                         
                                         Like, if you
                                         
                                         find yourself to be the smartest person in the room,
                                         
                                         you got to find a new room.
                                         
    
                                         So I tried to do that in my career, and I hope that I can do the same moving
                                         
                                         forward.
                                         
                                         Yeah.
                                         
                                         I like to say as well,
                                         
                                         that kind of,
                                         
                                         well,
                                         
                                         this idea of being surrounded by smart people, because, especially early on in your career, right, it takes
                                         
                                         a lot of time to sort of develop that intuition of being able to evaluate whether an idea is good
                                         
                                         or bad, right? Like, you aren't born with that. That's something you accumulate through
                                         
                                         experience and through conversation, through discussion, through trying and failing, right?
                                         
                                         And so yeah, definitely having people around you can help you make that decision and hone
                                         
                                         that sort of skill of being like, yeah, this is good, this is bad. And yeah, sure,
                                         
                                         if you're the smartest person in the room, you're in the wrong room, right? It's a great quote. Yeah.
                                         
                                         But yeah, I mean, and furthermore, like, aside from the technical stuff, you know, I'm at the end of
                                         
                                         my PhD now. You've got to find places where you want to be with the people too. Like, you know, technical
                                         
                                         prowess is one thing, you know, being around smart people is one thing, but if, you know,
                                         
                                         those smart people are not very kind, then it's not going to be a very fun time. So if you're,
                                         
                                         you know, if some of you are listening and you're an early-career PhD, or you're deciding where to do
                                         
                                         your PhD, like, I can't emphasize enough how being around nice people, understanding people,
                                         
                                         but people who will still push you, is
                                         
                                         very valuable. And yeah, I mean, I've been very blessed in my career to be surrounded
                                         
                                         by those people. So try to find those rooms. That's my best advice.
                                         
                                         Yeah, I mean, that's absolutely great
                                         
                                         advice. Lovely. Cool. So yeah, we're at the time for the last word. I need some
                                         
                                         theme music for this, like, this last word, like, I don't know, ba-dum-tss.
                                         
    
                                         Cool, but anyway.
                                         
                                         So yeah, what's the one thing you want the listener
                                         
                                         to take away from this podcast today?
                                         
                                         That is a hard thing to answer.
                                         
                                         I think I want people to be dissatisfied.
                                         
                                         You know, it's not as though you're going to come up
                                         
                                         with the best solution tomorrow,
                                         
                                         but be dissatisfied with the world and try to find interesting fun solutions and even simple solutions
                                         
    
                                         like the best solution out there is not the most complex one it's the one that everyone will adopt
                                         
                                         yeah love that great message to end on matt thank you so much for coming on the pod today
                                         
                                         it's been actually a great chat and i'm sure the listener will have will have loved it as well and
                                         
                                         we'll drop links to all the things we've chatted about, all the papers and whatnot in the show notes as well.
                                         
                                         And yeah, we'll see you all next time
                                         
                                         for some more awesome computer science research. Bye.
                                         
