The Data Stack Show - 135: Database Knob Tuning and AI with Andy Pavlo and Dana Van Aken of OtterTune
Episode Date: April 19, 2023
Highlights from this week's conversation include:
Origins of OtterTune (4:43)
The problem of knob tuning (6:25)
Roles of machine learning (9:32)
OtterTune's development and industry recognition (12:03...)
The challenges of database tuning and the role of human expertise (16:15)
Tuning in production (20:23)
Observability and Data Collection (23:37)
Data Security and Privacy (29:59)
Optimizing on-prem vs. cloud workloads (35:52)
Performance benchmarks (40:20)
Future opportunities OtterTune is focusing on (43:55)
Importance of automated tuning services (50:45)
Challenges in Benchmarking Real Workloads (58:43)
The Story Behind the Name OtterTune (1:08:58)
Balancing Technology and Human Factors (1:13:23)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
So what's up man?
Cooling man.
Chilling, chilling.
Yo, you know I had to call.
You know why, right?
Why?
Because, yo, I never ever called and asked you to play something, right?
Yeah.
You know what I want to hear, right?
What you want to hear?
I want to hear that Wu-Tang joint.
Wu-Tang again?
Ah, yeah, again and again.
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
We have a huge treat for you. Andy Pavlo and Dana Van Aken of OtterTune
are going to be on the show.
We're going to talk about database tuning and optimization.
But what a privilege, Kostas,
to have someone like Andy Pavlo
on the show.
Yeah, absolutely.
I mean, first of all, it's always interesting to have people from academia, especially
when we're talking about data systems.
And Andy has done amazing work in research and his team and his students over there.
But it's also even more interesting when you see people from academia start companies, right?
So taking whatever the research is and turning it into a product and a business, it's always super, super
fascinating. So I think we are going to enjoy the episode a lot. We have many things to
talk about, from what makes a database system and what knobs exist out there, to what it means
to go from teaching a class on database systems and doing a PhD, like Dana was doing,
to starting a business that sells software
to companies that run databases.
I agree.
And if we're lucky and we have enough time,
I want to hear where the name OtterTune came from.
So I'm going to try and sneak that in if we can.
Yeah, let's do it.
Let's do it.
Dana, Andy, welcome to the Data Stack Show.
Such a privilege to have you on.
And we cannot wait to talk about all things databases and OtterTune.
Thanks for having us.
Yeah, thank you.
All right.
Well, let's start where we always do.
Could y'all give just a brief background and then we'll dig into a bunch of
OtterTune stuff. Andy, do you want to start? Right. So my name is Andy Pavlo. I'm an associate
professor with indefinite tenure in the computer science department at Carnegie Mellon University.
I've been here since 2013. I did my PhD at Brown University with Stan Zdonik and
Mike Stonebraker. And I am the CEO and co-founder of OtterTune, which is a research project we
started at CMU that was Dana's PhD dissertation, which I'll talk about next. Yeah, I'm Dana Van
Aken. I'm also a co-founder and CTO of OtterTune. As Andy just mentioned, I started down the path of, you know, working on database tuning and machine learning methods
in 2014, when I started my PhD at Carnegie Mellon University, advised by Andy.
That's where the academic version of the product originated, and now I'm working on it commercially.
Very cool. Well, Andy, I have to ask this. Studying under the likes of Michael Stonebraker is really incredible. Is there any
anecdote or maybe a lesson you learned that you still think about, or that still comes up
pretty often as you
have had a hugely successful teaching career
and are now doing a startup?
So Mike's philosophy,
I think, is you've got to build a real system
and get people to use it,
even when you're an academic,
even when you're in academia.
That's the best way to guide your research.
So not just for OtterTune,
for other things we've done at CMU, that's been my main guiding principle.
Try to build real software.
Because you just know where it's going to take you and some interesting ideas come out.
If you're just writing stuff to run stuff in a lab, then you don't see the big picture.
Yeah.
Well, I can see that influence very clearly on your projects. Well, take us back to the beginning of OtterTune. So it started as an academic project, but you were saying before the show, sort of, you know, in the idea phase, you know, early on, it was sort of gestating earlier than it became an academic project formally. Yes. So my PhD dissertation at Brown
was based on a system called H-Store,
which was commercialized as VoltDB.
And in addition to sort of helping build the system out,
my research area was in sort of automated methods
for optimizing and tuning the database system.
So I was looking
at how to
automatically
pick partitioning keys
for the database and other things.
And the challenge I faced as a student was
getting real workloads to use
for
experiments and analysis. We just ran
the standard open source benchmarks that everyone did,
but again, we wanted to get
real customer data, which is not easy to do
when you're in academia.
So when I started at Carnegie Mellon, I wanted to
continue down this path of
looking at automated methods
for optimizing database systems
right when
machine learning was becoming a hot thing
and certainly CMU was a
pioneer in this area.
And so I was trying to look at a problem in databases that I don't think people have really
covered in the previous work because database optimization is an old problem that goes back
to the 1970s.
And so I was looking for something that wasn't just like index tuning.
It's important, but I was looking for something different.
And I was looking at something where I tried to think about how I could work on the problem
without actually having access to the real database or the
real workloads. And so in the case of the problem that we looked at, knob tuning seemed like the
obvious thing because it wasn't until maybe the last 20 years that these systems have gotten so
complex with all these different knobs that people have to tune that you need to do something that was automated.
And it seemed like machine learning could solve the problem
in a way that you didn't need access to the workload or the database,
because if you just looked at the runtime metrics,
the telemetry of the database system,
you could use that as a signal to figure out what was going on.
So the initial version of OtterTune,
even before it was called OtterTune, because Dana came up with that name, the initial version was just doing some basic
decision tree method to try to pick out how to tune like three or four knobs.
And it was pretty similar to what had been done 10 years prior by Oracle and IBM. They had
similar tools that were rule-based: I'm running an OLTP workload with
these kinds of resources, and it would spit out some basic recommendations.
So when Dana started,
it was when we really pushed hard on this machine learning problem and tried to understand
how much can we actually tune and optimize.
Also, to back up here,
the knobs are these parameters
that control the behavior of the database system.
So like memory buffer sizes,
caching policies, log file sizes.
And so you can think of the database system
as like a generic car frame
where you can put big tires on it
and make a truck to haul things,
or you can put a race engine on it
to make it be a fast race car.
So the database system is sort of this general purpose
software where they expose these
knobs that they want a user
to tune based on what workload they think they're going to
run
because that'll change the runtime behavior of the system.
And so obviously if you have a write heavy
workload, you tune it one way. If you have a read heavy workload,
you tune it another way. But the problem is that there's
just so many of these knobs that it's beyond what a human is going to be able to reason about.
That's the high-level problem that the original OtterTune project was trying to solve. And then
Dana started, I think in 2014, and sort of pushed the machine learning story about it.
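To make the knob idea concrete, here is a minimal sketch of how a couple of common Postgres knobs might be set differently for a write-heavy versus a read-heavy workload. The parameter names are real Postgres settings, but the values are hypothetical illustrations, not OtterTune recommendations.

```python
# Hypothetical Postgres knob settings for two workload shapes.
# Values are illustrative only, not tuner output.

WRITE_HEAVY = {
    "shared_buffers": "8GB",              # buffer pool for dirty pages
    "max_wal_size": "16GB",               # larger WAL -> fewer checkpoints
    "checkpoint_completion_target": "0.9",
    "synchronous_commit": "on",           # keep durability for writes
}

READ_HEAVY = {
    "shared_buffers": "24GB",             # favor caching hot data in memory
    "effective_cache_size": "48GB",       # planner hint about OS cache
    "max_wal_size": "4GB",
    "work_mem": "64MB",                   # more memory per sort/hash in queries
}

def render_conf(knobs: dict) -> str:
    """Render a knob dict as postgresql.conf-style lines."""
    return "\n".join(f"{name} = {value}" for name, value in knobs.items())

print(render_conf(READ_HEAVY))
```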
One quick question, and this is maybe more of a slightly philosophical question,
on the number of knobs as complexity has increased. Do you think
that that complexity is healthy or even necessary, or are there more knobs than are helpful?
So it's a byproduct of how people build software. So if you're the database system
developer, you're adding a new feature, you know, at some point you have to make a decision about how to do something, like how much memory you should allocate for a hash table.
And rather than putting a #define in the source code that tries to be good enough for everyone, they just kick it out and expose it as a configuration knob, because you just assume someone else is going to come along that knows what the workload wants to do,
and they'll be in a better position to make that decision
on how to set that.
But of course, that never happens, right?
People have no idea.
And so, to your question,
is this the right thing to do,
to expose everything as knobs?
I mean, from an end user's perspective, no.
You want the database system to figure it out for you.
But from a software engineer's perspective, I think it's a reasonable assumption, or a reasonable reason why someone would want to do that.
Yeah, absolutely.
Dana, can you pick up the story and tell us how ML entered the picture and how OtterTune became OtterTune?
Yeah, absolutely.
So I believe that we really started looking into, you know, doing research on sort of
tuning methods, you know, machine learning based tuning methods starting in 2015.
And we were lucky enough to get one of the wonderful machine learning professors, Geoff Gordon, to help out and provide us some advice about the machine learning piece of it.
Since primarily Andy and I started as more systems and have learned a lot along the way, but we were primarily systems researchers. So, you know, after some
discussions with Jeff, he recommended Bayesian optimization as being, you know, a really
good method, you know, for this problem, primarily because, you know, we were trying to have,
we wanted a generalizable, you know, method so that we could apply the same method to different databases.
Bayesian optimization is a black-box model, so it provided us with the means to do this.
The other big consideration is that collecting training data for configuration tuning is super expensive in terms of, you know, the collection time,
more so, I guess, than for some of the other methods that are out there.
So, yeah, that was kind of the direction for OtterTune initially.
There was previous work using Bayesian optimization for configuration tuning. It was called iTuned,
and it was Shivnath Babu's work from a while ago. I think it was published back in late 2009.
So we were really excited because, you know, we added this important element to it,
which is automating, you know, a few different pieces. So I can touch on
those really quickly. One thing that was different was that we applied some different methods to try
and figure out which are the most important knobs to tune. And that's really important for
reducing the search space. So we provided some, you know, state-of-the-art methods there.
And then the other piece is reusing data
collected from, you know, different database instances to help tune new databases. So we
provided some of the generalizability methods for how you would do that transfer.
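As a rough illustration of the Bayesian-optimization loop Dana describes, here is a sketch using scikit-optimize; it is not OtterTune's implementation. The two knob names and the observe_p99_latency() objective are placeholders standing in for "apply the configuration, observe the workload, and return a metric to minimize."

```python
# Minimal Bayesian-optimization loop over two hypothetical knobs.
import math
from skopt import gp_minimize
from skopt.space import Integer

space = [
    Integer(128, 16384, name="shared_buffers_mb"),
    Integer(256, 8192, name="max_wal_size_mb"),
]

def observe_p99_latency(shared_buffers_mb, max_wal_size_mb):
    # Placeholder: a real tuner would configure a database with these
    # knob values, replay the workload, and measure p99 latency here.
    return (abs(math.log(shared_buffers_mb / 4096.0))
            + abs(math.log(max_wal_size_mb / 2048.0)))

def objective(params):
    shared_buffers_mb, max_wal_size_mb = params
    return observe_p99_latency(shared_buffers_mb, max_wal_size_mb)

result = gp_minimize(
    objective,        # expensive black-box function to minimize
    space,            # knob search space
    n_calls=30,       # each call is one observation period
    random_state=42,
)
print("best knobs:", result.x, "best p99:", result.fun)
```

The appeal of this family of methods is exactly what Dana points out: each observation is expensive, so the optimizer has to choose the next configuration to try carefully rather than sampling the space exhaustively.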
So we published that paper in 2017.
And then, actually, I'm going to let Andy explain this part because he does such a good job explaining this part
where we kind of, you know, we became...
We need a smoother transition then.
Like, I'll let Andy talk about it.
Oh, sorry.
Well, I love it when you tell this part.
I always get excited.
Okay.
So as Dana said, we published the first version of OtterTune, the idea of the OtterTune project, in 2017 at SIGMOD, the ACM SIGMOD conference, which is the top research conference in the field of databases.
And when it came out, the other academics were like, oh, this seems kind of cute, right?
They looked at the paper, but nobody in the industry really paid attention to it.
And then I met the guy that runs all of Amazon's machine learning division.
He now runs all of Amazon's database and AI division.
This guy, Swami, through a former colleague of mine, Alex Smola, who was a professor here at CMU
and then went to go work at Amazon
on their AutoGluon stuff.
Anyway, so he introduced me to Swami.
I got five minutes of his time
just to thank him for giving Dana a couple thousand dollars
to run her experiments in EC2.
And he was like, great.
Can you write a blog article for us?
We just started the new Amazon AI blog.
We need material.
So we converted Dana's paper into a blog article
that they published on the Amazon website.
It's still there.
And when it came out, that's when everybody saw it.
And we started getting a ton of emails saying,
we have this exact problem.
We'll give you money to fly a student out
and set it up for us.
And so we're very appreciative of that experience.
We got a lot of crazy people.
And Dana being a student didn't realize you don't respond to the crazy people that email you.
Which we quickly learned.
But anyway, so since 2017 is when we realized, okay, there is something here.
And we then sort of tried to make an improved version of OtterTune
that tried
to encompass more information about the hardware and so forth.
And that one, the challenge
there was, it was that,
not having real workloads,
just running with synthetic benchmarks,
the basic version of OtterTune could
do a really good job. But we thought
that we wanted to run this on real workloads and
push it and see what happened.
And so right before the pandemic,
we actually
partnered up with
some people that run the database division
at a major bank in Paris, France,
who were all keen about using OtterTune.
And so we did a collaboration with them
of actually running OtterTune on
a real workload and real databases.
This was on Oracle at the time, and it was on-prem.
And we ended up publishing another paper that was a final chapter in Dana's thesis.
The main finding we found out there was that the machine learning algorithm actually
doesn't make that big of a difference.
Dana talked about using Bayesian optimization.
We tried with deep neural networks, reinforcement learning.
In the end, they all do about the same job.
The real challenge is not the ML side of things,
but it's actually interfacing with the database itself
and collecting the data
and reasoning about the state of the system
when you make changes that are invalid
or trying to understand
how is it actually reacting and responding
and incorporating that information or feedback
back into the algorithms and the models. Both in the academic version
and since we've gone out into the real world, we've learned that's the hardest challenge
here, not the ML stuff. Fascinating. I want to dig into that. Dana, could you just give us a brief
overview of tuning? I don't know how many of our listeners have
a lot of experience tuning
a database. I kind of
think about it, to me,
using an algorithm to do this
is intuitive because it's, I almost
think about it, this is a horrible analogy, but
like a carburetor.
There are people who have a skill set for tuning
a carburetor, but if you
hook a computer up to it, it takes care of most of the hard stuff that is difficult.
What's the skill set around tuning and sort of where does, you know, where does the human skill stop and the algorithm begin?
Absolutely. So I would say, you know, before automated tuning tools, companies would traditionally add a database administrator, one or more of them, to the team.
And they would use a very manual process to tune the database, right, depending on, you know,
whether they were trying to improve performance or, you know, whatever the goal was there.
So what this looked like, you know, I'll just give an example for configuration tuning, is that you basically want to test a single configuration parameter.
You want to, you know, change the setting and then observe the workload after you've made that change. So it's a very iterative process, because it's often recommended that you only change one
configuration knob at a time.
And then the other reason is just because, like, you have to then observe the workload
again, and that takes, you know, a long time.
So the DBA is, you know, making these minor changes to the settings continuously until they're happy with the performance.
Also, for any companies that might not be able to afford a DBA or don't bring a DBA on for other reasons,
aside from automated tuning tools, the sort of method there is, well, you go read the docs and you go read blog articles.
You go find resources
and you do it yourself.
To be very clear here,
when you say who's doing this
and they don't have a DBA,
a lot of times, with OtterTune,
it's developers.
It's the DevOps people.
It's people that aren't database experts.
And it's whoever set up MySQL or Postgres
or whatever database you're using at the last job.
They draw the short straw and they have to maintain it at the new job.
As Dana said, they go read the docs and like, good luck.
That's right.
And they're not doing this proactively, right?
It's like something goes wrong with the database system.
And then they're like, oh, you know, oh, crap.
And that's when they begin the research to figure out what to do.
And then the process looks very much, you know, like what a DBA would do with these manual changes.
But, you know, it might take even longer because you've got to figure it out along the way.
And, I mean, this is, you know, forgive my ignorance, but it's an interconnected system, right? So even though changing one knob at a time
helps an individual isolate the impact,
it stands to reason that actually, you know,
understanding that multiple things need to change
is where you can get significant gains,
especially around like the speed
at which you can optimize performance.
Is that accurate?
That's correct. So the benefit of the machine learning models is, you know, they can learn
complex functions, right, and complex behavior and understand it. So it definitely expedites
the process of, you know, helping to understand which, you know, knobs are related to one another
and the interactions between them. So I do want to talk about another aspect of that, how people do tuning, like what you're supposed
to do versus what people like really do. And this is actually one of the things that we've learned
that we made an assumption in academia and then we went into the real world, and it just turned
out to be not correct. What you're supposed to do is take a snapshot of the database,
capture a workload trace, run it on spare hardware, and do all your tuning exercises on that spare hardware.
Once you think you're satisfied with the improvement, then go apply the changes to the production database and obviously watch to see if that was correct or not.
Very few people can do that.
The French bank we talked about before, they could do that because they had a very massive infrastructure team.
They were using Oracle, and people
may not want to hear this, but Oracle has really good tools,
better than Postgres and MySQL, to
do this kind of workload testing.
So they could do this.
And we thought, okay, when we
go out and we're a startup
in the commercial version, people
would be able to do this. And it's not been the case. People need
to run directly in production databases, because even if they have a staging or dev database, it's not the same workload. They can't push it as hard as they can in the production database. So any optimizations you make on the staging database may actually be insufficient on the production database. So again, the thing that Dana mentioned, when it was a research
project, about reusing this data across other databases, that matters more now in the commercial
world because people aren't going to have staging databases that we can run a lot of experiments on.
In some cases, we can't always reuse that training data.
And so we just need to be more careful in the production environment, like what
we're actually changing when we do the search.
That makes a lot of sense.
I have a question and actually I'm super happy that I have like two people
here coming from academia, because I can satisfy my curiosity around term
definitions.
We all like to use terminology,
and if you start going deeper,
everyone is using a little bit of a different meaning around them.
And I think it's important to have a common understanding of what we mean when we use some terms.
And both of you have used the term
workload, right? And real
workload. So what is a workload?
When we're talking about
database systems, what defines
the workload?
Yeah, so
our view of the workload would be, what are the
SQL queries that the application executes
on the database system to do
something, right? But it's more than just the SQL
queries. It's also,
if you're looking at OLTP workloads,
the transaction boundaries as well. So like the begin,
the commit, the aborts. So
you'd
want to use a tool that does workload capture.
So when we say workload capture, that would be
literally collecting a trace of
here's all the SQL commands that the application sent at different timestamps from these client threads and so forth.
It's also important, if you're using that method, to capture a period of high demand.
That's typically what you want to optimize for.
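For context, a captured workload trace of the kind Andy describes is essentially a timestamped log of SQL statements plus transaction boundaries per client session. Here is a minimal sketch of what one record might look like; the exact format is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TraceEvent:
    """One entry in a hypothetical workload-capture trace."""
    timestamp: datetime   # when the statement arrived at the server
    session_id: int       # which client connection/thread sent it
    statement: str        # the SQL text, or BEGIN/COMMIT/ABORT markers

trace = [
    TraceEvent(datetime(2023, 4, 19, 9, 0, 0, 120000), 17, "BEGIN"),
    TraceEvent(datetime(2023, 4, 19, 9, 0, 0, 121000), 17,
               "UPDATE accounts SET balance = balance - 10 WHERE id = 42"),
    TraceEvent(datetime(2023, 4, 19, 9, 0, 0, 123000), 17, "COMMIT"),
]
```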
Okay, that makes a ton of sense.
And then you mentioned at some point
about observing the system,
the database system.
What I've seen in my experience
when it comes to trying to figure out performance
and collect data to use
for optimizing the database system,
usually what I've seen engineers collecting
are the results of the query optimizer, like what the query optimizer creates as a plan, together, of
course, with some measurements around latencies and how long it takes and how
much data has been processed, and some statistics around the tables, right?
But observability, especially when we're talking about systems in general,
is much more generic.
There's much more stuff that someone can observe out there, right?
So what information are you seeking to observe on the database system as it works, to feed this data into your algorithms?
So typically when a user starts, you know, begins the tuning process, they tell us what they want to optimize for, you know.
And it might be a couple of things at a high level.
It's, you know, maybe performance or cost.
So that could be latency, that could be CPU utilization or, you know, the cost usage that you can collect from, you know, the cloud through APIs.
So in addition to that, you know, that's the primary metric that we're going to use to help
guide the optimization. But there's a lot of other really important things that you have to
take into consideration, which is why we collect a lot of additional information, including like
all of the runtime metrics, you know, in the system.
We collect the configuration at each step. And then the performance schemas in both MySQL and
Postgres expose just a ton of information there. And we try to collect as much of it. And also at
different levels, right? So you can collect statistics at the database level, index statistics, table statistics.
And, you know, like, what do we look at when guiding the tuning process?
Well, a lot of these other metrics provide a good signal also for performance.
And in addition, I actually want to mention, in addition to those metrics, we also incorporate some domain knowledge to also make decisions about the settings that we're recommending.
So what came to mind here is, for example, one important parameter that you can tune is the log file size in the system.
So if you increase the size, it typically improves performance up until a certain point. But, you know, as you increase the log size, you're also increasing,
you know, the time it takes to, you know, replay the redo log or, you know, recover the database
in, you know, in the case that it goes down. So we also have to take these practical considerations or practical aspects of tuning into consideration.
And how do you perform this observability
on the database system?
Like, how do you collect?
Now, I want to get a little bit more
into the product conversation, right?
Yeah.
Because I get what OtterTune does,
but how does it do that, right?
Like how do I go to my RDS database and set up OtterTune there to collect all this data?
Sure.
So the way that OtterTune works, and I'm going to discuss it in terms of the current product,
because I think it's just a little bit more intuitive how it works,
given that a lot of people are on the cloud now. So for example,
we support AWS RDS for MySQL and Postgres. So in addition to collecting the internal,
you know, metrics from the database system, we're also going to collect CloudWatch
metrics from AWS. So we're getting multiple sources here. So at the very beginning, you know, like I
mentioned, the user is going to go in, you know, pick what they want to optimize for and, you know,
maybe a few other settings. And then the next thing that they're going to do is provide us
permissions, you know, both to, you know, access this data from the cloud APIs and the database system.
And so for the cloud APIs, that's pretty straightforward.
But for collecting internal metrics from the database system,
nobody wants to make their database publicly accessible, right?
Which would mean OtterTune would be able to directly connect to it and grab the information. So we provide an open source agent for people to deploy in their environment.
It's a similar setup to like Datadog, New Relic, other observability companies.
And so they deploy that and then that's able to directly connect to the database,
collect all the information that I mentioned previously, and then send that back to OtterTune in a more secure manner.
So once we have the proper permissions to collect all this data, the way it works is we observe the database for a given configuration for 24 hours at least, because a lot of our workloads, or we see a lot of diurnal workloads, potentially like e-commerce sites.
There's a number of industries, but they're kind of busy starting at 9 a.m., and then they hit their peak demand, and then it kind of drops off in the evening. So, you know, capturing 24 hours' worth of data just makes sure that we're being consistent.
So in the very first iteration, we collect the current configuration and
then begin observing the database for 24 hours. We take that information,
we store all of this data in our database, and then, using all
the data we've collected so far, as well as some other data, we
build machine learning models that then generate better configurations.
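For reference, fetching one of those CloudWatch metrics for an RDS instance looks roughly like the sketch below (boto3, assuming AWS credentials are configured). The instance identifier is a placeholder, and this is not OtterTune's agent.

```python
# Fetch average CPU utilization for an RDS instance from CloudWatch.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    StartTime=datetime.utcnow() - timedelta(hours=24),  # one diurnal cycle
    EndTime=datetime.utcnow(),
    Period=300,             # 5-minute granularity
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```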
And I'd just like to add to that, the data we're collecting is this runtime telemetry of the database system, or through CloudWatch.
CPU utilization, pages read, pages written, things like that.
It's all benign information.
We've done deployments at the French bank and other ones in Europe and their infosec people look at what we're collecting and there has not been any GDPR issues.
So, you know, OtterTune doesn't need your data. We don't care about your data. We
don't care what your users did or how you use your data, and anything we collect, like a query plan to send
back for, you know, the query tuning feature, we strip out anything that's identifiable, because again,
we don't care. Right. And we also make it really easy for users to kind of switch
on and off what information we can collect and kind of adjust our recommendations accordingly. And is the product like generating recommendations
or is it able to go back and automatically tune the database? There's the current version
and there's the new version. The current version can automatically configure knobs,
and then we have additional health checks that provide high-level recommendations about other things.
Like, here's some unused indexes.
You should drop them.
Here's some auto-vacuum settings and Postgres
that are messed up.
You should go fix those things.
But they're not as precise or specific
as what we want them to be.
So the new version is taking a broader view about the lifecycle of the database system
and providing the automated knob recommendations, index recommendations, and so forth, but also
looking at the overall behavior of the database system over longer periods of time and trying
to provide guidance and recommendations so that people know that they're running the
best practices.
And so the new version of OtterTune is, like as Dana said, if you tell us you want to optimize performance, we'll still do that. But we also provide guidance about what's the overall health
of the system, and are you running the best practices? Some of our recommendations may not always lead to the best
performance. Like if backups are turned off,
you should turn them on.
That could potentially make you run slower,
but it's the right thing to do.
So the newer version of OtterTune
is trying to, again,
looking at a broader view of a database,
not just like,
how can I make it run super, super fast?
And I have to admit,
coming from academia,
like all we cared about was like,
do we make the graph go up?
Do we make it go faster?
And then in the real world,
what we found is like, yeah, that matters.
Honestly, people come to us and say they just don't know what they should.
They don't know what they don't know.
They don't know what they should be doing.
Yeah.
And we see enough databases that I think we're in a position to provide recommendations along those lines.
Yeah.
Yeah.
That's actually a very interesting point.
Has your perception of what performance is changed through your experience with OtterTune
and going out into the real, I mean, the market out there, compared to how
both of you were perceiving performance as academia people? I mean, for raw database
performance, no. At the end of the day, is your throughput going up? Is your P99 going down?
CPU utilization is probably the one thing that we didn't think about
that matters a lot also in some cases.
One anecdote,
we did a deployment at booking.com
and they wanted to reduce their CP utilization
so they can do consolidation.
And so even after a human expert in-house,
their MySQL expert, optimized the database system,
OtterTune was able to squeeze out another 25% reduction.
And so I think it was a cluster of like 20 machines.
And, you know, if you shave off 25%,
now, you know, you're turning off three or four machines.
And they had a ton of these clusters.
So for them, that was a big deal.
So in terms of raw performance, I still think the things that we focused on in academia
still make sense, but it's these other, you know, mushy, fuzzy-feeling things about databases that
are hard to measure in academia and write a paper about. Like, oh yeah, someone feels better
about their database. Okay, how do we measure that? And the new version of OtterTune is basically doing that for them.
We tell them up front, like, this is the healthier database.
Here's the things you need to take care of.
And we're not recommending them just because, you know, Dana has a PhD in databases or Andy
read the textbook, whatever.
It's things that we see in databases that we know this is the right thing for you to
be doing.
I'll also add a quick note. What we've learned from our customers that, you know, is much different
from back in academia is they're really looking for peace of mind and also stability in their
recommendations. So, you know, it's not just about optimizing for, you know, the absolute peak
performance over a small period of time. Like some configuration knobs can provide some peak
performance, but then
they're sort of unstable as different background processes kick in or something else happens in the
system. Yeah, 100%. There's this thing called on-call that nobody's happy to have to do,
right? And the last thing that you want is getting a PagerDuty message where,
oh, now I have to go and
figure out what's wrong with my database or my system or whatever.
And so peace of mind is super, super important.
It's like a hundred percent.
I totally understand that.
And all right.
So one question that has to do with the systems that you're working with.
I've heard you talking both about on-prem installations, you talked about the bank that had
their own infrastructure there, and you've talked a lot about AWS and RDS.
Is there a difference between the two,
from what you have seen deploying the product so far?
Like, is there a difference
between trying
to optimize workloads
on-prem
and trying to optimize
workloads
on the cloud?
From the machine learning
algorithms perspectives,
there is no difference,
right?
It's just numbers,
right?
The challenge though
is what I was saying before.
It's actually
interacting with the database
and its environment
is the harder part. And so
we only support RDS right now.
And what that provides us
is a standardized API or an environment
that we can connect to, retrieve the data
we need, and make changes
accordingly.
You can't change everything, but
in terms of knobs, you can modify
the parameter groups in AWS through a standard API.
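For reference, the standard API Andy is referring to looks roughly like the sketch below (boto3 against RDS parameter groups). The group name and the parameter values here are placeholders, not recommended settings.

```python
# Apply knob changes to an RDS instance by modifying its parameter group.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",   # placeholder group name
    Parameters=[
        {
            "ParameterName": "work_mem",
            "ParameterValue": "65536",            # kB for this knob
            "ApplyMethod": "immediate",           # dynamic parameter
        },
        {
            "ParameterName": "shared_buffers",
            "ParameterValue": "1048576",          # in 8 kB pages
            "ApplyMethod": "pending-reboot",      # static parameter
        },
    ],
)
```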
We've been asked to support on-prem, and then when you start talking to the customer about what they actually want, how we'd actually apply the changes we're recommending, everything's always different.
It's always like, oh, you have to write to this Terraform file on GitHub, and that fires off an action, or you got to write to this other thing.
And it's all this one-off bespoke custom
stuff that people implement. And
it's not that we couldn't support it.
We'd expose an API,
which we eventually will do. It's just
it'd be a bunch of engineering work that we'd
have to do that may not be easily
reusable across different customers.
So for that reason, we only focus on AWS
because it's a standardized environment.
Yeah, yeah, 100%.
And when we're talking about services like RDS,
like, okay, marketing usually overpromises,
but the idea was you don't have to monitor your database.
AWS is going to do that.
But apparently, that's not exactly the case, right?
There's still a lot of things that need to happen.
We've had customers tell us they thought Amazon tuned their database for them already.
And Jeff Bezos is not doing that for you, trust me.
So why is this happening?
Why is AWS not doing that?
You mentioned at the beginning, for example, the access to the workload data that they have.
They have access to all this information to do this.
So why are they not doing it?
So AWS has, and other cloud providers have something that they call the
shared responsibility model, which kind of means that like certain, you know,
as far as managing the database, they'll handle some parts of it, but not other
parts and more specifically the parts that they won't handle are typically, you know, looking at customer data or anything where they have to, you know, read or interact with customer data.
You know, I'm not saying that this could change in the future, but I think that a lot of the, you know, sort of recommendations that they do provide, you know, configuration tuning or other types of tuning, they're able
to do it without really looking at customer data. So for example, they'll just, you know,
they'll improve the default setting of a Postgres or MySQL or other database knob
just based on the hardware, right? So you can get a little bit of bang for your buck on that
because a lot of the default settings are meant for like sort of the minimal hardware requirements of the system.
But they don't do tuning, to the best of my knowledge, based on the workload.
Yeah, yeah. Which knobs are, let's say, the most, how to say that, most commonly used for going
after increasing performance?
I'm pretty sure you're running some statistics on your data, or metadata,
I don't know what to call it,
the data that you collect for the workload optimization
that you are doing, right?
So what have you seen for the systems that you are working with, like Postgres and MySQL,
for example?
So with Postgres and MySQL, for both of these sort of disk-based systems, the buffer pool
is going to be very important.
The log file size is going to be important.
Certain, you know, parameters around checkpointing tend to be important.
You know, I can kind of name off a few, like those are, you know, targets that you'll very
frequently see in like blog articles. Also those that impact, you know, the quality of the optimizer.
And then you have ones that are, you know, really important, specific to a database system.
So some knobs that are really important there
for Postgres are the
auto vacuum knobs, tuning those correctly to make
sure your tables don't get bloated.
And those are some examples.
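As a rough illustration of the autovacuum side of that, here is a sketch that checks per-table dead-tuple counts and the relevant global knobs using standard Postgres statistics views. The connection string is a placeholder, and these are not OtterTune's health checks.

```python
# Check dead-tuple bloat per table and the autovacuum knobs that control
# when Postgres cleans it up. Requires psycopg2; connection is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=metrics_reader")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10
    """)
    for relname, live, dead, last_vac in cur.fetchall():
        print(f"{relname}: {dead} dead / {live} live, last autovacuum {last_vac}")

    cur.execute("""
        SELECT name, setting FROM pg_settings
        WHERE name IN ('autovacuum_vacuum_scale_factor',
                       'autovacuum_vacuum_threshold',
                       'autovacuum_naptime')
    """)
    print(dict(cur.fetchall()))
conn.close()
```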
That makes sense.
So we've done our own benchmarks and when you compare against the default config Amazon gives you with RDS and Postgres MySQL versus what OtterTune optimizes, we can still get 2x better performance.
And again, going back to academia, one of the challenges that Dana and I were facing were like, we didn't know what the starting point should be, right?
For like, okay, how much better is OtterTune?
And we would have this debate of like, okay, well, people, you know,
people would do some tuning.
It wouldn't be this, you know, really bad configuration.
And we overestimated what people would have actually in the real world.
So OtterTune is better in the real world than we thought it would be in academia.
The other interesting thing too about Amazon RDS is that they obviously have Aurora for Postgres and MySQL. And in that case, Amazon has cut off the bottom half of these database systems and replaced the storage layer with their own proprietary infrastructure. And so this removes a bunch of the knobs where oftentimes you see a major improvement in performance for vanilla or stock Postgres and MySQL running on RDS.
So that's been one of the challenges.
But also, I can't prove this,
but the Aurora knobs actually look like
they've done some tuning,
much better than the default for RDS.
And so we think that people go from RDS to Aurora
and they see this big bump in performance and think,
oh my god, Aurora is amazing. No, they actually just
tune the knobs better for you.
And they charge 20% more.
That's super interesting. It's what I wanted
to ask next, because
we see more systems that are,
let's say, serverless, right?
As they are called,
let's say something like PlanetScale,
or if we go to the OLAP systems, something like Snowflake.
So in these environments where there's like an even more abstraction
that is happening, like between the developer and the database system itself,
what's the space there for automated tuning, right?
Like what OtterTune is doing.
So, I mean,
all of these systems have knobs.
Snowflake doesn't expose the knobs.
They're there, I know,
because they told me, right?
It's like 600,
whatever they have.
And basically what happens
if you're a customer of Snowflake
and you have problems,
you call, you know,
their support people
and then the support people
talk to the engineer
and the engineer says,
oh yeah, tune these knobs.
So the knobs still matter.
It's just whether or not you have access or are exposed to them.
And so I agree with you.
If you abstract away enough of the system and the knobs aren't there, then, you know, there's nothing to tune.
But oftentimes, again, there's other things you still want to tune in a database system.
And this is what the commercial version of OtterTune does that the academic
version didn't, right?
We can tune indexes,
we're starting to look at tuning queries,
and again, there's other
cloud stuff that you should just be doing
that,
you know,
like, again, I wouldn't
call it a knob since it's not changing the behavior of the
system, but it's the right thing to do.
So the newer version of OtterTune
that we're working
on now
starts to look
at the life cycle
of the database
beyond
a single database
by itself
and so what I mean
that that is
oftentimes people
have multiple
database instances
and
you know
they may not
actually be, like,
I'm gonna say,
physically connected,
but, you know, Amazon doesn't know that they're replicas of each other.
Or they don't know, Amazon doesn't know that here's the US version of the database and here's the European one.
And it's the same application, the same schema.
It's just they're disconnected.
So this is where we're going next: looking at the database, in addition to the schema and all the other metrics that we're collecting, and trying to understand what's above it in the application stack and
starting to make recommendations and advise users about how they should be using their database or
what to expect coming up. So an example would be, we had a customer that had a database in Europe
and a database in the US. And it was
actually the same schema because it was the same application, just running two different versions
of it. And what happened was OtterTune identified that the US database was 10x faster than the
European one. And with the version of OtterTune at that time, we couldn't figure out why. And the
customer eventually figured out,
oh, because they forgot to build an index
on the European database.
It was the same schema.
Just someone forgot to do migration
and add the index.
And so that's where we're going next with this
is like, okay, now I understand.
Here's two databases.
They have the same schema,
roughly the same workload,
but they're physically distinct instances.
And so we can start making recommendations
like, okay, these should be in sync
or, you see this
thing I've tuned here, you may want to make the same change over there. You can also start doing the same thing for staging, testing, and production databases. So for example, people oftentimes do schema migrations on the staging database, and then a week later, two weeks later, some later point, they apply the change to the production database. So again, Amazon's not going to know that these,
you know, the staging and the production database are logically linked together.
The user knows that, the customer knows that. But you can start doing things like, okay,
well, I see that you've done a schema migration on the staging database. And I know the things
that you've done, you've added an index, you've created a table and so forth. And so our
recommendations could be things like, okay, well, you're going to make this change
to the production database
because you've already made it on staging database.
And for these changes, like renaming a column,
you can do that right away.
That's cheap to do.
But for adding a column or dropping a column or something,
those are more expensive operations.
You should be doing that
during your maintenance window at this time.
Like start making those recommendations
that like if someone can see the entire view of the fleet of the databases that they have and know how the customer is actually using them, you can start making recommendations that a human actually wouldn't even be able to do now because it's just at scale.
It's not possible.
That's the big new vision of what the new version of OtterTune is going to start doing this year.
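To make the "missing index on the European replica" kind of check concrete, here is a minimal sketch that flags indexes present in one Postgres database but not in another with the same schema. The connection strings are placeholders, and this is not OtterTune's implementation.

```python
# Sketch: flag indexes that exist in one Postgres database but not another
# with the same schema (e.g. staging vs. production, or US vs. EU instances).
import psycopg2

def index_names(dsn: str) -> set:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT indexname FROM pg_indexes WHERE schemaname = 'public'")
        return {row[0] for row in cur.fetchall()}

staging = index_names("dbname=app_staging user=metrics_reader")
production = index_names("dbname=app_prod user=metrics_reader")

for missing in sorted(staging - production):
    print(f"index {missing} exists on staging but not on production")
```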
Yeah, that's very interesting. And I'd like to, because Dana, you mentioned earlier that
these cloud providers have this model of like,
okay, there are some stuff that I'm going to manage for you
and some stuff the customer should manage, right?
And I'd like to take from both of you, if it's still,
your opinion on these new serverless models.
I agree, everyone agrees that having hundreds of configurations for a system out there,
it's probably not the best way to expose this functionality to a human user, right?
But is there, let's say, instead of going to the other extreme of having everything,
let's say, completely abstracted, is there
some kind of balance that is at the end better?
Like expose some knobs, let users, through an API or something like that, on something
like PlanetScale for example, go and do the tuning if they want to, and leave those to
the user, and some other knobs where it's
better for the infrastructure provider at the end to go and manage,
and it will always be a better option to have the infrastructure provider do that.
What's your take on that?
Because in this industry, you always go through extremes.
Let's build only UIs or only CLIs, you know, but isn't the truth
somewhere in between?
So I'd love to hear your opinion on that.
Sure.
Yeah, I guess I can start.
So, you know, you had mentioned serverless.
We actually support Amazon Aurora Serverless right now, and it further reduces the knobs.
So just kind of to your point.
But what you're really asking, I think, is, you know,
what's the right balance like for knobs that you don't expose to the user
versus knobs that you do expose?
And, you know, ultimately, how do you handle the configuration of those?
That's a really hard question to answer, both in terms of, you know, methodology,
but also in terms of practical reasons. I'll go over both really quickly. So at a high level,
you can imagine, I think, that this is kind of the route that Snowflake takes,
which, as Andy mentioned, they don't expose knobs to users. The values that they've chosen, you know, behind the scenes work well for most of
their workloads. However, it's definitely true, because, as Andy said, he's spoken with them
about this at some length in the past, that that's not always the case. There are definitely customers where the configuration values are inappropriate. So what happens then? Well, they have, you know,
sort of a database administration team that goes in and will
configure on a case-by-case basis, you know, if there's like a big
performance issue. So it's kind of trying to find the right balance:
if you're not going to expose a knob to users, it needs to be really generalizable.
if you're not going to expose a knob to users, it needs to be really generalizable.
And so maybe some of those configuration knobs where you can automatically tune it in the
database system itself might rely on the hardware.
I can imagine that they would be able to do that.
I think it's much more difficult for what I would say would be a lot of the knobs that rely,
that should be tuned for the workload as well.
And so it's kind of this balancing act. As far as just like the practical implications, just managing a ton of knobs is really difficult just like from, you know, like an engineering perspective.
So you have to deal with like deprecating knobs in the system as, you know, different like components change and you end up adding new knobs,
some get deprecated.
These knobs no longer have any impact on the system.
There's just a lot of management that goes along with it,
which I think just adds to the complication
of trying to kind of split up and expose some,
but not all knobs.
Yeah, that makes sense.
So, you know, Dana's focusing on knobs here.
As I say, there's other stuff to tune.
And the question is, what is exposed to you as a developer?
Like one extreme would be like, I have raw access to the box.
I can SSH into it, do whatever I want.
Yep.
Nobody does that anymore, right?
The other extreme would be like, I've only exposed an API
where I can do like gets and sets, basic things.
And therefore I can't even write a raw SQL query.
The future is going to be somewhere
obviously in the middle.
And even if it's serverless,
I'm fully in the relational model camp.
So like it is going to be a relational database.
Most people are going to run a relational database
through SQL.
And if you have SQL,
then that's a declarative language
that abstracts away the underlying physical plan
of what the system is going to use to execute your query.
So that means someone's going to have to tune that accordingly
or know what the physical design of the database is going to be.
And so I think there's always going to be a need
for an automated tuning
service, something like OtterTune
in the future, because that's how most people
are going to interact with databases. SQL was here
before we were born. It's going to be here when we die. It's not
going away. And
because it's declarative, you
need someone to actually tune things.
Yeah. I totally
agree. The reason I'm saying that is because
from my
experience, I have experienced a little bit
of extremes, like seeing
something like Trino, for example, which
there's a lot of configuration that happened
there on many different levels.
Going to the cloud
version of Starburst
Galaxy, which is completely, let's say,
opaque to the user.
And at the end,
I still feel, as a
product person, I'm not talking as
an engineering person here,
that the user still needs
to have some control. It doesn't mean that
you have to oversaturate the user with
too much control, as we
usually do in the enterprise.
But going to the other
extreme, it's also bad at the end for the experience that the user has and causes a lot of
problems and frustration there. And I think it's a matter of figuring out exactly what are the right
knobs to put out there to the users. And it is part of a product conversation. And I think it is important when we are talking about
developer experience, and that's, I think, what differentiates it
from the user experience.
You don't have to completely abstract and make everything
completely opaque, right?
The user still needs to have some kind of control over it when they're
developing and engineering solutions.
Anyway, we can talk more about that.
But I want to ask something else.
You mentioned about all this data, like all these models that you are building.
And these are like about systems that are like very complicated.
Like teams have been working on them like for a very long time.
Like Postgres has been there like for like decades.
Do you see like some kind of synergy with these teams?
Like do you work like with them?
Do you see like the data that you collect or like the experience that you're
building, like helping them like to build like even better systems at the end?
Like have you seen something, something happening or an opportunity in the future?
I mean, we have not interacted with the Postgres MySQL community.
I think there was, I mean, we did a deployment once where we think there was a MySQL knob
of adaptive hash index where it's on by default and it actually shouldn't be.
And I think we brought that up on Twitter,
and I think some MySQL people looked at, you know,
investigating whether they should make
that be off by default.
I think
we have not
interacted a lot directly with the developers
based on our
findings in OtterTune.
Where we do want to
go, we haven't really got there yet,
is actually interact more with the developer communities for some of the major application frameworks that people are using, like Django, Ruby on Rails, Node.js stuff.
Because, again, because we see the schema and understand what the queries look like, we know, in some cases, what application framework you're using. So we haven't completed
this analysis yet, but we want to sort of identify what are some common mistakes we see about how
these ORMs are using the database, and then either, you know, some things we can fix, like, oh, you're
missing an index, add those; some things might be more fundamental with the SQL queries the application is
generating. So we think that's later
this year where we want to go next.
And again, reach out to these communities and say,
hey, look guys, if you're running a Django application,
you're going to hit these problems. OtterTune can solve
some of them, but other things, I think you guys have to fix
in the library. And,
going back to the university,
I did have a major research project.
We were building a database system from
scratch, trying to be completely automated,
to remove all the things that Dana's work had to tune.
Could we just build all that stuff internally
and have machine learning components
on the inside of that?
That project has been,
we abandoned that project because of the pandemic.
It became too hard to build a system for graduate students during lockdown.
But I think a lot of the things we learned in OtterTune
fed into the design decisions we were making in that system.
And it's something we probably would revisit.
I might just quickly add, as far as collaborating or just even talking
and working with Postgres or MySQL.
One interesting story is we were talking with the former product manager of MySQL who was basically in charge of, you know, working on configuration knobs.
And MySQL added a configuration knob and I'm forgetting the name of it, but essentially... Dedicated server.
Dedicated server.
Thank you, Andy.
And basically what happens is you enable this and then it sets four knobs, like, you know,
according to your hardware and some other metrics that they collect.
So this, you know, is a big deal, and potentially this is why they reached out, and providing them
advice or
insights from
the data that we collect could be helpful
here. But they also
mentioned that just this single change
took hundreds of
engineering hours to implement.
I think it was hundreds. It was at least
dozens.
These systems are so complex that it's really difficult to make the changes
internally.
And with the open source communities, it's hard to know whether they'd want to prioritize
something like that.
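As a purely illustrative aside on the kind of hardware-derived setting Dana describes with the dedicated server knob, here is a toy sketch that derives a couple of MySQL knob values from detected RAM. The sizing rules are invented for illustration and are not MySQL's actual innodb_dedicated_server logic (see the MySQL manual for that).

```python
# Toy illustration of deriving buffer-pool and redo-log sizing from RAM.
# These rules are invented for illustration; they are NOT MySQL's actual
# innodb_dedicated_server formulas.
import os

def detect_total_ram_gb() -> float:
    # Linux-only: read total physical memory via sysconf.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

def suggest_knobs(ram_gb: float) -> dict:
    if ram_gb < 4:
        buffer_pool_gb = max(ram_gb * 0.5, 0.125)
    else:
        buffer_pool_gb = ram_gb * 0.75       # leave headroom for the OS
    return {
        "innodb_buffer_pool_size": f"{buffer_pool_gb:.1f}G",
        "innodb_redo_log_capacity": f"{min(buffer_pool_gb / 4, 8):.1f}G",
    }

print(suggest_knobs(detect_total_ram_gb()))
```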
100%.
100%.
All right.
One last question from me, and then I'll give the microphone back to Eric.
And I'll go back to the beginning of the conversation that we had.
And I'll ask Andy about that.
Because you mentioned at the beginning that it was always really hard to go and find real workloads out there to use in research and drive the things that we were doing.
And personally, I have pretty much memorized the TPC-H queries at this point.
I wrote TPC-C four times when I was in grad school. Oh, wow. Yeah. Do you see an opportunity for the whole industry to move forward and have
more tooling for everyone out there who's trying to build, to be able to benchmark or measure performance, or just have
data, right, to go and build systems on top of, becoming a reality? Do you see there's a
chance to escape from the standards like TPC-H and have something more meaningful out there? I mean, it's...
There isn't
a consortium that
people put together, like, hey, here's this great
treasure trove of data that everyone can use.
It exists in a lot of different pockets.
Again, here at CMU,
we have our BenchBase framework. It's a bunch of database benchmarks. Some are synthetic, some are based on real workloads, and it's a single platform that people can use to run experiments.
The DuckDB guys at CWI
have a public BI benchmark data set
that they collected with Tableau.
So there's bits and pieces of it
that are out there.
The one thing that we think, for transactional workloads, really understanding the amount of concurrency that you see, like, you know, in a real workload, that's really hard to get. And for that you need a raw trace or something like that.
And even in OtterTune, we don't have that. Unless you sit in front of the database server and see all the queries coming in, we don't even have that.
Makes total sense.
All right, Eric, I monopolized the conversation here, so the mic is all yours.
One question that's
so interesting
is
the datasets that you can use to speed up
the cycle of, you know,
tuning a database, right? So you have data sets
from other tuning processes you've run. I'm assuming, at least the way that we've talked
about it, that those are actual data sets from real-world optimizations that you've done
previously that you can apply moving forward. Is there a
role for synthetic data? Can you use those datasets to actually generate additional
synthetic datasets that could sort of take that even further? Is that part of your vision?
When you say synthetic, you mean, like, take the real datasets we've gotten and then reverse that back
to SQL? Sorry, I meant sort of creating training datasets that are sort of manipulated through
machine learning, like creating a synthetic dataset that is based on real-world datasets
so that you have a larger sort of repository of training data.
Yes.
So that's essentially what we're doing now, right?
We don't take TPC-C and run experiments
and then use that to figure out,
like for a real-world customer, how to tune them.
Like we look directly at the real-world database.
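The published OtterTune research describes reusing past observations with Gaussian process models. As a loose, simplified illustration of that idea, and not the production pipeline, here is how previously observed knob-and-throughput pairs could be used to score an untried configuration; the numbers are fabricated and scikit-learn is assumed:

```python
# Loose illustration: fit a model on previously observed (knob, performance)
# pairs and score a candidate configuration before trying it on the real
# database. Data is fabricated; this is not OtterTune's actual pipeline.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Each row: [buffer_pool_gb, max_connections]; target: observed throughput (txn/s)
observed_knobs = np.array([[1, 100], [2, 100], [4, 200], [8, 200], [8, 400]], dtype=float)
observed_tps = np.array([900, 1400, 2100, 2600, 2500], dtype=float)

model = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
model.fit(observed_knobs, observed_tps)

candidate = np.array([[6, 300]], dtype=float)
mean, std = model.predict(candidate, return_std=True)
print(f"predicted throughput: {mean[0]:.0f} +/- {std[0]:.0f} txn/s")
```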
The challenge, though, and Dana sort of mentioned this, is why you have to look at 24-hour periods by default; in some cases you can turn this down. The challenge is that OtterTune can make recommendations, again, just focusing on knobs: we make recommendations on how to tune your knobs, and then we apply the change and measure performance. It's very hard, if you're looking at the production database, to determine whether a change in performance is something that OtterTune did or something that occurred upstream.
So if you change a bunch of your knobs
and the next day the queries are 20% faster,
is that because we did something
or is that because they deleted a bunch of data
and therefore the indexes are smaller?
Or they're running different queries, or they added an index.
This is why you have to have a holistic view
of the database
in a way that we didn't appreciate in academia.
So we know what changes they've made.
You know, we obviously can't, how do I say this, like, they make certain changes in the application that cause queries you've never seen before.
We at least can see that and identify that, okay, something has changed, but then we can't attribute the change in production performance to us. So
this goes back to what I was
saying in the beginning. In academia,
we assumed that people could capture the workload and run it on spare hardware that was the same as the production database. So that way,
you always have a baseline to compare against
and it's the same workload over and over again.
In the real world, you don't have that.
So you just have to use additional statistical methods and be able to figure out, okay, things have changed, and I've seen enough data to attribute the benefit that they're getting to us.
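A toy sketch of the attribution problem just described: before crediting a knob change, check that the workload looks comparable and that the shift is big enough to matter. The thresholds, query fingerprints, and latency numbers below are made up for illustration and are not OtterTune's actual statistical method:

```python
# Illustrative only: a crude attribution check for a knob change.
from statistics import median

def attributable(before, after, queries_before, queries_after, min_gain=0.10):
    """before/after: query latencies (ms); queries_*: sets of query fingerprints."""
    if queries_before != queries_after:
        return False, "workload changed upstream; cannot attribute the shift"
    gain = 1 - median(after) / median(before)
    if gain < min_gain:
        return False, f"median latency moved only {gain:.1%}; within the noise budget"
    return True, f"median latency improved {gain:.1%} on a comparable workload"

ok, why = attributable(
    before=[120, 135, 118, 140, 122],
    after=[95, 101, 99, 110, 97],
    queries_before={"q1", "q2", "q3"},
    queries_after={"q1", "q2", "q3"},
)
print(ok, "-", why)
```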
Yeah, that makes total sense.
Is the optimization of those sort of like, let's call
them contextual factors that aren't, you know, directly related to the knobs themselves,
you know, so upstream changes, or like you said, you know, the context of
dev versus production database, etc. Are those factors more fragmented? Like,
is it a more difficult problem? Like, if the knobs themselves are fairly well-defined, is that context more fragmented? And what's the approach to sort of solve for that, if it is more fragmented?
What do you mean by fragmented? Sorry.
Yeah. You know, there's less... if we think about knobs, there can be, like, consistent definitions across databases, right? So cache size or something like that, right? But the difference between dev and prod maybe is more subjective and not necessarily, like, a setting that you can observe technically.
Yeah.
So what makes a dev database?
In some cases, there's a tag in RDS.
They can tell us the name is dev.
The name is prod.
We see that.
So the current version doesn't do any sort of deep inference based on that, like those tags.
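A minimal sketch of reading that RDS tag signal, assuming boto3, valid AWS credentials, and a placeholder instance ARN; the tag keys checked here ("env", "environment") are only a guess at a common convention:

```python
# Sketch: read the tags on an RDS instance and look for an environment hint.
# The ARN and region are placeholders; tag key names are an assumption.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
arn = "arn:aws:rds:us-east-1:123456789012:db:example-instance"

tags = rds.list_tags_for_resource(ResourceName=arn)["TagList"]
env = next(
    (t["Value"].lower() for t in tags if t["Key"].lower() in ("env", "environment")),
    None,
)
print("environment tag:", env or "not set")
```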
Where we want to get to, which we don't do yet, is we come back and ask the user, hey, we think this is a test database.
Is this true?
And they say yes or no.
And then we know something about whether it is or not.
And then we can make recommendations accordingly.
So an example would be in our current version, we identify unused indexes.
Like if you've never run a query on an index, or at least never run a read query on an index,
then you probably don't need it.
But if it's a staging database and they haven't run any queries, because it's only used when the developer wants to test something, then a bunch of indexes look unused just because you haven't used them yet.
Yeah.
Right?
So we need to be more mindful
of those kind of things.
We don't ask the user yet,
hey, please let us know
what the database is.
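Here is a sketch of an unused-index check along these lines against PostgreSQL's statistics views, assuming psycopg2 and a placeholder connection string. As noted above, a zero scan count on a staging database may only mean nobody has run queries yet, so treat the result as a hint rather than a verdict:

```python
# Sketch: find indexes that have never been read, per pg_stat_user_indexes.
# Unique/constraint-enforcing indexes are excluded since they are not optional.
import psycopg2

conn = psycopg2.connect("dbname=app host=127.0.0.1 user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan
        FROM pg_stat_user_indexes s
        JOIN pg_index i ON i.indexrelid = s.indexrelid
        WHERE s.idx_scan = 0
          AND NOT i.indisunique
        ORDER BY s.schemaname, s.relname
        """
    )
    for schema, table, index, scans in cur.fetchall():
        print(f"{schema}.{table}: index {index} has never been read ({scans} scans)")
conn.close()
```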
Where we want to go next, in addition to asking whether it's dev, testing, staging, whatever, is that when I said there was a logical link, like the schemas are the same, we're seeing the same queries: are these two databases brother and sister? Are they related? Right? And again, that's a prompt we'd have to ask the user. It'd be very difficult to reason about this stuff.
And this gets into another big challenge
in machine learning.
It's just the external cost factors
that you just can't know automatically.
You have to be told these things.
It's a limitation of machine learning.
And so, you know, how do you say this? You know, Dana mentioned
that there's sort of a bunch of guardrails
we put in the algorithms to make
sure that it doesn't make certain decisions or optimizations that could have problems
that we don't see when just measuring performance.
If you turn off writing things to disk, you're going to go a lot faster.
But now if you crash and lose data, people are going to be upset.
The algorithms can't reason about that because that's an external cost.
We have to put in our domain knowledge as database people to know that these are the
things we should be doing. The same idea applies to
the staging versus dev
linking. That makes total sense.
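A toy sketch of that guardrail idea: whatever the optimizer proposes, a layer of domain rules vetoes settings that trade durability for speed. The knob names below are real PostgreSQL and MySQL settings, but this particular rule list and helper are illustrative, not OtterTune's:

```python
# Toy guardrail: filter out recommended knob values a DBA would veto as unsafe.
UNSAFE = {
    ("fsync", "off"),                          # Postgres: fast, but lose data on crash
    ("full_page_writes", "off"),               # Postgres: risks corrupted pages after a crash
    ("innodb_flush_log_at_trx_commit", "0"),   # MySQL: can lose the last second of commits
    ("innodb_doublewrite", "off"),             # MySQL: risks torn pages
}

def apply_guardrails(recommendation: dict[str, str]) -> dict[str, str]:
    """Drop any recommended knob value that appears on the unsafe list."""
    return {
        knob: value
        for knob, value in recommendation.items()
        if (knob.lower(), str(value).lower()) not in UNSAFE
    }

print(apply_guardrails({"shared_buffers": "8GB", "fsync": "off"}))
# -> {'shared_buffers': '8GB'}  (the durability-for-speed trade is filtered out)
```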
Two more questions.
One of them is about
the context
in which OtterTune enters the picture. Dana, I think you mentioned
previously, which makes total sense, like, okay, you're tuning when there's some sort of problem,
right? Do you see OtterTune moving the conversation to a place where you're
implementing this ahead of time to avoid those problems in the first place and sort of changing
tuning from a conversation about we have a problem, you know, in performance costs, whatever it is,
to, hey, we're going to use this to actually sort of, you know, you mentioned this earlier,
Andy, like you're running a Django app, like you should just do these things so that you get,
you know, really good performance out of the box?
Yes, definitely. I think that proactive tuning benefits, you know, all of our customers, essentially. So I would put our customers in kind of two buckets. A portion of our customers are, you know, ones that, like Andy mentioned, are developers. This is a lot of companies, maybe small or medium-sized companies, where nobody's directly managing the database. And then the other group would be those with a dedicated database administrator, one or more, that are doing performance tuning sometimes. But like you mentioned, it's not proactive tuning.
So in both cases, it's beneficial.
You know, in the developer case,
you just kind of, you want to write code.
You want to do your engineering job.
You don't want to be pulled back into this database,
you know, to always solving performance problems.
So it's super beneficial there.
Just like Andy mentioned,
we're moving more towards
a holistic view of the database where it's just, you know, peace of mind. So that's really helpful
in that case. In the other case, when you have a large fleet of databases or even a medium fleet,
the amount of, you know, time that you can spend tuning them is just not very significant.
So this is also very beneficial to people even who have dedicated DBA groups.
Absolutely.
Okay, last question, because I know we went a little long here,
but this has been such a great conversation.
State of the art, how can you not, right?
Yeah, exactly.
Exactly.
And our producer's gone, so we're definitely going to go long.
The name OtterTune. So Andy, you mentioned that Dana came up with it. I love the theme. I love
the brand and I love the stories around naming things. So give us the backstory on the name.
Dana, you do the name and I'll talk about the vision.
Okay. Okay, great. Yeah. So Andy and I were sitting in his office.
In fact, we were sitting in the same place where he's sitting right now in person.
This was, I think, before the pandemic, of course.
And we were trying to come up with a name, which is just a huge deal in research.
You know, what are you going to name your system?
So we really wanted to come up with something that we both liked. And my husband and I had recently visited San Francisco, and my favorite part of it was the otters, because they were doing really cute stuff, so we bought me an otter t-shirt and I was wearing it. And so it just occurred to me, like, what about OtterTune? Because, you know, animal names are kind of fun. Like, it's always nice to have an animal mascot. I don't know, it just kind of came together in a weird way. But of course, it's, you know, kind of a play on Auto-Tune. So it's fun.
I love it.
So yeah, so that's the origin of the name
and then when we formed
the company
during the pandemic
I was, you know,
sending emails
to like investors
and so forth
and I was watching
the Wu-Tang Clan documentary
the Showtime one
and the RZA's like,
you know,
we're going to do
the first Wu-Tang album
that we're going to do
the solo albums
and the record label
and the clothing line
and I was like,
oh man,
we should do the same thing with OtterTune, right? So there is an OtterTune record label. We haven't put out the clothing line yet.
Oh, yes.
So, like, that's... we want to go for this hip-hop theme for the branding of the company. Because, also, when I was looking at, you know, you look at VCs' websites, you look at all the logos, and they all have these thin fonts and these pastel colors, and everything just looked the same. And I was like, I want to do something where it was not
clear again, whether we were like a record label, an art studio, or like a tech startup.
That was the angle there. I love it so much. That is marketing brilliance at its best. So
this has been so wonderful. We have covered so much ground, but we'd love to have you back on to hear about the next version of OtterTune when it comes out and you get it into production.
Yeah, I'd love to do it.
What a fascinating conversation with Andy Pavlo and Dana Van Aken of OtterTune. Costas, where do I begin?
I mean, of course, maybe the best part of the show was hearing about the name OtterTune
and the influence of the Wu-Tang Clan on their brand. So I think listeners are going to love the show just for that, obviously.
But it was also really interesting for me to learn about tuning.
And, you know, the complexity of tuning and the skill of tuning, I thought was a really
interesting conversation and really informed why something like OtterTune is so
powerful, because it can take so many more things into consideration than, you know, a human changing one knob and then waiting to see the result on the entire system.
Yeah, 100%. I think, like, okay, we learned a ton through the conversation with Andy and Dana.
A few things I want to keep from the conversation is hearing from them
that the tuning problem is not just, let's say, an algorithmic problem.
The algorithms that you use to do the machine learning or whatever is one thing,
but it's equally an observability problem.
And maybe it's even harder to actually figure out like how to obtain the data
that you need, how to collect this data, like from live database systems, making
sure that like they have the right data and like all these parts, which I think
is like super interesting.
And how much of like, okay, having the technology, again, it's like one thing, building a product
is another thing, but like figuring out like the right, let's say, balance between like
the technology itself, what the technology can do and how to involve the human factor
in it, right?
By providing recommendations or best practices.
That was super interesting to hear from both Dana and Andy, that OtterTune is not just optimizing, let's say, based on the metrics that they collect and the knobs that they have access to, but they also combine domain knowledge and best practices from running database systems
to inform the user on how to go
and do the right thing at the end.
And I think the example that they gave was,
yeah, sure, if you go and turn off backups,
it will be faster.
Yeah.
But is this what you want to do?
Yeah.
Yeah.
It's like saying you can take the airbags and seatbelts out of your car and it will weigh less.
Yeah.
Yeah.
Do you want to do that?
Yeah.
So yeah.
Amazing conversation.
I would encourage everyone to listen to it, and hopefully we'll have them again in the future to talk more about database systems and what it means to start a company and a music label, right?
Because
yes, a company and a music label.
So we'll definitely have to have them back on.
Thanks for joining us again on the Data Stack Show.
Subscribe if you haven't.
Tell a friend.
Post on Hacker News so we can try to get on the first page.
And we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me,
Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.