Orchestrate all the Things - OtterTune sets out to auto tune all the databases. Featuring CEO and co-founder Andy Pavlo

Episode Date: May 12, 2021

Tuning databases is key to application performance and stability, but it's a hard job. Auto-tuning helps, but until now it was reserved for the Oracles and Microsofts of the world. OtterTune wants to democratize this capability. Databases are the substrate on which most applications run. Although different applications have different needs served by different databases, they all have one thing in common: they are complex systems that need continuous fine-tuning to work optimally. Databases come with a plethora of parameters that can be tuned by "turning knobs". Traditionally, this has been the job of Database Administrators (DBAs). Their job is a hard one, as they need to know the specifics of the database, the hardware it's running on, and the workloads it serves. Some database vendors, like IBM, Microsoft, and Oracle, have taken steps to automate this work. OtterTune is a startup that wants to democratize this capability. Today OtterTune is announcing the private beta of its new automatic database tuning service, as well as an initial $2.5 million seed funding round led by Accel. Article published on ZDNet.

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Databases are the substrate on which most applications run. Although different applications have different needs served by different databases, they all have one thing in common: they are complex systems that need continuous fine-tuning to work optimally. Databases come with a plethora of parameters that can be tuned by turning knobs. Traditionally, this has been the job of database administrators.
Starting point is 00:00:32 Their job is a hard one, as they need to know the specifics of the database, the hardware it's running on, and the workloads it serves. Some database vendors have taken steps to automate this work. OtterTune is a startup that wants to democratize this capability. Today, OtterTune is announcing the private beta of its new automatic database tuning service, as well as an initial $2.5 million seed funding round led by Accel. We caught up with OtterTune CEO and co-founder Andy Pavlo to find out more. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
Starting point is 00:01:12 My name is Andy Pavlo. I am an associate professor of computer science at Carnegie Mellon University. I'm also the co-founder and CEO of the OtterTune company, and OtterTune is an automated database tuning service that targets configuration knobs. So my background is in databases. I did my PhD at Brown University under Stan Zdonik and Mike Stonebraker, and since 2013 I've been at Carnegie Mellon. My main research is focused on two tracks, all within this umbrella of autonomous databases. The idea is that databases are complex pieces of software.
Starting point is 00:01:57 They are sort of the bedrock of almost every modern application that's out there today. But they're really challenging pieces of software. There are a lot of moving parts going on inside the system, and it's not something that the average programmer or the average administrator is going to know how to work with. So my research is divided into two thrusts. One is what we'll call black-box optimization for existing database systems. The other is white-box: if I build a database system from scratch with the idea that I want it to be autonomous, how would I design the system, and what would be different from
Starting point is 00:02:50 the traditional approach that people use to build database systems today? So OtterTune falls under the first category; it's a black-box optimizer, and we specifically focus on knob configuration tuning. A knob is a runtime parameter that the database system exposes to you that allows you to change the behavior of the system, like caching policies, buffer sizes, and so forth. And the reason why these knobs exist is because, as the developers were building the database system, they would add a new feature, and then there's some decision they have to make about how they should allocate memory for something, or how the new
Starting point is 00:03:34 feature should operate. And rather than hard-coding the value, they expose it as a knob, because they assume someone is going to come along who is more knowledgeable about how to tune it. And of course, over time, all these knobs accumulate, and it becomes very difficult to manage. So that's how OtterTune falls under the black-box optimization approach. For the white-box optimization, I have another research project that is separate from the company, NoisePage. There, we're trying to build a system from scratch, assuming that we want it to be entirely autonomous.
Starting point is 00:04:05 So we're pitching this as a self-driving system, because the goal is to have it completely hands-off, so that you don't need a human to make any decisions. So now your question is: why do a startup, given that I don't have any startup experience? As I said, one of my PhD advisors was Mike Stonebraker, and Stan was involved in this as well. Mike is the perpetual entrepreneur, and I think he always did a good job of having one foot in academia and one foot in industry. And I'm sort of following in his footsteps, trying to go down the same path.
Starting point is 00:04:44 Again, what's nice about doing a startup separate from the research is that it exposes you to more problems, more challenges, more interesting use cases than you may get while just sitting in academia. So that's why we decided to do it. Yeah, that's absolutely true. What people in venture capital actually look for, and what I also personally found surprising in that case, is that, well, you may have people such as yourself or your co-founders who are really knowledgeable about the issue they want to address, but obviously running a business, and a startup for that matter, requires a different skill set, and it's probably not the best use
Starting point is 00:05:31 of your time to do the business administration and business development and all of those things. So typically what happens when you get VC investment is they appoint someone to your board, or they give you advisors, and I wonder if this is also the case here. Not yet. We're announcing our seed round, and this is a smaller amount of money, from Accel; they did not get a board seat. But we'll be raising a larger round later this year, and then, yes, at some point
Starting point is 00:06:07 we'll bring on more professional management to help me out with the administrative things. The thing that was interesting about OtterTune was that we had the SIGMOD paper come out in 2017. SIGMOD is one of the top peer-reviewed conferences
Starting point is 00:06:23 in the field of databases. When that paper came out, other academics were like, oh, this seems like a cool idea, but nobody in industry really paid attention to it. Then we got invited to write a guest blog article for Amazon's new machine learning blog that had just started at the time, and that was basically a summarization of the academic paper. When that came out, everybody started emailing us saying: hey, we have this exact problem; we'll give you money to fly a student out to help set it up and get it up and working on our
Starting point is 00:06:55 databases. And this happened so many times that, one, it was overwhelming; as a new professor, I just couldn't handle it. You know, I was not positioned to manage basically a consulting company while having students trying to graduate. But this was a clear signal, while we were doing the research, that we were on the right path, and that this was something we could turn into a startup.
Starting point is 00:07:20 Yeah, and I love that as well: customer-driven demand. Yes. So I have a really kind of simple, and maybe a little bit stupid, question: why OtterTune? What's in the name? So OtterTune is a play on words on the term auto-tune. Auto-tuning is a common technique used in the music industry to make someone sound better, or a singer be more on pitch than they normally would.
Starting point is 00:07:47 So we just thought it'd be kind of cute to say, hey, we like otters. They look cute. Let's call it OtterTune. It turns out otters are actually kind of vicious animals. They look fuzzy and cute, but they're actually not very friendly. Well, I hope that doesn't hold for the company. And so far, I'm pretty sure it doesn't hold for you, at least. Yes.
Starting point is 00:08:11 And well, you mentioned that you had all this demand, basically, after you published the paper. I presume you're talking about the paper I saw referenced in an article, from SIGMOD 2017, right? Yes. Okay. So I wonder if you have any paying customers already. And ideally, if you could mention a few names, or if not names, maybe some investors, that would be great.
Starting point is 00:08:42 So as of right now, we don't have any paying customers. What we're announcing on Wednesday and Thursday this week is the new commercial version of OtterTune running as a SaaS, a software service, in the cloud. What we've had before, and what we've had people running on, is the previous academic version that came out of the university, which one of our co-founders, Dana Van Aken, built as part of her dissertation research. So probably our biggest deployment of OtterTune
Starting point is 00:09:16 that we can publicly talk about was at Société Générale. It's a major French bank, and we actually have a VLDB paper, published this year, that describes what it took to get OtterTune running on premise for their deployment. That's been our biggest deployment that we can publicly talk about. We have another deployment at a major travel website that we haven't publicly talked about yet, but that's something we hope to announce pretty soon. I'm happy to talk about what's going on there as well.
Starting point is 00:09:52 Okay, cool. All right. I think we've covered the peripherals, let's say. So maybe it's a good time to start actually getting into how it all works. Just by reading what is publicly available, what you do was quite clear from the outset, and you did a very good job of explaining yourself,
Starting point is 00:10:17 even for people who haven't had the chance to be exposed to that before. But how you actually do that, even though you have a bunch of published research, as you pointed out, was not equally clear to me, at least. So initially I thought that, well, it kind of made sense that you probably use some kind of machine learning approach to find, you know, all those knobs that you refer to. But then I read something that kind of threw me off track. It was the fact that you mention on your website that you don't need to examine an application's data or queries to work.
Starting point is 00:10:59 And that got me scratching my head, to be honest, because I would presume, at least, that that would be the way you would actually attach to specific workloads and optimize for those. So I wonder if you can elaborate a little bit on how you actually do what you do. Absolutely, yes. So the way OtterTune works is that we connect to the database and, through standard SQL commands, we retrieve the current configuration of the system, like the current knob settings, and the metrics of the system. The metrics are these internal performance counters
Starting point is 00:11:38 that every database system maintains to keep track of what work they're actually doing. So things like pages read, pages written, locks held, number of queries, latency, and things like that. Every database system maintains these; each has its own set of metrics. The names from one system to the next might be different, like it might be pages read versus bytes read, but at a high level, they're all doing the same thing.
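To make the collection step concrete: the episode doesn't show OtterTune's actual code, but the standard SQL introspection commands that expose knobs and metrics in MySQL and Postgres look like this. The commands are real; organizing them in a lookup table is our illustrative assumption:

```python
# Illustrative sketch: standard SQL introspection commands that expose
# configuration knobs and runtime metrics in two popular systems. The
# commands are real MySQL/Postgres commands; the structure is an
# assumption about how a tuning agent might organize them, not
# OtterTune's actual code.
INTROSPECTION_QUERIES = {
    "mysql": {
        "knobs":   "SHOW GLOBAL VARIABLES;",  # e.g. innodb_buffer_pool_size
        "metrics": "SHOW GLOBAL STATUS;",     # e.g. Innodb_pages_read
    },
    "postgres": {
        "knobs":   "SELECT name, setting FROM pg_settings;",
        "metrics": "SELECT * FROM pg_stat_database;",  # per-database counters
    },
}

def queries_for(system: str) -> dict:
    """Return the knob/metric collection queries for a supported DBMS."""
    if system not in INTROSPECTION_QUERIES:
        raise ValueError(f"unsupported system: {system}")
    return INTROSPECTION_QUERIES[system]
```

As Pavlo notes next, the counter names differ between systems, but they carry the same kind of signal.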
Starting point is 00:12:01 They're basically telling you what the database system did while it executed the application's queries. And the reason why we don't need to see any user data or any of the user queries is because those metrics themselves, we have found, are representative, or emblematic, of what the queries are actually doing, and therefore that's enough of a signal for our machine learning models to then figure out how to optimize it. So we record these metrics at different frequencies, and then we store them in our internal repository that keeps track of all the training data
Starting point is 00:12:35 or all the metrics and configurations we've collected from every previous training session. Now we segment that data based on the database system type. Is it MySQL? Is it Oracle? Is it Postgres? As well as the version. So, you know, MySQL 5.7 versus 8, those have to be separate. And then we also include the context about the hardware that's running, you know, number of cores, amount of RAM, disk type, disk speed, so forth. And so with that, we then train statistical models on that segmented data that can predict how the database system is going to perform as you change the values for
Starting point is 00:13:17 the configuration knobs. And these metrics are just a signal telling you what is actually happening: if I tune a knob a certain way, does that cause the amount of disk reads I'm doing to go up or down, right? Then we can map that to an objective function to determine whether we were actually making the system better or not. And the objective function is just another metric:
Starting point is 00:13:40 queries executed per second, 99th percentile latency, or whatever you want. We train those models based on these metrics we've collected, and then we run a recommendation algorithm that can generate a new set of values for the configuration knobs that the system thinks is going to improve your objective function. We then take the new knob recommendation, apply it to the database, and repeat the process over again. It's sort of a loop that keeps going, trying out
Starting point is 00:14:11 recommendations and seeing whether they make sense. And over time the models converge, and you can learn that, okay, here's the optimal setting for my configuration. But at a high level, that's what we're doing. It's obviously a bit more complicated, because there are other nuances of both the machine learning algorithms and the database that you have to be mindful of, which makes this sort of non-trivial to do. Yeah, it definitely is non-trivial.
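The observe, recommend, apply loop Pavlo describes can be sketched in a few lines. The real service fits statistical models (Gaussian Process regression, per the published research); this toy substitutes random search over one hypothetical normalized knob against a synthetic objective, just to show the shape of the loop:

```python
import random

def observe(config):
    """Stand-in for running the workload and measuring the objective
    (e.g. throughput). Synthetic: performance peaks at buffer_pool=0.7."""
    return -(config["buffer_pool"] - 0.7) ** 2

def tune(iterations=50, seed=0):
    """Toy observe -> recommend -> apply loop. OtterTune's recommender is
    model-based; here 'recommend' is plain random search, and the 'model'
    is just the best configuration seen so far."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = {"buffer_pool": rng.random()}  # recommend a config
        score = observe(candidate)                 # apply it and measure
        if score > best_score:                     # keep it if it helped
            best_cfg, best_score = candidate, score
    return best_cfg
```

The point of the model-based approach over this naive search is exactly the convergence Pavlo mentions: each observation informs the next recommendation rather than being thrown away.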
Starting point is 00:14:37 And especially considering how big the combinatorial space of all these databases, all these versions, all these hardware configurations, and so on, is. So I wonder, well, I have a ton of questions, but let's start with two. Let me line up two for you.
Starting point is 00:14:56 So A, you mentioned a few database systems that you work with, and I wonder if you can give us the entire list. And B, even though it sounds like you use a kind of mixed approach, let's say, so not purely machine learning, but with a heavy inductive bias, I would say; it sounds like you definitely apply domain knowledge in what you do as well. So where do you get the data sets to train your machine learning models to begin with? Okay, so the commercial
Starting point is 00:15:44 version of OtterTune that we're announcing this week will support MySQL and Postgres running on Amazon RDS. And the reason why we chose RDS is because it reduces the dimensionality of the hardware we have to deal with: there's a fixed number of instance types. Yes, you can add additional capabilities to your instance by increasing the amount of provisioned IOPS, but it's not arbitrary hardware provisioned from a bunch of different vendors; it's a fixed number of instance types. So that reduces the scope of the hardware landscape we have to deal with.
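The "fixed menu" point is the key simplification: because RDS offers an enumerable set of instance types, hardware context collapses to a lookup instead of arbitrary specs. A sketch of segmenting training data by system, version, and instance type (the instance specs below reflect AWS's published numbers for these types, but the key schema is our illustrative assumption, not OtterTune's actual data model):

```python
# Illustrative sketch of training-data segmentation. A fixed catalog of
# RDS instance types stands in for arbitrary hardware. session_key is an
# assumption for illustration, not OtterTune's actual schema.
RDS_INSTANCE_SPECS = {
    "db.m5.large":  {"vcpus": 2, "ram_gib": 8},
    "db.m5.xlarge": {"vcpus": 4, "ram_gib": 16},
}

def session_key(dbms: str, version: str, instance_type: str) -> tuple:
    """Segment observations so that, say, MySQL 5.7 data never mixes with
    MySQL 8.0 data, or m5.large data with m5.xlarge data."""
    if instance_type not in RDS_INSTANCE_SPECS:
        raise ValueError(f"unknown instance type: {instance_type}")
    return (dbms, version, instance_type)
```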
Starting point is 00:16:31 The academic version supports Oracle, and that's on our roadmap to support later in the summer. We've also been asked a lot to support Amazon Aurora, their flavors of MySQL and Postgres running on Aurora. And that's something we hope to add in the next month or two. We don't think that would be a major change because the front end of those systems is still, you know, like Postgres and MySQL, just there's some additional knobs that we'd have to add.
Starting point is 00:16:57 So there's nothing in the machine learning algorithms, as you said, that is specific to a database system, right? The algorithms just see numbers. They don't know, like, oh, it's MySQL versus MongoDB, right? And this sort of leads into your second question. The thing that was tricky, and takes some time on our part, is setting up the infrastructure
Starting point is 00:17:21 to collect the data that you need, in a certain way, from these different database systems, and being aware of what the effects of some of these knobs are, and whether they're the right thing to tune. What I mean by this is that there are some knobs where, if you set them a certain way, you will get amazing performance, but it could jeopardize the durability or the integrity of the data. And therefore it's not something we want a machine learning algorithm to tune automatically, because it doesn't know what the external costs are of setting those knobs.
Starting point is 00:18:02 So for example, in MySQL, and most databases, there's a knob to turn off flushing writes to disk when a transaction commits. When your transaction commits, you write to the log, then you do an fsync to make sure that everything is durable on disk before you report back to the application that your transaction is committed. So OtterTune can easily learn that if you turn off that disk flush, your database is going to go amazingly fast, because you're not waiting on the disk. But if you now crash, you may end up losing the last five to 20 milliseconds of data that was sitting in your disk buffer before it got written to disk. And OtterTune doesn't know whether it's okay for you to lose those last few milliseconds of data. So that's an external cost that we don't want our algorithm to tune automatically.
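The durability example maps to real parameters: MySQL's `innodb_flush_log_at_trx_commit` and Postgres's `fsync` and `synchronous_commit` all trade crash safety for speed. A sketch of the deny-list idea that follows (the knob names are real MySQL/Postgres parameters; the particular list is illustrative, not OtterTune's actual deny list):

```python
# Knobs that trade durability or integrity for speed; a tuner should
# leave these to a human. Knob names are real MySQL/Postgres parameters,
# but this particular deny list is illustrative, not OtterTune's.
DENY_LIST = {
    "mysql":    {"innodb_flush_log_at_trx_commit", "sync_binlog"},
    "postgres": {"fsync", "synchronous_commit", "full_page_writes"},
}

def tunable_knobs(system, all_knobs):
    """Filter out knobs whose side effects (e.g. losing the last few
    milliseconds of commits after a crash) require human judgment."""
    denied = DENY_LIST.get(system, set())
    return [k for k in all_knobs if k not in denied]
```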
Starting point is 00:18:50 A human has to make a value judgment about what's the right thing. So we basically put those kinds of knobs on the deny list for the service. For every new database system we want to bring on, we have to go through a due diligence process and understand what exactly every knob is doing, and whether there are unforeseen implications that OtterTune may not be aware of. We then have to make sure that we don't cause people problems that are not immediate,
Starting point is 00:19:24 but happen later on. So that's sort of the challenge of bringing on a new database system. There are other things too, like: if you set a knob incorrectly, what's the behavior of the database system? Some systems will crash, some systems will refuse to start, some systems will revert it back to a proper value. So it takes a little bit of time and, as you said, domain knowledge about how the database system actually works when we want to onboard it and get it safely hooked into OtterTune's tuning process. That's the sort of challenge we face
Starting point is 00:20:01 in bringing on a new system. And so we've done this for Oracle, we've done this for Postgres, we've done this for MySQL, and so we're comfortable with those systems. The commercial version will support those in the beginning, or fairly soon after. Bringing on other systems will be mostly driven by demand and what people ask us to support. And as I said, Aurora is the most common one that people are asking us to support. Yeah. Thanks.
Starting point is 00:20:29 That was really great foresight there, because obviously I was going to ask you about the onboarding process, and presumably part of your future plans would be to onboard more databases. So you did a very good job of explaining the hard part of doing that. It sounds like you're mostly going to stick, for the time being at least, with SQL databases, and I guess a lot of it probably has to do with the fact that each NoSQL database is kind of its own beast, so the onboarding cost would
Starting point is 00:21:07 be considerable. But then again, I don't think it would be more considerable than that of a SQL database, right? We've actually modified and extended OtterTune to support tuning Linux kernel parameters. You know, it's certainly not a database, but for Postgres, the installation guides and recommendations are that you tune some of the Linux kernel parameters a certain way. So we had a master's student at the university extend OtterTune to support tuning some of the kernel parameters. And it makes a difference. And again, from the algorithm's perspective,
Starting point is 00:21:47 it doesn't know that it's tuning something for the kernel and not the database system; it's just treated as another config knob. So we think extending OtterTune to support these other systems is not going to be too bad. It's just, like I said, the onboarding process that we want to be very careful about. The one thing that OtterTune currently cannot support, and something we want to
Starting point is 00:22:11 pursue in the future, is tuning knobs that are table-specific or database-specific. Right now it can only tune global knobs. For MySQL and Postgres and a bunch of other databases, that's okay. But in some systems, like MyRocks (MySQL using RocksDB), there are knobs you can tune for specific tables. And then the challenge is that you can't easily reuse that training data across different databases. So one of the big advantages that we're pushing with OtterTune, and this goes back to your question about how we get the training data, is that we set up the algorithm such
Starting point is 00:22:47 that we can reuse the training data from previous tuning sessions to tune new databases. This cuts down the amount of training time significantly. The idea is that if you come along with your database today, and we've never seen it before, but our algorithms can determine that it's very similar to a database that we tuned last week, then because we know how to tune that one, we now know how to tune yours more efficiently and more quickly. The thing with some of the NoSQL systems and other systems that are out there is that if they support knobs that you can set on individual entities or objects in the database, like a table, an index, or the database itself, then it's very
Starting point is 00:23:23 difficult to reuse that training data from other databases. Because, say, your database has three tables and my database has four tables; I can't use the training data for your three tables on my four-table one, because the feature vector will be different. And that's sort of the tricky thing we haven't quite figured out yet. Yeah, it definitely is tricky. And it does make sense, the way you've chosen to approach it; relying on those global metrics, let's say, is probably the best way to approach it. Now, one of the vendors whose databases you can already auto-tune, apparently, is Oracle. And for most people, myself included, when they hear about autonomous databases,
Starting point is 00:24:12 I guess Oracle is the prime example, the first thing that comes to mind. Well, apparently there are a couple more; I just found out myself, I didn't know it was the case, but it seems like SQL Server also has some kind of auto-tuning going on. And so I wonder how your solution compares to those, especially in the case of Oracle, because it seems like it's almost a head-to-head comparison. So people using Oracle already have the option
Starting point is 00:24:43 of leveraging their autonomous database. So why would they choose to use OtterTune instead? Right. Let's first talk about what Oracle and Microsoft, and actually IBM, which is in this space as well, are doing, and then I'll describe why OtterTune is different. So in the early 2000s there was a big movement towards what were called self-tuning or self-managing systems. This problem has actually been worked on for decades; it goes back to the 1970s. There was a big push into what were then called self-adaptive systems, because people recognized that with the relational model,
Starting point is 00:25:25 if you abstract away, through a logical layer, what the actual physical implementation of the database is, then someone needs to make a decision on how to optimize that system. So there was a bunch of early work in the 1970s on doing auto-indexing and some auto-partitioning work. Actually, one of my advisors wrote a paper on this in 1974.
Starting point is 00:25:52 This is not a new area; people have been working on this for a long time. What's different now is that people are applying machine learning techniques to try to automate it. The work that was done in the 2000s by Oracle, IBM, and Microsoft was really about advisory tools for human DBAs. There's the AutoAdmin work from Microsoft, which I still think is the vanguard in this research area. Oracle has similar things. What they had in the 2000s were tools that would say, hey, I think you should build these indexes,
Starting point is 00:26:32 or hey, I think you should partition your data a certain way. And then they relied on a human to make a judgment, to decide, yes, that's what I want to do, and click the OK button to go ahead and deploy it. Because the human knew when the right time to do it was, and what the workload actually looked like. Again, they were just doing recommendations. What they have now is basically the same methods, or the same tools at a high level, that they had before. It's just that now, instead of a human clicking okay, the software itself clicks okay and
Starting point is 00:27:04 applies the changes. And so Oracle has taken a lot of the same tools that they developed for the DBAs back in the 2000s, and now they're running them for you in a managed environment. They've extended them and improved them by replacing some of the core algorithms with machine learning models,
Starting point is 00:27:19 which I think is the right approach. But at a high level, they're doing the same thing. And I'm not going to denigrate them; I think it's definitely a cool project. But I don't really see it as being a massive game changer, right? They're just doing some of these things for you and sort of closing the loop. But it's very reactive rather than proactive. It's fixing things as they occur, or after they occur, rather than saying: this is what's going to happen in the future.
Starting point is 00:27:53 Let me go ahead and make these changes. So OtterTune falls into this category as well. OtterTune is a reactive system. It looks at what's happened in the past and tries to make things better now, assuming that the future is going to look similar to the past. And OtterTune is independent of the database system itself. Like I said, we support Oracle, we support PostgreSQL, MySQL. We can extend it to support other databases whose vendors are not going to have a large budget
Starting point is 00:28:23 or a large team working on them, building these kinds of tools, the way Oracle and Microsoft have. So we sort of see this as serving the part of the database landscape that is not going to have these technologies. So I think we are complementary, we are adjacent to what the other guys have done, but we're supporting systems that they're simply not going to support. That's how I see this. Yeah, that makes sense.
Starting point is 00:28:55 Well, thanks. We've covered a lot of ground, and you went into quite some detail, which I greatly appreciate. And actually, you already kind of spoke to your future plans as well; you mentioned raising more money, and I guess with that also comes growing the headcount of the company. Are there other specific future plans that you'd like to refer to? No, I think that,
Starting point is 00:29:27 I mean, at least for the next year, it's building on our strengths. It's going to be expanding the team that we have. It was surprisingly easy to hire; we grew very, very quickly. Everyone was warning me: oh, you're doing a startup,
Starting point is 00:29:41 it's going to be impossible to hire, it's going to be the biggest challenge. I just went and hired all my best former students from Carnegie Mellon University. It was fantastic. So we're about 12 people now; we hired eight people in a single month. And so, you know, with the launch we're doing this week, we'll see what the feedback is that we get from the community. We plan to raise a larger round this year, and then we'll just grow bigger from then.
Starting point is 00:30:15 I think there's a lot of tooling and other things we have to start building in the next year to help make it easier for people to start using OtterTune, even more so than it is now. But beyond that, whether we start dabbling in, or looking into, other things like AutoAdmin does, like physical design, picking indexes, doing query tuning, that's something we can assess later down the line. In the short term, we're focusing on knob tuning, because we think that
Starting point is 00:30:44 is a large enough market. There's a surprisingly large number of people that are still running with the Amazon RDS default configuration, which Amazon has already tuned, but it's sort of not tailored for each application. So again, we think there's enough meat on the bone for us in the short term to focus on just knob tuning, and we think that'll be okay. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration
Starting point is 00:31:13 on Twitter, LinkedIn, and Facebook.
