Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Ryan Marcus

Episode Date: May 20, 2024

Welcome to the first episode of the High Impact series! The High Impact series is inspired by the blog post "Most Influential Database Papers" by Ryan Marcus, and today we talk to Ryan! Tune in to hear about Ryan's story so far. We chat about his current work before moving on to discuss his most impactful work. We also dig into what motivates him and how he handles setbacks, as well as getting his take on the current trends.

The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

Links:
Most influential database papers
Ryan's website
Ryan's twitter/X
Bao: Making Learned Query Optimization Practical
Neo: A Learned Query Optimizer

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. As usual, I'm your host, Jack Wardby. The podcast is brought to you by Pometry. Pometry are the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Raphtory supports time traveling, multi-layer modeling, and comes out of the box with advanced analytics like community evolution, dynamic scoring and temporal motif mining. It is blazingly fast, scales to hundreds of millions of edges on your laptop and connects directly to all your data science tooling, including Pandas, PyG and LangChain. Go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for the new 0.80 release. Today is a little bit different to the usual episode, which normally focuses around a specific paper. We're actually going to be talking to an individual about their
Starting point is 00:01:11 career. And this has been inspired by a blog post called The Most Influential DB Papers, which breaks the most influential work in the DB community across paper year and the individuals. And I'd like to say that I'm joined by the author of that blog post today, Ryan Marcus. So Ryan, welcome to the show. Thanks. Appreciate being here. Great. So if I could tell you a little bit more about Ryan before I'm sure he tells us his story. But Ryan is an assistant professor at the University of Pennsylvania, which was actually where the first episode of this podcast was filmed. So you should go back
Starting point is 00:01:44 and listen to episodes one through to five. They were recorded live in the convention center in Philadelphia. Yeah, now Ryan said, you can correct me if I'm wrong here, Ryan, but I have your sort of research interest down as primarily being ML for systems, right? So with a focus on databases and clouds. So yeah, very short introduction through there. But yeah, tell us more about yourself. Thanks. I mean, yeah, that's about the extent of it. New faculty at UPenn, I just started
Starting point is 00:02:10 this year. My lab is trying to do what we've been calling next generation data systems, trying to build data systems that can adapt to changes in environments and data layouts, trying to build data systems that can invent new algorithms or approaches to solve problems autonomously, and data systems that are intentional, that go beyond kind of just SQL query in, result out, but have a deeper understanding of user requirements and especially performance. So, you know, it's a young lab. We're still getting started.
Starting point is 00:02:39 I haven't done too much yet, but it's an exciting time. It's great to be getting started. Yeah, it's a bright future ahead for sure. So let's jump off. Let's talk about this blog post. And so the most influential database papers, what was the motivation behind you doing it? And yeah, kind of for the listener who hasn't read it, we'll obviously link it in the show notes. Tell us about the blog post. Yeah. So back in 2016, Mike Stonebraker was giving a talk at the New England Database Day. And if you don't know Mike, he's a very direct guy. He says what he means, means what he says, and he often puts it in particularly colorful ways. And Mike's talk at this particular New England Database Day was about the hollow core, as he put it, of the database community. He challenged the audience and he said, I challenge any of you PhD students, professors, whoever you are, to name one seminal database paper that has come out after the year 2010. And of course, somebody put their hand up and they said, what about MapReduce? And then Stonebraker went on
Starting point is 00:03:41 this whole rant about how much he hates MapReduce and, you know, et cetera, et cetera. But, you know, that sort of sat with me for a long time about, you know, wow, is it really true that there haven't been any like core seminal papers in so long? Is there really this kind of hollow core where no one is working on the essential problems in data management anymore? And as academic in training, I did what all academics do and I procrastinated it for, you know, seven or eight years. But then eventually I decided, okay, let's take a look at this. Let's try to actually take a data-focused approach to try to see like, okay, what have the past, what have the most seminal
Starting point is 00:04:14 papers been? What have the most influential papers been recently? So I went about it in probably the most database-y way possible of solving this problem. I scraped the citation graph from Semantic Scholar and a couple of other different places, along with DBLP, both extremely excellent resources that you should check out if you've never seen them before. And I tried to reconstruct this citation graph in a Jupyter notebook, ran PageRank on it, and then looked at the weight of each node and asserted, okay, yeah, that's an approximation of the impact or the influence of this particular paper. And, you know, the result is this sort of loosely ranked list of kind of like what papers
Starting point is 00:04:53 have been the most influential and, you know, what authors have created those papers. A lot of data processing and sort of cleaning was required to, like, get all the entity matching correct. I excluded self-citations because if you don't do that, you get all sorts of really weird clusters and loops. There are many instances among SIGMOD papers where paper A cites paper B and paper B somehow cites paper A. And you're like, how did that happen? But obviously if someone's submitting a couple of papers together, then maybe they up their citation counts a little bit. You exclude all those, you factor all that out, and you end up with quite a nice graph. And you can read trends out of that
Starting point is 00:05:29 graph quite easily. So if you filter down to just the papers past 2010, you see, for example, a ton of work on crowdsourcing, a ton of work on human-in-the-loop data exploration, which to the slightly newer folks in the DB community like me is sort of like, oh yeah, I guess to me it's felt like that has always been around, but it was quite new at that time. And if you look at 2015 and onward, you see this big shift towards machine learning for database systems and database systems for machine learning. So papers like Snorkel and HoloClean, and Andy Pavlo's work on large-scale machine learning models for tuning databases, and Tim Kraska's work on learned index structures. So the conclusion of my experiment: I haven't talked to Mike since writing the blog post, but I do think there have in fact been several major seminal papers since 2010. And
Starting point is 00:06:16 it's quite interesting that the data-driven methods that have been predominantly developed by our community were in fact the methods that were able to identify what these influential papers were. Yeah, so that was sort of goal number one with that particular blog post. Um, you know, mission accomplished. I mean, that's funny, I'd love to get Mike Stonebraker's take on it and say, like, you've been proved wrong, the evidence is here, it shows it. Yeah, yeah, yeah, I'll let him know. I'll email him the link to this podcast and say, Mike, you need to check this out. But that's awesome. Yeah.
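(A minimal sketch of the ranking experiment Ryan describes: build the citation graph, drop self-citations, and run PageRank over it. The input file and column names below are hypothetical placeholders; the real pipeline used Semantic Scholar and DBLP dumps plus a lot of entity matching and cleaning.)

```python
# Minimal sketch of the citation-graph ranking described above.
# "citations.csv" and its columns are illustrative, not the actual data files.
import csv
import networkx as nx

graph = nx.DiGraph()
with open("citations.csv") as f:
    # Expected columns: citing_id, cited_id, citing_authors, cited_authors
    for row in csv.DictReader(f):
        citing_authors = set(row["citing_authors"].split(";"))
        cited_authors = set(row["cited_authors"].split(";"))
        # Skip self-citations (shared authors) to avoid the weird mutual-citation
        # clusters and loops mentioned in the conversation.
        if citing_authors & cited_authors:
            continue
        graph.add_edge(row["citing_id"], row["cited_id"])

# Edges point from citing paper to cited paper, so PageRank weight flows toward
# papers that are cited by other highly ranked papers.
scores = nx.pagerank(graph, alpha=0.85)
for paper_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{score:.6f}  {paper_id}")
```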
Starting point is 00:06:47 I mean, it'd be, actually, you're just looking at, I'm looking at the blog post now on my other screen. And it is really interesting, just sort of looking back at some of the stuff in there. I don't know, the most influential one of all time, I guess, is one by Pat Selinger, right? Which is the access path selection in a relational database management system paper. And these other ones, they're so different to what you see in conference proceedings today, right? But yeah, no, that's awesome. And absolutely, yeah, it's the
Starting point is 00:07:13 inspiration for this show and trying to get some of these folks on to come and talk about their work. So I guess with that, let's talk about your story then, Ryan. So, very brief introduction at the top of the show, but yeah, tell us more about your journey. How did you get to where you are today? Did you always want to be a database researcher? Yeah, yeah. You know, like all four or five-year-old boys, I grew up with dreams of writing SQL queries and functional dependencies and all. So I got into computers at a young age. My dad worked in IT, introduced me to this wonderful programming environment called HyperCard. And I was hooked from day one, started programming quite a bit. I went to college, to the University of Arizona, decided to do an undergraduate degree there. And while I was there, I interned during several summers at Los Alamos National Lab with their high-performance computing team. And this was really the first time that a program that I wrote didn't complete instantly. It was the first time where there was a significant amount of
Starting point is 00:08:15 computes, there was a significant amount of data that the choices that you made really, really mattered. And that was very fascinating to me and very addictive, right? Because you have this program, the goal is to get the program to run faster. And so, you know, you do some work, you optimize some Fortran code, you change some MPI routines or, you know, whatever it is you do, then you run it again and you see 20% improvement. And that's a very both productive and addictive thing to realize. It's like, oh, I can make these performance improvements and I can do this and I can make the program faster.
Starting point is 00:08:43 And, you know, all the physicists who wrote the code are very happy with me and, you know, all that, all that great feedback loop, right. Software development and everybody that I worked with at Los Alamos National Lab, you know, they were all, they all had PhDs. They were all, you know, physicists or computer scientists. And they advised me, they were like, well, you know, if you want to continue your career here, if you want to keep going in this direction, the thing that you should do is you should go and get a PhD. It was sort of the de facto option. In retrospect, that's a really bad reason to get a PhD just because everyone tells you to, but it's what I did. So, you know, okay. So I applied to a bunch of programs and I
Starting point is 00:09:16 ended up picking a really small program outside of Boston at Brandeis University because it had a whole bunch of faculty who worked on not just high performance computing problems, but also on data problems. And I continued to collaborate with Los Alamos a little bit, as well as with Brandeis. And the two types of problems that we really found were at the core of these high performance computing challenges was first, the orchestration of data, and second, the optimization of the routines that access data. So the orchestration of data is like, how are you going to lay out the data that you need to access to? What's your strategy for, you know, in the HPC world,
Starting point is 00:09:54 doing stuff like checkpointing, et cetera. And then optimization is, you know, much closer to the traditional database problem of like, okay, now I have all this data. I know logically how I want to construct it back together. I know the steps of the computation I want to take, but what's my actual plan? When do I move the data to the compute? When do I move the compute to the data? When do I do all these particular things? And Brandeis and Los Alamos were awesome environments to investigate these problems in this kind of scientific computing regime. But as I went through my PhD, I came to appreciate the sort of beauty of relational databases a lot more. High-performance computing and scientific code
Starting point is 00:10:32 doesn't have the same... Well, the principles and the kind of underlying theorems of relational database systems are things like functional dependencies, things like the commutativity of the join operator, things like declarative languages like SQL, right? Whereas the first principles of high-performance computing in the scientific realm are things like physics and neutron transport, important problems, but much, much harder problems to grasp and problems that require you to have a national lab level of resources to address. So I started getting really interested in database systems at this point, because they provided that same kind of addictive feedback loop, right? It's like, okay, I have some workload, like TPC-DS, the join order benchmark, whatever it is, and I can execute that workload on a database. And then I can make some change to the database, you know, at the highest level, I could create a new index, or at the lowest level, I could implement some kind of new fancy hash join routine. And again, instant satisfaction. I see that my performance increases. I see that my numbers go down or numbers go up, whichever one you're trying to do, and it feels great. And that led me down this path at Brandeis, where we were trying to do these optimizations for cloud databases.
Starting point is 00:11:45 We were trying to figure out how to deal with elasticity, how to deal with when should you make your cluster bigger? When should you make your cluster smaller? We were dealing with things like scheduling. If you have a whole bunch of tasks, they have different affinities to different types of hardware. How do you match them up correctly? And we were dealing with problems like tenant placement, like which queries can I run concurrently
Starting point is 00:12:04 that don't use the same resources so they won't interfere with each other. And all three of those areas, you can get some really, really big gains by being smarter and smarter and smarter. But what we, what my advisor Olga and I realized as we kind of went through these optimization steps over and over and over again, was that the, while very addictive, while very easy to just kind of sit in the chair and make the numbers go down, the solutions that we came up with weren't really generalizing. The thing that made workload A go faster might not have been the things that made workload B go faster. The elasticity strategy
Starting point is 00:12:35 for an analytics customer looks totally different than the elasticity structure for a transactional processing customer, et cetera. And so we started to think about, okay, how can we get the computer to play this game for us? How can we use machine learning techniques or autonomics techniques or whatever you want to call them to sort of play this game of making the number go up or making the number go down for us in a way that automatically tailors its strategy towards the particular customer's requirements. And so this naturally led us to evolutionary algorithms. It naturally led us to a bunch of different machine learning techniques. And then the first thing that really fit, the first thing that was like, oh, this is the perfect match for the problem was reinforcement learning. So in reinforcement learning, you
Starting point is 00:13:19 abstract away this idea. You say that there's an agent who takes actions inside of some environment and receives some rewards, which is great. And we were very easily able to map our problem in database land on top of that reinforcement learning platform, you know, because we had spent so much time being the agents, we knew exactly what actions to take inside of the environment. And we had identified those reward functions. And it was it was a really magical connection to find, you know, it was really great. Brandeis was a super small school. So as soon as I kind of figured out like, oh yeah, maybe it's this reinforcement learning thing. I sat in my chair, I turned 90 degrees and I talked to the person across from me who I knew as a PhD student working on reinforcement
Starting point is 00:13:56 learning. And I said, Hey, like, what do you think about this? How does this work? Yeah. Super, super great. Surround yourself with machine learning experts. Yeah, advice to take away from this. But I'll kind of paper over it, you know; there was actually a ton of detail that had to be sort of worked out. There was a lot of science to be done, et cetera, a lot of valuable lessons learned. Maybe I'll talk about some of it later. But anyway, so that sort of concluded my PhD at this kind of orchestration level of the data, figuring out how to build these smart clouds, intelligent clouds, automatic scaling, whatever you want to call it. And from there, it seemed like naturally the next place to go was down the stack a little bit into the query optimizer and into more of the sort of lower-level algorithms that impacted the performance of the system, right? Because once you get the cluster sizing right and the
Starting point is 00:14:58 scheduling right, well, if you want to get more performance gains, now you have to change the thing the database is actually doing, right? So that led me to a postdoc at MIT with Tim Kraska and Mohammad Alizadeh, along with the other faculty there, where we started to look at how we could apply reinforcement learning to the query optimization problem. Wrote a bunch of great papers, wrote the Neo paper and the Bao paper, and learned just a whole lot about the actual technical problems you encounter when you try to integrate some of these ML techniques into big systems. You can't just go up to a production database team and say, Hey, we built this crazy reinforcement learning algorithm.
Starting point is 00:15:39 We have no guarantees about it. We have no worst case scenarios. We have no upper bounds, nothing like that. Why don't you go ahead and just let this loose on all of your customers? And they're like, ha ha, very funny, nice joke. So we had to reformulate a little bit. We had to change a bit from kind of the pie in the sky, craziest reinforcement learning, most advanced ML thing that you could possibly get, which we sort of embedded in our Neo paper. We had to make the switch towards, okay, what's something that could actually be
Starting point is 00:16:05 a little bit more practical, which is what led us to the Bao line of work. And I don't want to go into the technical details of either paper. You know, if we're interested, I can absolutely talk about it later, but yeah. And we did, I think we did a decent job because we were able to take those techniques
Starting point is 00:16:21 that we aimed to be more practical. We spun them off into a startup and that startup got acquired by Amazon. So we spent a little bit of time there. And then, you know, as things settled at Amazon, I left for the faculty position. So yeah, hopefully that story is interesting to someone at least, but yeah, that's the general trace of how things worked. Yeah, awesome. That's fascinating. It's great that you kind of had a taste of sort of turning something into a startup, because that way you've had the academic experience and then you've had the industry experience as well
Starting point is 00:16:51 and kind of, we might get your thoughts later on what you prefer, but I'm guessing by the fact that you came back to academia, I can kind of guess which one you prefer to do. But yeah, pros and cons, you know. Yeah, yeah. And the other thing I wanted to say, the Los Alamos, I mean, obviously because of Oppenheimer, that's the same place, right? Yes, yeah, yeah. Okay, well, that's pretty cool. Just a random observation, but yeah. Yep.
Starting point is 00:17:13 It's a much more developed town now than as pictured in the movies, you know. We have grocery stores and bars and restaurants and all those things. How modern, how modern. Cool. I guess that leads us up to date then. And so we can talk about your current work then and kind of next gen data systems and
Starting point is 00:17:29 what your lab is working on at the moment. What can we look forward to in the next 12 months? Yeah, yeah, absolutely. So the sort of current problems that my lab is working on is very directly motivated by a couple of the lessons that we learned at Amazon. So we actually have a paper coming out at the SIGMOD in Chile about all of the lessons that we learned at Amazon. So we actually have a paper coming out at the SIGMOD in Chile about all of these lessons that we learned at Amazon, including some actual data from production workloads. So I'm confident everything I'm about to say is okay. But one
Starting point is 00:17:57 of the things that really surprised us compared to coming from academia to industry was our assumptions about workloads. When we were in academia, we assumed that workloads were crazy, unmanageable beasts where every query was super novel. There was some, but very little repetition between queries. And it was mostly analysts sitting down at their keyboards, smashing away at SQL, sending that to the database, getting some results, looking at it, and then trying again. And when we got to Amazon, we were working on the Redshift database at Amazon, we discovered that we were essentially completely wrong. Almost all of the traffic, a large portion of the traffic that comes into the database actually is SQL, but it's not written by
Starting point is 00:18:38 hand. It's written by BI tools or dashboarding tools or some automatic tool that's sitting between the analyst and the database, interpreting what the analyst wants into SQL, getting the data from the database, and then displaying that information to the analyst in, you know, a chart or, you know, an analysis or like a t-test or, or, or something like that. So that was the first thing we were really wrong about, about workloads is that while humans writing SQL is important, and it's a case that databases need to support for a very, very long time, a whole lot of SQL today is automatically generated. The second thing that we were- Ryan, to jump in, what sort of fraction, because that's a fascinating point because I kind of
Starting point is 00:19:12 myself had that same sort of experience of kind of wondering about the workloads that we work with, TPC-H and so on. Are they representative of what people are actually doing out there? Is it just some kind of data scientists going wild at a terminal, hacking away? What fraction of it is machine-generated, at a very high level, versus sort of just user-generated? So what I'll point to, it'll get discussed in the SIGMOD paper, but just to make sure I don't say anything
Starting point is 00:19:38 I'm not supposed to, there was an analysis by Alexander van Renen and Thomas Neumann that benchmarked a whole bunch of different cloud analytics databases and took a look at the traffic in the Snowflake system. And their analysis found
Starting point is 00:19:50 that over 70% of the queries were machine generated. Amazon numbers are similar ballpark. Yeah. Yeah, yeah. So that was quite surprising. Yes, sorry, I interrupted your flow, Ryan. So lesson number two.
Starting point is 00:20:02 And things might look different for a graph database or for a transactional database. Although actually transactional databases, I imagine, are even more kind of procedural. Yeah. I don't know. Neither here nor there, but yeah. One big assumption that we were making in academia that was certainly untrue is that the main thing a database does is serve queries issued by humans. That is not true. Databases serve queries issued by machines. The interface on the other side of that machine is, you know, there's a user somewhere, obviously,
Starting point is 00:20:31 but there could be quite a few layers of indirection before you actually get there. The second big assumption that we had was about workload or query novelty. We assumed that if the user submitted a particular query, then the next query that they submit would be something either like totally unrelated or only slightly related or something like that, but that it would be different that the user wouldn't be asking like the same question over and over and over again, because, you know, you just got the answer to that question. Now you're going to ask a different one. Turns out this is like just radically incorrect. Analytics databases serve dashboards. They serve report generation. They are essentially
Starting point is 00:21:07 almost all of the workload is executing the exact same query on slightly different data. You know, inserts are coming in, data streaming into the database somehow. Some users might do a nightly load, some users stream data in live, et cetera, et cetera. But you have some dashboard, you have some reports, you have some process that you care about, some analytics that you want to compute, and you recompute that data periodically, frequently. Obviously, when the data hasn't changed, you just answer the query from the cache.
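(A toy illustration of the point above about serving repeated dashboard queries from a cache while the underlying data hasn't changed. This is purely a sketch, not how Redshift or any particular engine implements caching; the table-version scheme below is an assumption made for illustration.)

```python
# Toy result cache: reuse a query's result as long as none of the tables it
# reads have changed since the result was computed. Illustrative only.
class ResultCache:
    def __init__(self, execute_fn):
        self.execute_fn = execute_fn   # function that actually runs the SQL
        self.table_versions = {}       # table name -> monotonically increasing version
        self.cache = {}                # (sql, table-version snapshot) -> result

    def notify_write(self, table):
        # Called on insert/update/delete; bumps the table's version.
        self.table_versions[table] = self.table_versions.get(table, 0) + 1

    def query(self, sql, tables_read):
        # The key includes the versions of every table the query reads, so any
        # write to those tables automatically invalidates the cached entry.
        key = (sql, tuple(sorted((t, self.table_versions.get(t, 0)) for t in tables_read)))
        if key not in self.cache:
            self.cache[key] = self.execute_fn(sql)
        return self.cache[key]

# A dashboard re-issuing the same SQL only hits the engine after new data arrives.
cache = ResultCache(execute_fn=lambda sql: f"result of {sql!r}")
print(cache.query("SELECT count(*) FROM sales", tables_read=["sales"]))  # computed
print(cache.query("SELECT count(*) FROM sales", tables_read=["sales"]))  # served from cache
cache.notify_write("sales")                                              # data changed
print(cache.query("SELECT count(*) FROM sales", tables_read=["sales"]))  # recomputed
```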
Starting point is 00:21:35 Hopefully, hopefully every database does that. Maybe some do. If they don't, they certainly should. And then otherwise, you either have some incremental view maintenance, maybe you have some strategy to quickly update that query result and keep it fresh, or maybe you recompute it each time. So this was a huge shock to us. You know, when we were developing our reinforcement learning systems for query optimization, we
Starting point is 00:21:55 were assuming that we had this almost adversarial case of like, every user query is totally different. What you learned about the query workload yesterday might have nothing to do with the workload tomorrow. And there are some customers, it's a long tail, there are some customers where that's true. The exact number will be in that upcoming paper, but there is some single digit percentage number of customers where almost every query really is unique, but vast majority are these repeated workloads. And this is actually great news for ML for systems people, because it means that you get to try again. It means that if you make the wrong decision
Starting point is 00:22:30 the first time that you see a particular query and you get some feedback from it, the next time in the future, you're actually quite likely to see even that exact same query, verbatim, again later. And you have essentially another opportunity to optimize it. This is an implicit assumption in reinforcement learning algorithms
Starting point is 00:22:49 that your past determines your future. But something very exciting that our lab is working on that hopefully we'll be able to release a paper about too is making this assumption very explicit by actually saying, okay, I'm going to say, I'm gonna ask the user to tell me, hey, these are the queries that I care about repeating. These are the queries that I'm going to run every single day. I want you to spend some time optimizing those. And a really interesting opportunity presents itself here.
Starting point is 00:23:14 When the user tells you, the database owner, I'm going to run this query 100 times a day for at least a year. So I know I have at least 365,000 executions of this particular query. Normally query optimizers, they got to be fast, you know, a hundred milliseconds max, preferably much, much less than that. But if you're going to execute a query nearly half a million times, it might be acceptable to tell the user, okay, I'm going to spend five, 10, 15, 20 minutes optimizing this particular query to make sure I really, really, really get it right. Because then when you re-execute this plan over and over and over again, you're going to have the, you know, you're going to have the right plan. You're going to make up
Starting point is 00:23:53 for that 20 minutes by a long shot. I mean, we're calling this problem learned query super optimization. You know, if I have a whole bunch of offline time allocation, not a whole bunch, you know, something on the order of five times the query latency, right, so I can try a couple of different plans, how much better can I do in my query optimizer than in the sort of one-off type of situation? So that's what our lab's working on kind of near term. Yeah. Yeah. A short-term pain for some long-term gain. And I guess there is a tipping point, and it depends on the computation, right? I guess, potentially, kind of, I don't know, it depends. There must be a point where we go, okay, no, we can't take this approach, we have to do it another way, because, I don't know,
Starting point is 00:24:33 it might need 90 of those iterations to pick the right one, I'm not too sure, so it might take too long. But I'm sure the paper will explain it all. Yeah, yeah, there's two fundamental problems, right? One of them is sometimes you add more search time but you don't get any better quality, right? And then it's like, well, now I'm just wasting cycles, which in cloud speak means you're wasting dollars, right? Which is not great. So you need to know kind of when your search is done, which is hard to do because there are way too many plans, right? The other issue is you don't know how long your query plan is going to be good for. So, like, your query plan might be great on the current distribution of the data, but in six months, you know, if I'm a startup and now my
Starting point is 00:25:08 customer base has quadrupled, hopefully maybe now the distribution is totally different, right? So you have two problems there. One, can you tell whether or not your query plan has gone stale, whether or not you shouldn't use that same query plan anymore? Can you even notice when it happens? And then if you can notice when it happens, which, which we think the answer is yes. What can you do about it? How can you adapt that super optimized query plan to changes? So yeah, there's a question of, is the squeeze worth the juice?
Starting point is 00:25:32 And then there's a question of, how long does the juice stay good, right? Please tell me one of those phrases has made it into the title of the paper. I mean, it's so much fun you could have there with that one. It feels like it could be, yeah. We made it into the intro. I don't think they made it into the title.
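(A back-of-the-envelope version of the amortization argument behind the super-optimization idea discussed above. All of the numbers and the function below are illustrative assumptions, not figures from Ryan's work.)

```python
# Back-of-the-envelope check: when does spending extra offline search time on a
# plan pay off for a query that will be re-executed many times? Illustrative only.
def net_seconds_saved(executions, baseline_secs, improved_secs, search_secs):
    """Total seconds saved across all executions, minus the one-off search cost."""
    return executions * (baseline_secs - improved_secs) - search_secs

# Example: a dashboard query re-run roughly 100,000 times while its plan stays
# valid, sped up from 10s to 8s by a 20-minute offline search.
saving = net_seconds_saved(executions=100_000, baseline_secs=10.0,
                           improved_secs=8.0, search_secs=20 * 60)
print(f"net saving: {saving / 3600:.1f} hours")  # roughly 55 hours

# The catch raised above: if the data distribution shifts and the plan goes
# stale, the per-execution saving stops accruing (the "how long does the juice
# stay good" question).
```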
Starting point is 00:25:48 I always like a good funny title. But yeah, awesome stuff, I look forward to those, they sound really, really interesting. And the next sort of section of the podcast, I kind of wanted to, we mentioned it a little bit before when you were telling us your story, but kind of look back, kind of a bit of a retrospective of the papers. And you've mentioned things like Bao and Neo over the course of your career so far. And yeah, I guess the first one is, which one are you most proud of in your career? Is it the Best Paper Award? Um, so yeah, there's a funny thing about best paper awards, and, you know, we were obviously very, very thrilled to get it for the Bao paper. You know, that's an exceptional honor, to get it in front of the whole community that you've been working with for decades. But we find that the Bao paper got rejected more times than any paper I've ever written.
Starting point is 00:26:33 I think it was rejected three or four times before it got in. And the reason it got rejected so many times is because on the one hand, you had these traditional query optimizers that had very, very low latency, relatively low complexity in the sense that they were already implemented, but kind of just okay performance. Then on the other extreme, you had these Neo-like systems, these crazy learned optimizers that use deep learning and all this fancy stuff and got these huge, massive improvements, but had all these operational difficulties, all these operational challenges. And with Bao, we were trying to hit the sweet spot. We were
Starting point is 00:27:08 trying to say, okay, how can we take advantage of all that infrastructure that exists today? All of the, you know, literally probably a century of human effort collectively that has been put into these query optimizers, not throw it out entirely, but, you know, build on top of it to steer it in a way so that we get some of the gains of those crazy ML systems without the massive operational complexity. And reviewers, a lot of them said, like, okay, now you've lost the advantage of the crazy deep learning thing because you're no longer getting as good query performance. And you're also adding this operational burden on top of my traditional optimizer, right? So this is not the best of both worlds.
Starting point is 00:27:47 It's the worst of both worlds. And it took us a very, very long time to kind of get the story right, get the tech right, to make that really happen. And the symptom of this, the thing that you can observe to confirm my story, is that the Bao paper came out, and then six months later, our collaboration paper with MSR applying it to their SCOPE system came out. And obviously we didn't publish the Bao paper and then get all that work done in six months, right? Like, we were really, really trying to push it forward. So yeah, I am definitely proud of the Bao work, but probably not because it got the Best Paper
Starting point is 00:28:21 Award, although I am very thankful for that. I'm mostly proud of it because our whole team, everyone who worked on the Bao paper, was really excited about it and really pushed it through to an extent that I would have never imagined. They put a lot of faith in me and in my ideas. And then not only did they help me execute, but they went off on their own and executed. Parimarjan Negi, who's a just-graduated PhD student from MIT, was one of the people on that paper, took it to MSR, applied it to SCOPE, built it up there, and it went super well. Christoph Anneser at TUM, who didn't work on the original Bao paper but was talking with our team at MIT at the time, took that idea to Meta, implemented it in this thing called AutoSteer, wrote a paper about it. And just seeing, yeah, I think the thing
Starting point is 00:29:00 that I'm most proud of is seeing other people take up that idea and apply it to, you know, real systems and produce kind of amazing results with it. That was really awesome to see. I mean, what validation of your idea, that someone else has gone off and, like, actually used it and kind of applied it to different systems. And talk about impact as well, right? I mean, we always, I always talk about impact on the show and ask people about the impact of the work, but I mean, how do you get much more impact than that? Right. I mean, that's pretty,
Starting point is 00:29:34 pretty ticks all the boxes almost. Right. Yeah. Yeah. I just, I, you know, the, unfortunately, every time someone uses this, the system in these industrial context, my H index doesn't go up. So I got to convince my tenure committee. But other than that, yes, I do think the gold standard of impact, and this also goes back to what Mike Stonebraker was saying in his 2016 talk about the hollow middle, is that the real validation for these ideas, we can play around as much as we want at SIGMOD and VLDB and CIDR and ICDE, and we can create crazy stuff that maybe will work in 20 years. The real validation of the idea is the mission. We want to make data accessible to everyone. We want to make it cheap and affordable
Starting point is 00:30:17 enough so that people who couldn't use data management systems 10 years ago can use them now. And the best validation you can get for that mission is by having it happen, right? By reducing costs, by creating new capabilities that didn't exist before, and yeah, by deploying into the real world. Yeah. Awesome. Yeah. That's fantastic. I guess we can maybe talk a little bit about, I have a section here written down, motivation and inspiration, right? So, I mean, obviously there's the blog post that's kind of brought us together today about the most influential database papers. So, yeah, kind of what are your favorite papers and which ones have had the biggest impact, you think, on your career over the years? Yeah, yeah, sure.
Starting point is 00:30:55 Okay, let me start with a story that will initially sound off topic. Okay, go for it. So when I was first kind of getting to know the database community, um, I noticed a lot of name dropping. I think it's a problem our community has, you know, people will say like, oh yeah, you know, that Jim Gray paper. Oh yeah. You know, Thomas Neumann's new database system.
Starting point is 00:31:17 And like, you're just kind of supposed to know what those things are, right. You're just supposed to like learn at some point, like who Thomas Neumann is and all of these people. And you do very slowly over time, but you know, you kind of like make some stupid mistakes along the way. I once asked, so I was once asking a conference speaker at SIGMOD when I was a very new PhD student and he did, the speaker was talking about the cost of crowdsourcing and how getting more and more humans into the loop kind of added more and more and more to your cost.
Starting point is 00:31:49 And I raised my hand and I asked a question. I was like, oh, well, you know, don't database systems oftentimes actually account for this? Like, for example, the CrowdDB paper, are you familiar with it? And the speaker was Mike Franklin, who was of course the first author of that CrowdDB paper. But, you know, I just didn't know who they were or what they were about. I mean, it's a name on the paper at the end of the day, right? I mean, not necessarily, like, there's not a little picture there that says this is what Michael Franklin looks like. Right, exactly.
Starting point is 00:32:16 I think you can be forgiven for that one, right? So, yeah, a little embarrassing, but, you know, it's fine. We move on. So the other thing that I really wanted to do with this blog post, besides trying to answer Stonebraker's question, was also to try to create a tool for people to kind of explore not just the citation graph, but also sort of the more prolific names that are out there in database systems. So, you know, one of the things you can do on that particular blog post is you can click on someone's name, like Pat Selinger, for example, and you can see a list of all the papers that they wrote, kind of sorted by this particular ranking. So if you want to know, oh, who is Sam
Starting point is 00:32:54 Madden? And what is he famous for? If you click on Sam Madden's name, you'll see kind of like the top particular papers that Sam wrote. And I found, I think, through this process over many years, because I wrote the code a long time ago and didn't get around to actually writing it up into a blog post for much longer than that, I found probably my two favorite papers in databases ever through this technique. So the first paper that I found was a Viktor Leis paper called How Good Are Query Optimizers, Really? This is a classic. I read it at the very start of my work in query optimization. It basically goes through: what are the biggest problems in query optimization? How do cost models and cardinality estimators interact?
Starting point is 00:33:36 Why do cardinality estimators constantly underestimate? Is there some simple correction? And it really lays the groundwork. I think anyone who wants to do research in query optimization or who wants to understand query optimization papers should stop whatever they're doing and go read that paper by Viktor, because it's really phenomenal. And then the second paper, the idea behind this paper was evaluating a whole bunch of different concurrency control techniques with an eye towards future systems that were going to have many, many, many cores. I think they give some estimate in the paper about how many. Yes, a thousand cores, right? Yes, a thousand cores. Yeah, yeah, yeah. Which has been achieved, which is a thing.
Starting point is 00:34:20 You can buy a node with that many core now, not even on a GPU. You can buy an x86 node with that many cores. And that paper is also extremely insightful. You know, I'm not an OLTP researcher, but this paper shone so much light on the trade-offs and differences between these concurrency control algorithms that I had never thought of before. They really illustrated the trade-offs and scaling points. See, I think both of those papers I discovered through this sort of data-driven method. And I'm very, very glad that I did because they're both unbelievably excellent.
Starting point is 00:34:54 They're also very famous papers. So, you know, you probably could have discovered them by talking to experts as well, but it's nice that, you know, the two things connect. Yeah. Yeah. Yeah. It's funny what you said earlier on about the sort of, like, people, you know, the new Thomas Neumann system or things like that. I remember, it was my first SIGMOD and I was still relatively sort of green, and so it was the ARIES paper, and someone said, you've not read the ARIES paper? And I was like, what do you mean? Was I meant to read it before I came?
Starting point is 00:35:21 Like, was this some sort of, are they going to test me when I try and get in? Like, is it going to be an exam? But yeah, no, it's funny. Because I mean, there's that much literature out there, right? Like it's hard to sort of cover everything. But yeah, no, it's just funny. I had a similar sort of experience with that as well. But yeah, cool.
Starting point is 00:35:37 I guess switching it up from papers to people then. So kind of the same question, but what sort of people have had the biggest impact on your career then? And kind of, like, what's the best advice you've had along the way, and things like that? Yeah, yeah. So I probably got the best and the worst advice in my career from the same person. Okay. So a guy named Larry Cox, who, he's retired now, used to work at Los Alamos National Lab. He's a physicist,
Starting point is 00:36:05 worked on particle transport stuff and that kind of thing. The worst piece of advice that I got from him was, oh yeah, you've got to go get a PhD, it's just a nice thing. And now, that happened to work out very well for me, but in retrospect, not so great. But probably the most important thing that he taught me is that when somebody asks you a question about your work, or someone challenges an idea that you have, that is not something that you should defend against. It's not something you should view as adversarial. It's something that you should view as an opportunity for you to understand your problem better and potentially start a collaboration. So, you know, I think it's very common, especially in kind of conference talks, where someone will ask a question that's clearly very pointed.
Starting point is 00:36:47 And sometimes the speaker will engage with that question in sort of a debate like style, right? They'll try to make their points and their counterpoints and prove their original argument. And what Larry Cox taught me is that that's not really the best approach. It's much better to try to identify the weakness that that particular individual is pointing out and try to see, is this a weakness in my technique? Is it a weakness in my storytelling? Where is this objection coming from? Where have I failed as a communicator that this person is having this particular bit of confusion? And then if you do notice that it's in your storytelling, can you figure out how to improve it? How can I tell my narrative better? How can
Starting point is 00:37:22 I rewrite my introduction? How can I change my slides so that it's more obvious to this person who's clearly smart, but who I haven't been able to communicate the idea to. And if you realize it's a technical issue, then, you know, the best thing to do is to say, Ooh, that's a really good point. We hadn't thought of that. That's a great direction for future work. Maybe we should work on it together. And now we can figure this out. Yeah. Yep. Yep. Exactly. And now you've gone from a semi-adversary to a potential collaborator, right? Big shift. But I think the specific advice is good, but the mindset is the most important thing, right? To think about challenges and rejections to your work, not as like judgments on you, but judgments on your work or on your ideas, but instead on judgments on either how you communicated that or as opportunities to improve the technical side of your work. And that sort of thinking was critical,
Starting point is 00:38:09 not just for my own kind of, like, mental health getting through grad school, as a way to think about it, but also as a way to interact with the community and to form productive relationships. So yeah, I think best and worst advice from him, my former boss at Los Alamos National Lab. That's a really nice piece of advice, and having to detach yourself from the idea, like, you're not your ideas, right? They're just not. I mean, we all get emotionally attached to them, right, because they're something we've kind of brought into the world and nurtured, and getting up there is daunting, right, to give a presentation, and as soon as someone sort of questions you, it's so easy to go on the defensive. But having that mindset, like you say, leads to good collaboration and just, yeah,
Starting point is 00:38:49 improving mental health as well, right? Because if you got upset about every rejection or every negative comment, I mean, yeah, you can quickly go insane, right? So, but that's a nice segue actually into the next sort of section, and that's about setbacks. And you mentioned actually the Bao paper getting rejected three or four times. And how do you, like, obviously you've got this mindset that probably helps you deal with these setbacks and rejections, but it's not always easy to put those principles into practice and kind of deal with them, the setbacks. Yeah. Yeah. I think as a PhD student, and a postdoc, and now starting out as faculty, I have sort of a threefold rejection-handling technique. So the first rung of it is exactly that advice that I got from Larry.
Starting point is 00:39:35 Like when your paper gets rejected, it's not because your idea is bad. It's not because you're bad. It's not because you did bad work. It's probably, you know, 90% of the time it's because of how you presented it. It's because, you know, some part of the story isn't clicking, isn't being communicated to other people. And so the way that I think about it is that like the goal of a paper is to convey intuition to the reader, right? So like you did some work, you spent like a year running some experiments or something like that. You write this 12 page paper, someone should be able to read that 12 page paper and, you know, one to three hours, and then they should gain all that experience that you did over the course of a year, right? So you're compressing your year of experience into this nice paper. And that's how science moves forward. So that's why I can read a paper and I
Starting point is 00:40:17 don't have to spend a year, you know, refiguring out physics or Newtonian gravity or, you know, anything like that, right? So when your paper gets rejected, that's not really saying, hey, your idea is bad. The last year that you spent wasn't useful. What it really means is the way that you've conveyed this to me doesn't make me feel like I skipped a year on the bench. It doesn't make me feel like I gained some insight or understanding that wasn't there, which doesn't mean that there isn't something to learn. It means it wasn't conveyed properly. So you need to go back to the drawing board and you need to think, okay, what did I really learn? What's the real intuition behind my understanding? And how can I convey that to another smart person? So yeah, that's
Starting point is 00:40:52 the first kind of level, which I think is often paraphrased as, like, your paper gets rejected, not your idea, which I think is the right way to go. The second level that I try to think about with rejection is that it's a numbers game. The more often you submit, the more often you do good work, the more likely you are to get rejected, but also the more likely you are to get accepted. This is a bit of a dangerous one because, if you follow this advice to its logical conclusion, you end up with these minimum viable papers where you just try to submit as many things as possible. And, you know, then you flood the committee, and then Mike Stonebraker says something mean about you at a New England Database Day talk or something like that. Right. So for me, it's more like a quality times quantity situation,
Starting point is 00:41:39 right? The better you can make your paper, the higher the odds are that it's going to get in, it's going to get accepted, but you should parallelize across time, not through time. You should say, okay, I might have to submit this paper many, many, many times, making small, tiny improvements to it each time, but it's going to get better and better and better. And that's much better than throwing a bunch of things at the wall and seeing what sticks. So suppose that your paper is so good that any reasonable reviewer who sees it has an 80% chance of accepting it, right? Well, okay, there's three reviewers, and if one of them says reject, the other two, you know, don't have the energy to argue for you. So, you know, actually your odds are only around 50%. So maybe you have to submit it one or two times, right? And, you know, that kind of idea of turning rejection into an opportunity for refinement, I think, is the sort of second level. And then the third level is pure cognitive dissonance,
Starting point is 00:42:32 is pure just say like, okay, if a paper gets rejected, that's because maybe the reason that paper got rejected is because the idea is too new, it's too crazy, it's too potentially impactful and awesome. And right. And you kind of have to convince yourself a little bit that for some reason you want that, but the reviewers don't, which is clearly not the case, right? Like clearly your, the incentives are aligned, but, but, you know, sometimes it helps just kind of at that very, very base mentality of like, oh man, I just got to tweak the story a little bit. And they're going to see how amazing and crazy and impactful this idea is going to be. So yeah, I think that's the main way that, that, that rejection goes.
Starting point is 00:43:08 And now I'm in an interesting situation where I don't just have to deal with it myself, but I have to prepare my PhD students for it. And I have to kind of show them the ropes, and hopefully I can even give them a healthier attitude towards rejection than I have, right? Hopefully, like, things can improve. And, you know, nobody teaches you how to do that. Nobody teaches you how to teach getting rejected to others. You sort of have to just kind of think about it and make it work. Yeah, yeah, there's not a session on that, is there, in, like, training, on how to deal with it. But no, I think that's a really nice way to approach it. I like the three stages of rejection rather than the five stages of grief.
Starting point is 00:43:45 We've got the three stages. Maybe I can apply those stages to my Tinder profile and my Tinder activities as well. Yeah, there you go. Yeah, there are opportunities for improvement every swipe in the wrong direction. Yeah. Cool.
Starting point is 00:43:58 So an interesting point there, but obviously you've been a PhD student and now you're on the other side of the fence, so to speak. And with the kind of creative process, they're probably two very different experiences depending on where you are in your career, right? So kind of, how do you today approach idea generation and selecting projects and helping people do that, versus when you were a PhD student? Has that process changed
Starting point is 00:44:23 over time? Yeah, yeah, absolutely. So, you know, when I was a PhD student, I'd say in like my first two years, I thought that being a good PhD student meant that you produced a lot of code that was really complicated and did lots of cool stuff. And then, you know, I slowly realized over the next, well, over the full five years, that really what it is about is, you know, conveying the intuition behind complex ideas through writing, right? Like, that's the main thing that you're going to be doing. You can have the greatest database system in the world; if you can't explain how it works to another human being, it's useless, right? Like, it just doesn't matter. And so I think as a PhD student, I was very wrongly concerned with paper count, citation count, these sorts of bibliometrics
Starting point is 00:45:06 of like, you know, oh, how much did I publish? How can I get my, you know, next paper in or whatever? What's the smallest change that I can make to this piece so that I can add another SIGMOD paper to my CV or whatever, right? And, you know, my advisor pushed back against it, but, you know, it's a metric, right? And where there is a metric, you will optimize for it and make the number go up, right? Like, that's what we want to do. But, you know, towards the end of my PhD, I started to realize that that's not really how it works, right? That's not really where these things come from. So towards the end of my PhD, I started taking what I'm now thinking about as a technique-oriented approach,
Starting point is 00:45:44 basically saying like, okay, what's a cool technique that hasn't been used in database systems? And how can I take that hammer, everything else is a nail, figure out what to do with it, right? So first it was like, okay, reinforcement learning. Let's see what we can do with that. Oh, game theory. Let's see what we can do with that.
Starting point is 00:45:58 And it's a hit or miss strategy. Sometimes you find something that works. Sometimes you find something that doesn't work, but it's not because of your process. It's because you happen to get lucky finding a good match between those two. During my postdoc and after my PhD, I switched to a phase that I'll call was more problem-oriented. It was like, okay, how can I identify a pain point, something that isn't working super well right now, something that could be faster, something that could be better, and then how can I build a metric around that thing? And now I'll search my library of techniques that I have developed to try to find the best
Starting point is 00:46:31 technique to push that metric in the right direction. And then, you know, this was a much better approach than open your algorithms textbook to a random page and see if that's been implemented in a database yet. But, you know, it's still sort of the problem with the problem-driven approach is that you tend to be very incremental, right? You tend to target something pretty small, something that's known about now, something that exists there. So now I'm trying to switch, and I can't really endorse it yet because I don't know if it's going to work, to what I'm trying to consider an idealized version of a database. Basically,
Starting point is 00:47:06 instead of saying, what are the problems that exist now? Instead, I'm trying to imagine, okay, what would I do if I had the perfect database? What new problem would I solve? What new capability would I have? What new businesses could I launch if the database was just, it just worked? Zero query latency, everything hundred percent perfect. Right. And then work backwards from there. Right. So identify like, oh, okay, well, if I had the perfect database that worked just a hundred percent of the time, then I'd have analytics for everything. I'd make all of my decisions data-driven, right? There's, there's, there's no cost to it. I can just collect all the data quite simply, et cetera. Right. Then like, okay, when I teach a 400 person class, you know,
Starting point is 00:47:45 forget the little canvas plot. I'm going to go all out, right? I'm going to build dashboards. I'm going to do, do everything right. And then working backwards from there and kind of saying like, okay, how can I, A, create that perfect system, but identify the problems in such a way that lead me to that new application, that lead me to that new capability, to that new thing that I want to do. We'll see if it works um ask me again in six years when it happens and then we'll see uh we'll see if that's a good way to think about it or not i like it like working back from the glorious future right it's like kind of yeah i like that but with each of these sort of three phases we kind of aware that you were taking that approach
Starting point is 00:48:18 it's like a conscious decision that you're like okay now i'm in now I'm in my problem-oriented. My problem era? Yeah, I have a problem era. I wasn't aware of it as a PhD student. I wasn't aware of the way that I was picking problems. It was just sort of what I was doing to push the envelope, you know, to have something to talk about in the next meeting, you know, that kind of thing. When I was a postdoc, I kind of realized the error of my ways, but I didn't really know what to do about it yet.
Starting point is 00:48:45 And one of the great things about MIT is that there are problems everywhere. Like there are so many people crammed into that data center building that you just kind of walk around the hallway and you listen to what people are saying and you're like, oh, okay, this is an unsolved problem. This is an unsolved problem. And this is an unsolved problem. So there it was just sort of working with what I had, right? Well, okay.
Starting point is 00:49:03 I have all these great unanswered questions. Let's, let's answer them. And towards the end of my postdoc, I definitely realized the limitations of that approach. I saw how it could be more incremental rather than revolutionary. And so I guess now as, as faculty is the first time I'm very explicitly being like, okay, this is the new way that I'm going to identify things to work on. So no, I guess to, to answer your question quite directly. I didn't know at the time, but I became more and more aware of it as time went on. Nice, nice, cool.
Starting point is 00:49:31 Yeah, so the next question is, it's kind of the mission of the podcast, actually. It's bridging the gap or helping to further bridge the gap between academia and industry. And obviously you've had experience in both camps with the startup so kind of what is your take on the current interaction between academia and industry and how can it be improved basically what are the problems with it and what's your take on it i guess
Starting point is 00:49:57 yeah oh boy you know those memes where it's like a two by two grid and like on top it's like what i think i do what they think i do what i think i, what they think I do, what I think I do, what they think I do. And so I think academia has a perception of industry as like, you know, these like profit driven, like. Crazy capitalist monsters are just. goblins, you know, like absolute caricature of like no interest in developing scientific understanding, only caring about the business bottom line and like never willing to share, you know, any data or like work together at all because they just want to minimize risk and maximize profit. Right.
Starting point is 00:50:35 And then I think industry has this view of academia is like this, you know, hoity toity ivory tower, like kind of like, oh, wow. You know, spherical cow kind of like, you know, you're gonna, you're gonna sit and look at a blackboard for a year, and you're going to come up with some theorems that don't really help me like, improve my problem at all, right? Like you have no interest in practicality, you have no interest in addressing the problems that actually exist, right? And so it's like, to an academic, the ideal industry partner is like a data repository, right? It's just someone who says like, hey, here's a bunch of workloads.
Starting point is 00:51:06 Let me know if you find anything cool, right? And to someone in industry, the ideal academic is someone, is like an unpaid intern, right? Is someone who they say like, here's my bug report backlog. Like, let me know, like, if you want to fix any of them, right? And like, obviously the incentives are such that neither group is going to like perfectly be happy. Right. So I think the really, really successful collaborations between academia and industry is when the academic incentives and the industrial incentives are aligned. So, you know, MSR has been the model
Starting point is 00:51:36 of this for a very, very long time where it's like, yeah, okay. You know, we have like the fun house where all the researchers stay and they like do whatever crazy thing they're going to do. And every once in a while they're expected to surface something that improves a product right but there are other successful models as well that don't necessarily exist inside of msr one of them that's been very successful for me in in collaborations with with meta is trying to identify pain points that are academically interesting right so like query optimization is a great example where it's a well-contained problem with a measurable objective outcome. And you can say like, you can de-risk it for the industry side by saying like, hey, look,
Starting point is 00:52:13 you know, we'll, our PhD students will come intern for you or like all consult for you for a bit. You know, we'll sign all the NDAs. We'll do all that stuff that you need to protect yourself and your data. And we have a quantifiable metric and, you know, after a semester or two semesters or whatever, we'll be able to look at that metric and decide if we like where we're going. And if the answer is yes, then we continue. And if the answer is no, then, you know, we walk away and kind of no hurt feelings, right? So I think the most successful collaborations come when those incentives are aligned. That said, I think there are very big kind of day-to-day differences between how people in industry approach problems and people in academia approach problems. And I think both of them could be improved by some influence from the other. need some grounding, need some practical push to solving problems that are more realistic. The 10,000th paper making TPCDS go 1% faster is fine and it will get you out the door and you can get your PhD and all that stuff. But it's understandable that industry is not particularly
Starting point is 00:53:17 interested in this. And I think industrial partners will have more productive collaborations with academics if they can recognize that the problems that those academics are interested in are at the core of their business, right? They do matter. And it just takes a little bit of window dressing and a little bit of persuasion to sort of get the academics kind of aligned with what that particular thing is going to be improved. And sometimes it doesn't work out, right? Sometimes your industry has an organizational structure where it's like, you know, shit flows downstream and it's like,
Starting point is 00:53:48 hey, you've got to solve these tickets. And then the engineering manager is like, hey, researcher, can you solve this ticket for me? And the researcher is like, I have no idea what this is. And like, you know, you have a mismatch, the wrong person for the wrong job. But I think if you, I think if you can avoid that and I think you can get the incentives aligned, you can have productive collaborations. Yeah yeah i like the viewpoint of thinking of aligning incentives it's a very sort of economist way of thinking about things cool yeah i guess penultimate question now ryan so this is about current trends and where you think oh kind of what things have you seen recently current and exciting advancements that you've observed that you think kind of like, okay, that's really cool.
Starting point is 00:54:26 Or yeah, what's your take on current trends? So the hype cycle in databases, right? It's like, first, there's some papers at Sigmund that are totally crazy, far out ideas. This will never work. We're never going to implement it. Niche things, they all get put into that one potpourri section of the conference where everyone gives their talk and like, you kind of hear about them and it's like fine. Right. And then suddenly, like four years later, half the conference is whatever those topics are.
Starting point is 00:54:52 Right. And then three or four years after that, there's industry adoption and the thing becomes what it is. I started out my research journey at the first stage of the ML for systems bits, where all of the ML for systems researchers could fit in half a conference room at SIGMOD. And it didn't matter if you were doing ML for clouds or if you were doing ML for neural network training, you were in that same conference section. And now we're definitely in the second phase of that. We're definitely into the point where half of SIGMOD, half of VLDB has learned in the title, has instance optimized in the title, has some ML component. And I think the industry adoption is also starting. So I think we're right in between two and three on that current wave, right? And eventually, so, you know, if I were somebody in industry right now, or someone looking to do a startup, I would definitely be looking at those ML techniques. They're starting to be mature. They're starting to be usable. They're starting to be good open source artifacts.
Starting point is 00:55:47 In terms of what's coming down the pipeline, what's in stage one, it's hard to identify in stage one because a lot of trends die there. A lot of things exist for a while in that stage, and they never quite make it to stage two. But I think the most clear stage one thing right now is quantum computing and database systems. Now, I should premise this. I know nothing about quantum computing. I did some physics work at Los Alamos. I vaguely know what a qubit is.
Starting point is 00:56:11 I know what spin is. I know what those things are. The quantum computer people talk. I don't understand it. I have a lot to learn there. I have a lot to get going there. The current results are absolutely in that phase where people are like, oh, kind of a weird edge topic, kind of like we don't really understand what it is. But if I take those researchers at their word,
Starting point is 00:56:29 which I think is reasonably good, it has a lot of potential. So maybe that's a trend worth betting on. Maybe it's not. I'm not 100% sure what's here or there with that particular thing. The second topic that I think is in stage one that I'm a lot more bullish about is kind of either updatable or out of place learned index structures. So, you know, obviously learned index structures for static data or data that is sorted is well into its second stage by now, but a true kind of replacement for the traditional B-tree structures with all the locking, with all the caching, you know, with all the concurrency access, all of those kind of fine-tuned packages is finally starting to look like it could maybe be a thing. A couple of years ago, I think the first paper that kind of put it out there was the Bourbon
Starting point is 00:57:16 paper from Wisconsin that kind of showed like, hey, this learned index structure thing and this LSM tree thing are kind of highly related. And there's a couple of new interesting works that are tree thing are like kind of highly related um and there's a couple of new interesting works that are coming out that are kind of showing that the relationship between the sorted nature of your data and the compression in your index structure is a lot more negotiable than we originally thought it's a lot more degrees of freedom than just your data must be fully sorted so i'm a little more bullish on that one but if i i wish i could give a better interpretation of the quantum stuff but yeah i i saw one of my friends in my in the same lab as me he did his phd on quantum computing and it was he occasionally get up at a seminar to
Starting point is 00:57:54 give a talk about and i think i got past the title that was like after that i was like what right now he's so confused sounds amazing and like he was like potentially very very revolutionary like it's going to change the game but yeah it felt a long way it felt like kind of very early stage one kind of five years ago yeah and i think it will get there eventually right but it's a case of when not if i think with it but yeah i don't think my brain's capable of understanding it i think that's the limit for me but yeah i actually think it's unclear whether or not it'll be in our lifetimes. Like I think the hardware is pretty complicated to produce from a physics point of view.
Starting point is 00:58:30 But yeah, yeah. You know, if you asked me about large language models in 2015, I probably would have given you a similar answer. So, you know. Yeah, that's true. You never know, right? Everything, the world moves in mysterious ways, I guess. You never know what's next. If we could predict it, I mean, we'd all be a bit richer, right?
Starting point is 00:58:46 If I could have... I'd be in finance, that's for sure. Yeah. Cool. So I only had kind of one more question. That was again about sort of the future and sort of promising direction to future research. But I think we probably covered it off
Starting point is 00:58:56 with that answer there. So I guess we can end things there then, Ryan. Yeah. So thank you so much for talking to me today, Ryan. I'm sure the listeners will have thoroughly enjoyed it as well and where can we find you on on socials if you want to kind of see what you're tweeting or anything like that where can we find you oh yeah sure you can find me on twitter at ryan marcus it's my name and you can find me on the good old-fashioned internet at r marcus dot info fantastic and yeah if you do enjoy the show please consider
Starting point is 00:59:26 supporting us through buy me a coffee every note and we'll see you all next time for some more awesome computer science research Thank you.
