Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Andreas Kipf
Episode Date: July 15, 2024

In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities". Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

Papers mentioned on this episode:
Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR'19)
The Case for Learned Index Structures (SIGMOD'18)
Adaptive Optimization of Very Large Join Queries (SIGMOD'18)

You can find Andreas on: Twitter, LinkedIn, Google Scholar, and the Data Systems Lab @ UTN.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
As usual, Jack here to take you on this journey.
Before we do start, I'd like to give a shout out to our sponsor, Pometry.
Pometry are the developers behind Raphtory,
the open source temporal graph analytics engine for Python and Rust.
Raphtory supports time traveling, multi-layer modeling,
and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal motifs mining.
It's blazingly fast, scales to hundreds of millions of edges on your laptop,
and connects directly to all your data science tooling, including Pandas, PyG, and Langchain.
So go check out what the Pometry guys are doing at www.raphtory.com,
where you can dive into their tutorial on their latest release.
On to the show. So today is another installment of our high impact series. For the listeners who
are new to this type of episode, it was originally inspired by a blog post by Ryan Marcus on the most
influential database papers. And today we are going to be speaking to Andreas Kipf, who has
the number one spot for 2019 with his paper on Learned Cardinalities: Estimating Correlated Joins
with Deep Learning, which was published at that year's edition of CIDR. Andreas is a professor
at the University of Technology Nuremberg. Before that, he worked as a senior applied scientist at AWS in their
Learned Systems Group, where he was one of the founding members. Before that, even, he was a
postdoc at MIT's Data Systems Group. And before that, he did his PhD at the Technical University
of Munich. Andreas's research interests, you can correct me if this is wrong, Andreas, but you
like to improve systems with machine learning, with a focus on index structures, storage layouts, and query
optimization. Welcome to the show, Andreas.
Yeah, so glad to be on the show. Thanks, Jack, for the invitation. Yeah, looking forward to our discussion.
Fantastic. So I gave you a very high-level sort of overview of your career so far there when I
introduced you, but help us color in between the lines a little bit and tell us more about what
that journey was actually like, and, yeah, kind of what led you to become a researcher.
Yeah, so I would go back a few years in time. So during my graduate studies, like during my master's at TUM, I did a master's thesis at UC Berkeley.
So that was like sort of an exchange I did.
And I was part of a collaborative research group over there.
And I was really inspired by, you know, how people approached research.
So my advisor back then was Professor Eric Brewer, who was also at Google at the time.
So, yeah, I was very inspired by, you know, how people worked in groups on collaborative
projects, not only at UC Berkeley, but also with people at UW in Seattle.
And I really enjoyed that part. So yeah, so basically, you know,
I changed my mind a bit because in the very beginning, you know, when I started studying
computer science, I always wanted to become a software engineer at Google. And yeah, so
funnily enough, like during my PhD at TUM, I ended up doing a couple of internships at Google.
And so I also got that experience.
It was, you know, really, really interesting and inspiring as well, but I always had
this, you know, research in mind.
And I always wanted to go back.
So, yeah, so eventually, as I said,
like I did my PhD at TU Munich,
did these internships.
And my goal was still to become a software engineer
for some reason.
But then, and I think we'll talk about this later,
my research turned out to be a little bit more successful.
And, you know, it got really inspiring
to be in the research community and to have your own projects and so on. So yeah, I mean,
I still worked with Google. Also, like in the later stages of my PhD, we created,
you know, some publications together, and it was all great. But yeah, I think the main turning point was really when I gave,
when I was in California for my talk about learned cardinality estimation.
And I talked to Tim Kraska over there, who I did a postdoc with later at MIT.
And he was basically asking me, you know,
if I would want to join MIT as a postdoc.
And yeah, I always had this dream to go back to the US,
to one of these groups.
And yeah, so that's how it started.
Awesome stuff.
Yeah, it feels like you got to experience
the best of both worlds, right?
You kind of had one foot in each camp
kind of throughout your studies
and kind of getting the experience. And I like what you're saying about the appeal of
kind of the collaborative nature of research and academia and how kind of rewarding that can be as
well. So I can definitely see how you've swung back towards being in research
rather than primarily in software engineering. So let's talk about learned cardinalities then.
So tee us up with some background there.
Give us some of the background information that we're going to need to talk about this
paper today.
Yeah, so learned cardinality estimation.
Yeah, so basically it's, you know, it's about SQL databases and query optimization
specifically.
And yeah, I think the main idea is really,
you know, can we improve, you know,
the estimation of result sizes of SQL queries
and intermediate results using machine learning?
So that's like the research question we asked back then.
And yeah, and as it turned out, you know,
you can have a rather simple model, I would say,
and apply it, you know, to this problem, dedicate some training time to it and get pretty good results out of that.
Yeah, so that's like the, you know, the high level idea.
And yeah, I think one of the reasons we decided to work on that is that, you know, I mean, query optimization is a complex problem.
It, you know, consists of a few sub-problems, I would say.
So one is, you know, you get your query in and you've got to estimate, or basically you've got
to explore the search space for your query plan, right? Like, for example, if you have, you know,
three tables that you want to join, A, B, C,
you could first join A and B and then C, but you could also first join B and C.
So to decide on, you know, such a plan, what you typically do is you enumerate possible plans
and then you cost them. So the first component is this join enumeration component. The second one is the costing component. And then the costing component typically consists of a cost model. And the cost
model would take these cardinality estimates as input. So for example, if I would have my query
A join B, I would want to know how expensive it is to execute? A typical, very simple cost function people use is the C_out cost function,
which basically just counts the number of output tuples of that particular join.
And yeah, and that cost function has an input, which is the cardinality estimates,
which we need to compute and provide to the cost model.
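To make that concrete, here is a minimal sketch of the C_out idea (the plan encoding, the numbers, and the toy estimator are made up for illustration, not any system's actual API): the cost of a plan is just the sum of the estimated output sizes of its join nodes, so better cardinality estimates directly translate into better plan choices.

```python
# Minimal sketch of the C_out cost function: the cost of a plan is the sum of the
# estimated output cardinalities of all join nodes in the plan tree.
# Plan nodes and the estimator below are made-up placeholders, not any system's API.

def c_out(plan, estimate):
    """plan: ("scan", table_name) or ("join", left_plan, right_plan);
    estimate: function mapping a plan to its estimated output cardinality."""
    if plan[0] == "scan":
        return 0.0                      # base table scans contribute no join output here
    _, left, right = plan
    return estimate(plan) + c_out(left, estimate) + c_out(right, estimate)

# Toy estimator with hard-coded cardinalities for two alternative plans over A, B, C.
est = {("join", ("scan", "A"), ("scan", "B")): 1_000_000,
       ("join", ("scan", "B"), ("scan", "C")): 5_000}

def estimate(plan):
    # The final three-way join has the same size in both plans.
    return est.get(plan, 10_000)

plan1 = ("join", ("join", ("scan", "A"), ("scan", "B")), ("scan", "C"))
plan2 = ("join", ("scan", "A"), ("join", ("scan", "B"), ("scan", "C")))
print(c_out(plan1, estimate), c_out(plan2, estimate))   # 1010000.0 vs 15000.0
```

With these made-up numbers, the plan that joins B and C first is clearly cheaper, which is exactly the kind of decision the cardinality estimates drive.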
So it really starts with estimating how many tuples qualify my join.
So if, let's say, I have table A with a predicate like X equals 5,
I would want to know how many tuples would qualify,
what would be the selectivity of that scan.
Same for the scan on B.
And then these base table estimations
are usually pretty simple.
So I would say, you know, to estimate something like X equals five
or even some complex string predicates,
like, you know, you have names
and you want to know how many people start with an A.
That's usually done or pretty easily done using sampling.
So you just take a sample of your database and compute the selectivity on the sample.
And then you would extrapolate to the whole table.
So that's usually doable.
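As a rough sketch of that sampling approach (the table, predicate, and numbers are made up; real systems typically keep a materialized sample rather than drawing one per query):

```python
import random

def estimate_base_table_cardinality(table, predicate, sample_size=1000, seed=42):
    """Estimate how many rows of `table` satisfy `predicate` by evaluating the
    predicate on a uniform sample and extrapolating to the full table."""
    random.seed(seed)
    sample = random.sample(table, min(sample_size, len(table)))
    selectivity = sum(1 for row in sample if predicate(row)) / len(sample)
    return selectivity * len(table)

# Example: how many people have a name starting with 'A'?
people = [{"name": n} for n in ["Anna", "Bob", "Andreas", "Carol"] * 250]
print(estimate_base_table_cardinality(people, lambda r: r["name"].startswith("A")))
```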
It gets more complicated with joins.
And that's the specific problem we targeted with learned cardinalities.
So basically with joins,
you get correlations in between tables.
And yeah, so that's something machine learning
is known to be good at, capturing correlations.
And typically in databases,
this is implemented using simple formulas.
You could, for example, assume independence between these inputs.
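For contrast, this is roughly the kind of textbook formula a traditional optimizer uses under the independence assumption (the statistics and numbers below are made up); correlated data, like the German-actors-in-French-movies example, is exactly where such a formula can be off by orders of magnitude:

```python
def independence_join_estimate(card_a, sel_a, card_b, sel_b, ndv_a, ndv_b):
    """Classic estimate for |sigma(A) JOIN sigma(B)| on a key column, assuming the
    predicate selectivities are independent of the join; the join selectivity
    is taken as 1 / max(ndv_a, ndv_b), the number of distinct join-key values."""
    join_selectivity = 1.0 / max(ndv_a, ndv_b)
    return card_a * sel_a * card_b * sel_b * join_selectivity

# Made-up statistics; prints 500.0, which correlated data can easily contradict.
print(independence_join_estimate(card_a=1_000_000, sel_a=0.01,
                                 card_b=500_000, sel_b=0.02,
                                 ndv_a=200_000, ndv_b=200_000))
```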
So that's, I would say, the background.
And actually something I really wanted to talk about is how this all started.
Yeah, yeah.
Tell us about the origin story.
Yeah, so that's actually, yes,
really, really interesting, I would say.
So it's not like, you know,
how you would normally approach a paper.
At least I think, you know,
most people wouldn't approach it like that.
So as you remember, or as, you know, our listeners might remember,
it was, I think, around 2018 when this entire machine learning for systems topic came out.
So we saw a few first papers on knob tuning and most notably on indexing,
which created a lot of hype around it.
And then also some work on query optimization.
So, yeah, so I was in Munich, you know, with the database group back then doing my PhD. And, you know, we would all discuss, you know, the papers that would come out.
And, yeah, so we thought, you know,
what could we do in that space?
Is there actually something in it?
You know, how does it compare
to more traditional approaches and so on?
So where should we really apply machine learning?
So yeah, so we basically started brainstorming that, right?
As probably every group did at the time.
And yeah, and I think our outcome was that,
our conclusion was that many of the problems
in databases can be solved exactly.
Like, you know, the join enumeration problem
I mentioned is something you can solve
exactly up to 10 joins.
You know, if you have a cost model
and you trust that cost model,
the join enumeration part can be solved using
traditional, let's say, advanced algorithms. Whereas cardinality estimation, as I just
explained it, you know, like finding out how many people's names start with an A, or, let's,
you know, consider the Internet Movie Database, which I used in my work. You know, how many actors, how many German actors participated in French movies, for example, right?
That's a very tough problem to solve.
And it's a fuzzy one.
So it's not like that there's one correct solution or only one correct solution.
But, you know, if you get close to the correct solution, it's already probably good
enough for the task. So it's rather a fuzzy problem. And we all know that machine learning
is really good at these fuzzy problems, right? I mean, it often or like rarely provides, you know,
the exact answer, but often it gets, you know, very close, like these models, like regression
models, they would get pretty close to, you know, forecasting or, yeah, like, you know, just fitting a line, right?
So that's one thing.
And we also, you know, said like, you know, cardinality estimation is this really hard
problem that researchers have worked on for many decades in the database community.
So why not try that?
So yeah, so that's how the motivation came along when we still didn't have a solution.
But the thing is, and that's like, you know, for me at least, yeah, the most fulfilling
one is that, you know, I have a brother working in, like he's not only my brother,
he's also a deep learning researcher.
Oh, really?
Yeah, so that was very convenient or is still.
So yeah, so basically I explained that problem to him
and said like, yeah, we need a model
that can capture these correlations between joins, right?
And he, yeah, he basically immediately said, you know, there's a model that had just come out, released at NeurIPS at the time, called DeepSets.
So it's a neural network architecture that basically has set semantics.
And SQL also, you know, has set semantics. So, like, you know, it doesn't matter,
you know, for example, like our query plan, right? Like, it doesn't matter if I first, you know,
join A and B and then C, the output of that subtree, like in terms of how many tuples
it produces, is always the same no matter which plan I use, right?
So the cardinality estimation problem by itself,
you know, has set semantics.
At least that, you know, specific problem that we targeted.
So yeah, so there was the perfect model for it already.
And yeah, one limitation was still
that this original model was built for one set and we actually needed multiple sets.
So we needed one for joins, we needed one for base tables.
And yeah, so my brother Thomas, you know, sat down with me and we adapted or we created a new model,
which would, you know, consider all these features, have different set modules.
I would, you know, set up the benchmark, provide the feature set.
And yeah, and that's how it ended up being created.
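To give a feel for the architecture, here is a rough, simplified sketch in PyTorch of an MSCN-style multi-set model (this is not the authors' released code, and the dimensions are arbitrary): each set of items (tables, joins, predicates) is passed element-wise through a small MLP, the per-set outputs are averaged, and the concatenated averages feed a final MLP that predicts a normalized, log-scale cardinality.

```python
import torch
import torch.nn as nn

class SetModule(nn.Module):
    """Applies an MLP to each element of a set and averages (permutation-invariant)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, x, mask):
        # x: (batch, set_size, in_dim); mask: (batch, set_size, 1) marking real elements
        h = self.mlp(x) * mask
        return h.sum(dim=1) / mask.sum(dim=1).clamp(min=1)

class MSCNLike(nn.Module):
    """Multi-set model: separate set modules for tables, joins, and predicates."""
    def __init__(self, table_dim, join_dim, pred_dim, hidden_dim=64):
        super().__init__()
        self.tables = SetModule(table_dim, hidden_dim)
        self.joins = SetModule(join_dim, hidden_dim)
        self.preds = SetModule(pred_dim, hidden_dim)
        self.out = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, t, t_mask, j, j_mask, p, p_mask):
        z = torch.cat([self.tables(t, t_mask),
                       self.joins(j, j_mask),
                       self.preds(p, p_mask)], dim=1)
        return self.out(z)  # typically trained against a normalized log-cardinality
```

Because each set is averaged, shuffling the order of tables, joins, or predicates leaves the prediction unchanged, which is the permutation-invariance discussed below.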
That's a fascinating origin story.
It was a family affair sort of thing.
That's brilliant.
And I guess that's why it gets its name of multi-set, right? Because it was kind of taking the idea for a single set
and now we've got multi-sets, right? So, yeah, that kind of explains the name of the network.
So let's talk a little bit more kind of about how the multi-set convolutional networks actually
work then. So, like, what's the secret sauce behind them for being really good at solving this problem?
Yeah, so I think the main intuition is really this set semantics,
right? Like, you don't waste any model capacity on remembering, you know, permutations, because
they're just not important. So it's really about the, yeah, the set aspect of it. So you basically get,
you know, basically, yeah, it needs fewer data samples to converge because it doesn't, you
know, need to learn that, you know, it's actually permutation invariant, what we are, you know,
training on. So that is one thing. But, you know, the model is always, like, you know, in the center of a lot of research. And I still think the model is important. But what I think is even more important are the features: we should not only train on SQL features, right?
Like, so let's say, you know, you have your from clause, you have your where clause, your base table and the join predicates that you could use and sort of featurize and input to the model.
So that's one thing. I would call these sort of the static features that you get, yeah, with your SQL query, with your SQL string. But then there are also these dynamic features, or runtime features I think
they're called in the paper, which, yeah, as the name implies, you would only be able to collect at
query runtime or optimization time. And, yeah, I think that was the main finding, that we should also use this
runtime information. And, I think, yeah, so I talked about sampling earlier, right, when I talked
about base tables, or when we discussed base table scans. So, yeah, what a database system, like a
classical one, would do is it would take these samples, right, and then execute predicates on these samples.
And with MSCN, we did not want to really compete against that. But instead, we said, let's use it
as additional feature input. So we still do all the traditional cardinality estimation,
but we input it as additional features into the model. So we basically learn, you know, if the fifth tuple
in my sample qualifies,
how does that affect our output?
Yeah, and we basically show
that the model can capture
this information,
basically the runtime information
that we get.
Yeah, and that would improve
quality significantly.
And that's how it came along.
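To make the runtime-feature idea concrete, here is a rough sketch of how such a sample bitmap could be built (the predicate encoding and the sample are made up, not the paper's exact featurization): the query's base-table predicate is evaluated on a fixed, materialized sample, and the resulting 0/1 vector is fed to the model alongside the static SQL features.

```python
def sample_bitmap(sample_rows, column, op, value):
    """Evaluate a simple predicate on a fixed materialized sample and return a
    0/1 bitmap; this bitmap is appended to the table's feature vector."""
    ops = {"=": lambda a, b: a == b, "<": lambda a, b: a < b, ">": lambda a, b: a > b}
    return [1 if ops[op](row[column], value) else 0 for row in sample_rows]

# Made-up sample of 10 rows from a table with column "x"
sample = [{"x": v} for v in [1, 5, 5, 7, 2, 5, 9, 3, 5, 8]]
print(sample_bitmap(sample, "x", "=", 5))   # [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]
```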
Awesome. Yeah, that leads into the kind
of the question, I mean, I know it was five years ago now, so maybe you don't remember the
exact numbers, but how much better was it than the traditional approach?
Yeah, so, I mean, and that's always the problem, right? It always depends on your train and test
set, I would say. Yeah, so while we didn't test on the training data,
we still, I mean, we tested on a constrained space, right?
So basically all the training and test data was generated
and then we did our train test split,
but the test data would still follow a similar distribution.
For example, the test data would not contain
completely different predicates.
Like if, you know, in our case back then,
we only supported equality and range predicates.
So we wouldn't add, you know, LIKE predicates to the test data.
So yes, it was a constrained setup.
But for that constrained space, you know, on IMDB, I think it was like six tables back then.
It's, yeah, I mean, it was, you know, up to an order of magnitude better than what you would get with, you know, more traditional approaches.
The downside, of course, was that, you know, you had to train it.
So I think that was the main limitation that, you know, it took a while to train.
You needed a lot of training data.
And at the same time, you know, once you trained it and your data would change or your workload
would change, you know, you might need to retrain again.
So we didn't have update support at the time.
So yeah, and that said, I think back then
for this constraint problem space,
I think we did really well.
But then we also discovered a lot of limitations
to make it really practical.
Yeah, I guess, did you ever find the answer
to the question of how many German actors
played in French movies?
Well, actually, I just came up with this one.
So I don't know, but we asked many of these.
Yeah, it feels like, as you said, you were kind of aware
of the various limitations with it, but it was a nice,
well-scoped piece of work and laid the groundwork for a lot of future work then.
So I guess let's talk about the impact that this paper has had
over the past five years since it was published.
And have many of these shortcomings been addressed?
Yeah, so it's crazy how time flies, like you just said, you know.
Yeah, five years. I mean, that's crazy.
But yeah, so there has been a lot of follow-up work, actually, addressing our limitations.
I was involved in some of it, but only very limited.
So there have been many papers, for example, addressing updates, which was one of our main limitations. I still wouldn't say it's solved yet,
just because it's a very hard problem
in machine learning in general, right?
It's not something very specific just to databases.
I mean, of course, we can collect new training data and so on,
but you still got to update your model
and basically make sure that your model sort of scales
to your new feature dimensions
and so on. So yeah, so there's not that much you can do there, I think. So yeah, there's still
some update cost for sure. On the other hand, people showed that, you know, in some cases,
you might actually not need a deep net for it, right? So you can actually, you know, use simple gradient boosting techniques to get pretty close.
You might not be able to outperform it, but you can at least get pretty close.
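As a rough sketch of that simpler alternative (assuming scikit-learn; the features and cardinalities below are random placeholders): you can fit a gradient-boosted regressor on the same kind of query features, typically predicting the cardinality in log space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: query feature vectors (e.g. one-hot tables/joins/predicates plus sample bitmaps)
# y: true cardinalities collected by executing training queries
rng = np.random.default_rng(0)
X = rng.random((5000, 64))
y = np.exp(rng.normal(8, 2, size=5000))     # placeholder cardinalities

model = GradientBoostingRegressor(n_estimators=200, max_depth=6)
model.fit(X, np.log1p(y))                   # train in log space to tame the skew

estimated_cardinality = np.expm1(model.predict(X[:1]))
```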
And then one other limitation was around, you know, low selectivity cases.
Let's say, you know, I select an actor in my internet movie database who only participated
or played in a single movie, right? So I might not have seen that person during training. So
it might be very hard to estimate what that actor, you know, how many movies the actor
participated in. So that's something we cannot really, you know, solve with a supervised approach like MSCN is.
So yeah, so there have been a lot of follow-up works
in the unsupervised space as well,
like NARU and DeepDB, just to name two of them,
which, yeah, address specifically these cases.
So they basically train an autoregressive model
over the entire input.
So they would have seen, you know,
those rare cases as well.
Yeah, and finally, I would say string predicates
is something we didn't handle.
So there have been a few follow-ups.
Uncertainty estimation is something which,
you know, you actually need
before you deploy something like this in production
because otherwise, how do you know
if the model performs well enough to be deployed?
So that is something a few groups worked on.
And I think there still needs to go more research into that
to be able to trust these models fully
because, yeah, to date,
they haven't been adopted by many of the big players.
So there's one example in Microsoft Azure Cosmos DB.
They implemented a similar approach as far as I'm aware of.
And also there are other database companies
that are implementing similar approaches still today.
But yeah, so they're usually very careful with,
and which makes total sense, right?
I mean, you don't want to ship a model
where you're not sure that it performs well.
Yeah, you preempted one of my questions there.
It was going to be, have any of these techniques
found their way into commercial systems yet?
But I guess they're still sort of finding their way
and gaining confidence in these techniques
and the stability of them before they unleash them
into the wild, I guess.
Quickly on the gradient boosting aspect you mentioned as well: you can kind of get a lot of the way with techniques such as that, and I guess it's the benefit of gradient
boosting that they are quicker to train as well compared to, like, a deep network? Because that's kind
of the main advantage, that the training time is a lot smaller.
Yeah, so that is one advantage. It's also smaller, so it might better fit into RAM. But I wouldn't
necessarily say that the training time of the model is the big bottleneck in
learned cardinality estimation. I think it's much more about training data collection, for example.
I mean, of course, like if you're a big player, like one
of the big, you know, cloud data warehousing companies, I mean, you basically all have all
of this training data already, right? Just because you have this like, yeah, large customer base,
and you can potentially, you know, you know, train a model maybe for each customer on its own,
or maybe you can do something,
you know, in an anonymized form across customers or something like that. But typically, if you
don't have that, I mean, you're basically starting from scratch, right? And you've got to sort of
cold start and deploy your model. So the thing, like the approach we followed is,
we basically, I mean, back then we hard-coded the important joins.
But what you could do is you could just get that from the user.
And then for the important ones, you actually collect training data.
So you execute queries in the background with different predicates.
You get all that training data, sample the entire subspace, and then train your model on that.
And I think in that case, model training is really not the bottleneck.
It's much more about training data collection.
I had another question bouncing around my mind.
What was I going to ask?
Oh, yes.
It was about the data and the workloads changing,
and, maybe, the model's accuracy decreasing over time, so you need to do this sort of
collecting-new-data, retraining process. Is that an issue? It's maybe very workload-sensitive or data-
sensitive, but how quickly can things become stale? And what is it a function of,
primarily? How quickly your model's accuracy drops off as the data or workload is changing, or is
it, like I said, workload-dependent?
Yeah, so it really depends on your data and workload changes.
So for example, you know, if someone adds a new column to your database, there's not much you can do about it, right?
I mean, you haven't seen that, like predicates on that column before.
Likewise, you know, if you add, you know, 10% new tuples, which follow a completely different distribution, there's also not much you can do.
But in my experience, I mean, data usually follows the same distribution, right? Like
a sales database won't suddenly contain weather data. Well, maybe it might.
Yeah. But I would say it's usually following a similar distribution, so usually it should be
good for a while. And there have been a few papers studying this exact question. But, yeah, again, it
really depends. So, for example, what the model does is it normalizes column features. So, basically,
when I have a predicate on a given column, that, you know, would say, like, same example as
before, X equals five, and the column domain for X is between one and 100 at training time. But now,
at test time, you know, people ended up inserting values up to a million. You know, the model
wouldn't be able to guess that. So, yeah, it really depends on the kind of changes and whether they're breaking changes, I would say.
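A tiny illustration of the kind of breaking change he means (min-max scaling here is just an assumed encoding for the example, not necessarily the paper's exact scheme):

```python
def normalize_literal(value, col_min, col_max):
    """Scale a predicate literal into [0, 1] using the column domain seen at training time."""
    return (value - col_min) / (col_max - col_min)

# Training-time domain for X was [1, 100]; values up to 1,000,000 appear later.
print(normalize_literal(5, 1, 100))          # ~0.04  -> well inside what the model saw
print(normalize_literal(1_000_000, 1, 100))  # ~10101 -> far outside the trained range
```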
So after the learned cardinalities paper, you then went on to do a postdoc
and then eventually going on to work at AWS in their Learned Systems Group. So tell me about how
that paper then led to that. And then I guess after you were at AWS, you kind of came back to
the University of Technology Nuremberg. So, yeah, tell us about that story a bit more.
Yeah, so basically after I, you know, submitted and published the learned cardinality estimation work, I wrapped up my PhD
at Munich, joined MIT, and, yeah, so I worked, in the beginning, with Ryan Marcus and Tim Kraska, like that was,
you know, how it started on this SageDB prototype, which is, so for those not familiar, it's basically
a system that tries to, you know, implement different learning techniques in a coherent way.
So yeah, so we ended up doing some, you know, storage optimizations.
Ryan worked on some query optimization aspects.
And we created a research prototype.
And, yeah, so that went on for a while.
And then eventually we decided, you know, we could have most impact in industry because we're simply lacking workload.
And that's, I think, still nowadays a big problem.
In academia, you don't really have access to the real problems.
So we decided we could have most impact in industry
with these kinds of techniques. Yeah, so we joined, this group joined Amazon, and we founded the Learned Systems Group
led by Tim Kraska.
And yeah, and over there, it was actually very interesting to see how the real problems
actually look like after we solved the not so real ones in academia.
And yeah, so we get access to workload statistics
and we're able to build sort of new solutions
and for example, in the storage space
and also in terms of workload forecasting at AWS
and specifically at Redshift.
Yeah, and that was really inspiring.
And I think I got a lot out of that time.
So you're at Amazon and things are going well.
You've got this new group you founded
and you've got all this access to all these nice real workloads. You've seen behind the curtain and had access to all this really
cool data and stuff. So then eventually you kind of went back to academia and
you ended up at the University of Technology Nuremberg. So tell us how that happened.
Yeah, so that's a very interesting one. So I've been following, you know, the creation of UTN, as I'll call it for short, for many years now.
So it's, you know, it's a second technical university in Bavaria next to TU Munich.
So it was, you know, it's a big project by the Bavarian government.
And, you know, it's, yeah, it's been many years since the last one was founded.
I think it was 40 years ago, the University of Passau.
So another one in Bavaria.
So it's a big thing, you know, to have a new university.
So you don't get this chance, you know, multiple times in your life, I would say.
Yeah, so I was very excited from the beginning on about the creation of this, you know, new buildings, new campus,
new everything, pretty much. And, yeah, I mean, on the other hand, you know, with every new place
there's a lot of work. So you've got to, you know, build it up first, and there's really more work than
you would have in other places. But, yeah, I see it more as a chance. And, yeah, so it was sort of in the back of my mind
for many years already to, you know,
to get a spot, get a research position at this place.
And it was also one of the reasons
why I decided to do that postdoc at MIT
because I had this idea in mind.
And I was like, yeah, you know,
if this comes around, you know,
I would pursue academia. And so basically, you know, I saw the opening, I applied, and I got
the offer, and, yeah, I just couldn't resist.
Fantastic. Yeah, it must be really nice to have the
blank slate to kind of go and create something. So, yeah, while it's probably a lot of work initially
to kind of get things off the ground, this is like an opportunity you'll probably never get again, right?
Like you say, they don't make universities every day, right?
New universities.
So yeah, that's fantastic.
So I guess kind of, yeah,
what are you working on at the moment at UTN?
Yeah, so some of the learnings at Amazon
were, you know, that, you know,
we are basically working on the wrong problems,
as I said, in academia.
And it's a lot about, I think,
it's a lot about usability of these systems.
And you might know that from your experience at Neo4j.
If people cannot really use your system,
you can optimize performance however you want.
It just doesn't matter, right, at the end of the day.
Doesn't mean that performance
isn't important. So I still would say performance is a very important aspect, especially the cost
component of it. But I would say, you know, one of the most important things is you want to make it
easy to use. And this is actually something which, well, I discovered during my first class that I taught at UTN.
And yeah, so basically I had an exercise with my students where they had to import some log files into a data warehouse.
And yeah, I was basically, you know, helping them out and seeing how they're doing. And I found that, you know, they were basically, you know,
manually creating, you know, table statements and so on.
And yeah, it just took them a very long time.
And they had a, yeah, I mean, they basically just couldn't really import it,
you know, in the given time, right?
It was just, there were just too many edge cases to consider
and too many questions being asked.
So, yeah, so I just thought, you know,
it would be great to automate this process.
And nowadays with generative AI, I think it's a great chance for,
you know, for us to create easier use interfaces
to automate a lot of this.
Yeah, and so we basically started a project
that would ease the data loading aspect.
And yeah, I think my vision for that
is that you would just have,
you just give it an S3 URL
with some Parquet, CSV, JSON files, whatnot in there, and it would
just automatically figure out everything for you, right?
It would basically suggest some schema.
It would, you know, tell you, you know, that's the data you should look at.
That's what, that's how you should import it.
That's the system you want to use and all of that.
And ideally with some, you know, I mean, not ideally, but possibly with some LLM interaction, to basically have the user in the loop. Yeah.
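As a very rough sketch of one small piece of that vision (purely illustrative; pandas is assumed, the file path is hypothetical, and a real tool would need far more edge-case handling than this): sample a file, infer column types, and suggest a table definition.

```python
import pandas as pd

def suggest_create_table(path: str, table_name: str, sample_rows: int = 10_000) -> str:
    """Read a sample of a CSV file and suggest a CREATE TABLE statement from the
    inferred column types. Real log files need far more edge-case handling."""
    df = pd.read_csv(path, nrows=sample_rows)
    type_map = {"int64": "BIGINT", "float64": "DOUBLE", "bool": "BOOLEAN",
                "datetime64[ns]": "TIMESTAMP", "object": "VARCHAR"}
    cols = ",\n  ".join(f"{c} {type_map.get(str(t), 'VARCHAR')}"
                        for c, t in df.dtypes.items())
    return f"CREATE TABLE {table_name} (\n  {cols}\n);"

# Hypothetical usage (requires s3fs for S3 paths):
# print(suggest_create_table("s3://my-bucket/logs/2024-07-01.csv", "web_logs"))
```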
Honestly, you hit the absolute nail on the head there with, like, data import being one of the most horrible experiences in the world.
I mean, so many times I've come across this: I'm like, okay, I'll fire up an instance of this,
and then I spend the next three days trying to get the data into
the damn thing. Yeah, I think a few of these gray hairs I'm getting are because
of that, I think, over the years. But, yeah, you are right as well that I think
often in academia we're very much focused on performance, performance, performance, whereas
this is the other dimension of ease of use, usability, right? And it being easy to use.
you want it to run fast, right?
Anyway, that sounds fantastic.
I look forward to seeing the results of that research.
Cool.
So let's kind of take a little retrospective now and kind of look back over your career to date.
And what are you most proud of in your career?
Yeah, I think I would probably still name the cardinality estimation work.
And not because it's the most difficult or most impactful one out there,
but simply because of the fact that it was done together with
my brother.
And this is just something, you know, very special to me.
And yeah, it was also a process that I really enjoyed.
So yeah, this is the number one.
Yeah, I can definitely see that.
It must have been super nice to work with your brother on that.
Do you still actively collaborate?
Because, I mean, obviously, there's still a reasonable amount of overlap between your two fields of interest. So, yeah, is that something you're kind of going to
do more of in the future?
Yeah, so I wish I could, actually. So back then he was, you know, still
doing a PhD at the University of Amsterdam, but now, yeah, he's at
Google DeepMind, so he's busy doing fancy stuff.
Yeah, okay, cool. Always good
to have those contacts though, right? Yeah. And, cool, we can jump on to sort of the
section on motivation then, and sort of what's motivated you over your career, and what
are, or if there are, any sort of specific papers
and people who have motivated you throughout your career?
Yeah, I would say the learned indexing paper
by Tim Kraska is probably the one
that influenced me the most
or that had the most impact on my career
just because I got inspired by it, right, to do the cardinality estimation
work, to do something in the learning space, and also, which, you know, then led me to
eventually do a postdoc in that space with Tim Kraska. Yeah, so that was, I would say, the most impactful one on me personally as well.
And yeah, otherwise, I mean,
I was fortunate to be part of a very strong group
at TU Munich,
led by Alfons Kemper and Thomas Neumann.
And yeah, so I was also very inspired by their work and especially, you know, Thomas' creation of HyPer, you know, the in-memory database system that got, you know, acquired by Tableau back then.
And, you know, many of his papers were very influential at the time.
So, for example, one that I remember is the
adaptive optimization of very large
join queries, which I found
a very nice read
and I recommend
it to everyone to take
a look. He
basically showed that up to 10 joins
you can get
exact answers in join enumeration,
which I think is a great result. And yeah, so I think I was mostly inspired
by the people around me.
But yeah, I was just fortunate to be at TU Munich.
Cool. Yeah.
So I guess kind of going off motivation
and sort of things like this, what's
the best piece of advice anyone's ever given you?
I think it's really, you know, that you
should not expect your projects to all work out, right? It's really, like, research should be
about, you know, trying out crazy ideas.
That's why we are in research.
If the ideas, you know, weren't crazy,
I mean, we could also, you know, be in industry, which is, you know, not a bad thing. But in industry, you want to make sure that things actually work at the end of the day.
And you have, you know, customers paying for it.
And that's like this main privilege that we have in academia that we can do, can try out crazy things.
So I would say, you know, like the best advice I've been given is really to, you know, to be adventurous and try out things, even if they don't end up working in the end.
Yeah, just try it, and, yeah, I mean, eventually, you know,
something will stick, and it might be a success, it might take a while, but, yeah, being optimistic
and trying out crazy things, I think that's the main advice I've been given.
And I need to do more of that. I think everyone needs to. We are all a bit conservative now and then,
you know, just because we want to get publications out
and get things done.
But yeah, having that moonshot project,
I would say now and then,
I think is great advice.
Nice. Yeah, just keep iterating.
And I guess when you're in academia as well,
you have the freedom to do that, right?
You're not at the mercy of shareholders like you are when you're in industry. So, yeah,
awesome stuff. I like that, that's a good answer. And you kind of touched on a few things
there, which is a nice segue into the next question I want to ask you, and that's about:
it doesn't always work first time, and there are setbacks, and things get rejected. So I want to ask kind of what your process is for dealing with
that. So, I mean, yeah, first of all, I would say, you know, I mean, you, you get better,
like in that aspect over time. And I think I, yeah, I had many setbacks throughout my,
you know, professional, but also personal life as everyone, you know, has it. I
mean, it's, it's just that people don't talk about it. But, you know, the more experience you gather,
the more setbacks you will also gather. And I think the most important thing for me is that,
again, you know, is to stay optimistic and to stay positive. So if your paper gets rejected a couple of times,
and, you know, I had many rejections throughout my PhD and also postdoc.
And yeah, just stay positive.
And one thing I like to say is, you know, every downhill is followed by an uphill.
And I think it's just, you know, it's just
important, you just want to make sure to take enough momentum with you on the downhill,
you know, to have a good uphill. So, yeah, I think that's my approach to it.
Yeah, so, yeah, you've got to ride the roller coaster, right? That's it, there's always another wave coming along, so,
yeah, just make sure you've got enough momentum so you'll be just fine. I like that, great. Cool. Yeah, my next
question, Andreas, this is actually my favorite question that I always ask my guests
on the show, and it's about the creative process. Do you have a systematic approach
to idea generation? And then, once you have generated a set of ideas, how do you
then choose which ones to pursue? Yeah, what's your approach to that?
Yeah, it's also a very good one. So I don't think I have, you know, the approach for it. So I'm
still learning, and so far, you know, I've been, you know, inspired a lot by people
around me. So it's, you know, sometimes it's your own idea. Sometimes it's someone else's idea. So,
but I think, and that already brings me to my answer, is the important thing is, in that process,
at least for me, is to work with people, right? To basically brainstorm with a group of people,
ideally from different fields.
So because if you just talk to people in databases,
they would probably all have a similar opinion.
But if you suddenly talk to people in machine learning,
they might have some new ideas
or there might be something very interesting on that intersection.
And that's actually something really nice now at my new place, UTN.
So I'm basically the only systems professor there at this point
and everyone else is in AI and machine learning.
So it's really easy to work with, you know, AI researchers right now.
They're very approachable and the brainstorming, you know, it really works.
It takes some time to, you know, in such an interdisciplinary setting.
I mean, it's still, you know, computer science, but there are still some differences in how research is done and what is interesting, what not.
And also the background, you know, my background in systems,
their background in machine learning and so on.
But yeah, I think this interdisciplinary approach is the right one for me at least.
And especially, you know, given my current circumstances with the new school.
And yeah, I also like to involve my students in that, in the idea generation, because I think one of the most important things is that you basically have ownership of the idea.
And I think students are most motivated if they, you know, basically contribute in that stage already, right?
I mean, if they were just given some tasks to do, this is not, you know,
really inspiring, right? So you should basically give them the freedom to participate in the
brainstorming, have their own ideas, work on their own ideas, and that's how people will, yeah, be
successful.
Yeah, like get that early-stage buy-in right from them. So, yeah, that's a lovely answer
to that question. So, yeah, my next question
is about the interaction between academia and industry, and obviously you've got a very good
perspective because you've experienced both, and I wanted to get your take on what you think the
current interaction is between these two different groups at the moment and how that can be improved going forward.
Yeah, so this is something which I really care about.
And I think there's a lot that can be improved
just because, you know, as I said in the beginning,
I mean, research is just too decoupled right now.
And, you know, this might be fine for theory research,
but it's not fine, or at least not, you know, not what we want for, you know,
data systems research, which is, you know, something very practical by nature.
And we should be thinking about, you know, technology transfer and so on.
And yeah, technology transfer just wouldn't work if you, you know, work on the wrong problems
in academia. I think it's really
important to have many interactions with industry and to enable such, you know, technology
transfers. And especially in my research, you know, in ML for systems, where you need access to
data and workloads, yeah, you just cannot do the same research in academia without industry.
So yeah, so I would say it's really important
to have these projects.
You know, we as academics, we should do sabbaticals,
you know, go to these companies for a while,
come back, work on their problems.
And yeah, just have more exchange,
however that's being done.
There's obviously, you know, legal restrictions and so on,
but we got to push and we got to, you know,
try to find solutions to that.
And yeah, so that's one aspect in, you know,
like just how to work together.
But I would say there's also something
on the more technical side.
So, you know, it doesn't just, you know, mean that we need, you know,
to basically optimize for their workloads. But if we build something, it should also be something
that at least can be transferable, you know, in the future at some point, right? It shouldn't be
that, you know, we work on some prototype, which is not practical at all or would take forever to integrate into their systems.
And I think the way out there is really to, you know, to invest in open standards.
And I think Apache Iceberg, so the data lake format is a good example.
You know, we should invest in these.
And then, you know, we can sort of, you know, decouple again and work on these problems
independently. But then, you know, when integrating it, it's not a big deal. I mean, it doesn't mean
that it has to be directly transferable and useful, but at least we should make sure that
we don't go in totally different directions. So that's that part.
But at the same time,
and that's something we discussed before, right?
Like we should not just, you know,
work for industry in research, right?
I mean, there should be some interaction,
but it shouldn't be that we only, you know, focus on that.
There should also be enough time in academia to do these sort of moonshot projects, right?
Which might not be very practical today,
but they might be practical in five or 10 years.
And yeah, I mean,
if you just think about large models nowadays, right?
I mean, it's not a commodity, you know,
to train such a one,
but it might be in a couple of years.
So yeah, and in research,
we can already think about, you know, what
would happen then and how we should change things.
So everyone's been told now: we need some time to work on our moonshot projects, we also
need to work closely so we can get access to that real-world data, and, yeah, taking people,
taking sabbaticals, and, yeah, just kind of sharing more between each other, being more open, and people
getting exposed to both different settings is, I think, yeah, definitely the way to go for sure.
Cool. So let's talk about the future as well, while we're on this sort of topic.
And what do you think are the most exciting advancements?
I mean, I can maybe think of one or two already, these large language models and Gen AI and all things in that sort of space.
But, yeah, what are the most exciting advancements for you that you've observed recently?
Yeah, I think there's not much more to add because it's really about generative AI. Like,
everyone's thinking about it, and especially, you know, me being at this new school now with
all the AI researchers, I mean, there's just so much, you know, so much interaction you can do. And
yeah, you can, you know, basically, well, you can go two directions, right? You can think about
integrating them into data systems, making them easier to use, for example. But you can also think
about, you know, speeding up model training and so on. And that's also something we started looking
into, like, how can we help them with our system knowledge
to accelerate or improve their systems?
And I think one example is, you know, in training,
I mean, you basically got to keep all the model weights
in GPU memory, right?
So yeah, there are many quantization techniques out there.
And yeah, so that's like a little project
that we just started, you know,
to look into how can we, you know, compress these weights even further with the knowledge we have in
databases. So that is one thing. But yeah, I mean, especially now at UTN, we also have a lot of
people working on robotics. And that is something I never thought of before, but now, you
know, when I go into the office, I mean, I see those robots, right? So you get
inspired by it. And apparently there are also, you know, foundation models for robotics nowadays.
So, you know, you can basically, you know, train a robot on, you know, data sets and basically simulate its behavior
and then transfer it to a real robot.
And yeah, so they also require a lot of training data,
a lot of video data and so on.
And yeah, I'm also thinking about
how can we help them to basically
speed up their training pipeline
and make these
systems more efficient.
Awesome stuff. Yeah, it feels like we're at a very sort of critical juncture for
Gen AI, and I kind of wonder what the world's going to look like in 20 years. And, I mean, you look back
at kind of how the internet's changed the world, right? I mean, you wonder in what ways, positively
and maybe negatively as well, these Gen AI systems, or Gen AI, is going to change the world. It's going to be interesting to see for sure. But, yeah, we'll
end things there. Andreas, thank you so much for coming to talk to us today. It's been a
fascinating chat, and I'm sure the listener will have thoroughly enjoyed it.
Thanks for having me, Jack. Yeah, it was a pleasure.
Fantastic, we'll end it there then. So thank you very much for
coming on, and, yeah, we'll see you all next time for some more awesome computer science research.