Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Andreas Kipf
Episode Date: July 15, 2024

In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities". Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

Papers mentioned on this episode:
Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR'19)
The Case for Learned Index Structures (SIGMOD'18)
Adaptive Optimization of Very Large Join Queries (SIGMOD'18)

You can find Andreas on: Twitter, LinkedIn, Google Scholar, and the Data Systems Lab @ UTN.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
As usual, Jack here to take you on this journey.
Before we do start, I'd like to give a shout out to our sponsor, Pometry.
Pometry are the developers behind Raphtory,
the open source temporal graph analytics engine for Python and Rust.
Raphtory supports time traveling, multi-layer modeling,
and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal motifs mining.
It's blazingly fast, scales to hundreds of millions of edges on your laptop,
and connects directly to all your data science tooling, including Pandas, PyG, and Langchain.
So go check out what the Pometry guys are doing at www.raphtory.com,
where you can dive into their tutorial on their latest release.
On to the show. So today is another installment of our high impact series. For the listeners who
are new to this type of episode, it was originally inspired by a blog post by Ryan Marcus on the most
influential database papers. And today we are going to be speaking to Andreas Kipf, who has
the number one spot for 2019 with his paper on Learned Cardinalities: Estimating Correlated Joins
with Deep Learning, which was published at that year's edition of CIDR. Andreas is a professor
at the University of Technology Nuremberg. Before that, he worked as a senior applied scientist at AWS in their
Learned Systems Group, where he was one of the founding members. Before that, even, he was a
postdoc at MIT's Data Systems Group. And before that, he did his PhD at the Technical University
of Munich. Andreas's research interests, you can correct me if this is wrong, Andreas, but you
like to improve systems with machine learning, with a focus on index structures, storage layouts, and query
optimization. Welcome to the show, Andreas.
Yeah, so glad to be on the show. Thanks, Jack, for the invitation. Yeah, looking forward to our discussion.
Fantastic. So I gave you a very high-level sort of overview of your career so far there when I
introduced you, but help us color in between the lines a little bit and tell us more about what
that journey was actually like, and, yeah, kind of what led you to become a researcher.
Yeah, so I would go back a few years in time. So during my graduate studies, like during my master's at TUM, I did a master's thesis at UC Berkeley.
So that was like sort of an exchange I did.
And I was part of a collaborative research group over there.
And I was really inspired by, you know, how people approached research.
So my advisor back then was Professor Eric Brewer, who was also at Google at the time.
So, yeah, I was very inspired by, you know, how people worked in groups on collaborative
projects, not only at UC Berkeley, but also with people at UW in Seattle.
And I really enjoyed that part. So yeah, so basically, you know,
I changed my mind a bit because in the very beginning, you know, when I started studying
computer science, I always wanted to become a software engineer at Google. And yeah, so
funnily enough, like during my PhD at TUM, I ended up doing a couple of internships at Google.
And so I also got that experience.
It was, you know, really, really interesting and inspiring as well, but I always had
this, you know, research in mind.
And I always wanted to go back.
So, yeah, so eventually, as I said,
like I did my PhD at TU Munich,
did these internships.
And my goal was still to become a software engineer
for some reason.
But then, and I think we'll talk about this later,
my research turned out to be a little bit more successful.
And, you know, it got really inspiring
to be in the research community and to have your own projects and so on. So yeah, I mean,
I still worked with Google. Also, like in the later stages of my PhD, we created,
you know, some publications together, and it was all great. But yeah, I think the main turning point was really when I gave,
when I was in California for my talk about learned cardinality estimation.
And I talked to Tim Kraska over there, who I did a postdoc with later at MIT.
And he was basically asking me, you know,
if I would want to join MIT as a postdoc.
And yeah, I always had this dream to go back to the US,
to one of these groups.
And yeah, so that's how it started.
Awesome stuff.
Yeah, it feels like you got to experience
the best of both worlds, right?
You kind of had one foot in each camp
kind of throughout your studies
and kind of getting the experience. And I like what you're saying about the appeal of
kind of the collaborative nature of research and academia and how kind of rewarding that can be as
well. So I can definitely see how you've swung back towards being in research
rather than primarily in software engineering. So let's talk about learned cardinalities then.
So tee us up with some background there.
Give us some of the background information that we're going to need to talk about this
paper today.
Yeah, so learned cardinality estimation.
Yeah, so basically it's, you know, it's about SQL databases and query optimization
specifically.
And yeah, I think the main idea is really,
you know, can we improve, you know,
the estimation of result sizes of SQL queries
and intermediate results using machine learning?
So that's like the research question we asked back then.
And yeah, and as it turned out, you know,
you can have a rather simple model, I would say,
and apply it, you know, to this problem, dedicate some training time to it and get pretty good results out of that.
Yeah, so that's like the, you know, the high level idea.
And yeah, I think one of the reasons we decided to work on that is that, you know, I mean, query optimization is a complex problem.
It, you know, consists of a few sub-problems, I would say.
So one is, you know, you get your query in and you've got to estimate, or basically you've got
to explore the search space for your query plan, right? Like, for example, if you have, you know,
three tables that you want to join, A, B, C,
you could first join A and B and then C, but you could also first join B and C.
So to decide on, you know, such a plan, what you typically do is you enumerate possible plans
and then you cost them. So the first component is this join enumeration component. The second one is the costing component. And then the costing component typically consists of a cost model. And the cost
model would take these cardinality estimates as input. So for example, if I would have my query
A join B, I would want to know how expensive it is to execute? A typical, very simple cost function people use is the C_out cost function,
which basically just counts the number of output tuples of that particular join.
And yeah, and that cost function has an input, which is the cardinality estimates,
which we need to compute and provide to the cost model.
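To make that concrete, here is a minimal sketch of the C_out idea (the plan encoding, the numbers, and the toy estimator are made up for illustration, not any system's actual API): the cost of a plan is just the sum of the estimated output sizes of its join nodes, so better cardinality estimates directly translate into better plan choices.

```python
# Minimal sketch of the C_out cost function: the cost of a plan is the sum of the
# estimated output cardinalities of all join nodes in the plan tree.
# Plan nodes and the estimator below are made-up placeholders, not any system's API.

def c_out(plan, estimate):
    """plan: ("scan", table_name) or ("join", left_plan, right_plan);
    estimate: function mapping a plan to its estimated output cardinality."""
    if plan[0] == "scan":
        return 0.0                      # base table scans contribute no join output here
    _, left, right = plan
    return estimate(plan) + c_out(left, estimate) + c_out(right, estimate)

# Toy estimator with hard-coded cardinalities for two alternative plans over A, B, C.
est = {("join", ("scan", "A"), ("scan", "B")): 1_000_000,
       ("join", ("scan", "B"), ("scan", "C")): 5_000}

def estimate(plan):
    # The final three-way join has the same size in both plans.
    return est.get(plan, 10_000)

plan1 = ("join", ("join", ("scan", "A"), ("scan", "B")), ("scan", "C"))
plan2 = ("join", ("scan", "A"), ("join", ("scan", "B"), ("scan", "C")))
print(c_out(plan1, estimate), c_out(plan2, estimate))   # 1010000.0 vs 15000.0
```

With these made-up numbers, the plan that joins B and C first is clearly cheaper, which is exactly the kind of decision the cardinality estimates drive.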
So it really starts with estimating how many tuples qualify my join.
So if, let's say, I have table A with a predicate like X equals 5,
I would want to know how many tuples would qualify,
what would be the selectivity of that scan.
Same for the scan on B.
And then these base table estimations
are usually pretty simple.
So I would say, you know, to estimate something like X equals five
or even some complex string predicates,
like, you know, you have names
and you want to know how many people start with an A.
That's usually done or pretty easily done using sampling.
So you just take a sample of your database and compute the selectivity on the sample.
And then you would extrapolate to the whole table.
So that's usually doable.
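As a rough sketch of that sampling approach (the table, predicate, and numbers are made up; real systems typically keep a materialized sample rather than drawing one per query):

```python
import random

def estimate_base_table_cardinality(table, predicate, sample_size=1000, seed=42):
    """Estimate how many rows of `table` satisfy `predicate` by evaluating the
    predicate on a uniform sample and extrapolating to the full table."""
    random.seed(seed)
    sample = random.sample(table, min(sample_size, len(table)))
    selectivity = sum(1 for row in sample if predicate(row)) / len(sample)
    return selectivity * len(table)

# Example: how many people have a name starting with 'A'?
people = [{"name": n} for n in ["Anna", "Bob", "Andreas", "Carol"] * 250]
print(estimate_base_table_cardinality(people, lambda r: r["name"].startswith("A")))
```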
It gets more complicated with joins.
And that's the specific problem we targeted with learned cardinalities.
So basically with joins,
you get correlations in between tables.
And yeah, so that's something machine learning
is known to be good at, capturing correlations.
And typically in databases,
this is implemented using simple formulas.
You could, for example, assume independence between these inputs.
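For contrast, this is roughly the kind of textbook formula a traditional optimizer uses under the independence assumption (the statistics and numbers below are made up); correlated data, like the German-actors-in-French-movies example, is exactly where such a formula can be off by orders of magnitude:

```python
def independence_join_estimate(card_a, sel_a, card_b, sel_b, ndv_a, ndv_b):
    """Classic estimate for |sigma(A) JOIN sigma(B)| on a key column, assuming the
    predicate selectivities are independent of the join; the join selectivity
    is taken as 1 / max(ndv_a, ndv_b), the number of distinct join-key values."""
    join_selectivity = 1.0 / max(ndv_a, ndv_b)
    return card_a * sel_a * card_b * sel_b * join_selectivity

# Made-up statistics; prints 500.0, which correlated data can easily contradict.
print(independence_join_estimate(card_a=1_000_000, sel_a=0.01,
                                 card_b=500_000, sel_b=0.02,
                                 ndv_a=200_000, ndv_b=200_000))
```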
So that's, I would say, the background.
And actually something I really wanted to talk about is how this all started.
Yeah, yeah.
Tell us about the origin story.
Yeah, so that's actually, yes,
really, really interesting, I would say.
So it's not like, you know,
how you would normally approach a paper.
At least I think, you know,
most people wouldn't approach it like that.
So as you remember, or as, you know, our listeners might remember,
it was, I think, around 2018 when this entire machine learning for systems topic came out.
So we saw a few first papers on knob tuning and most notably on indexing,
which created a lot of hype around it.
And then also some work on query optimization.
So, yeah, so I was in Munich, you know, with the database group back then doing my PhD. And, you know, we would all discuss, you know, the papers that would come out.
And, yeah, so we thought, you know,
what could we do in that space?
Is there actually something in it?
You know, how does it compare
to more traditional approaches and so on?
So where should we really apply machine learning?
So yeah, so we basically started brainstorming that, right?
As probably every group did at the time.
And yeah, and I think our outcome was that,
our conclusion was that many of the problems
in databases can be solved exactly.
Like, you know, the join enumeration problem
I mentioned is something you can solve
exactly up to 10 joins.
You know, if you have a cost model
and you trust that cost model,
the join enumeration part can be solved using
traditional, let's say, advanced algorithms. Whereas cardinality estimation, as I just
explained it, you know, like finding out how many people's names start with an A, or, let's,
you know, consider the Internet Movie Database, which I used in my work. You know, how many actors, how many German actors participated in French movies, for example, right?
That's a very tough problem to solve.
And it's a fuzzy one.
So it's not like that there's one correct solution or only one correct solution.
But, you know, if you get close to the correct solution, it's already probably good
enough for the task. So it's rather a fuzzy problem. And we all know that machine learning
is really good at these fuzzy problems, right? I mean, it often or like rarely provides, you know,
the exact answer, but often it gets, you know, very close, like these models, like regression
models, they would get pretty close to, you know, forecasting or, yeah, like, you know, just fitting a line, right?
So that's one thing.
And we also, you know, said like, you know, cardinality estimation is this really hard
problem that researchers have worked on for many decades in the database community.
So why not try that?
So yeah, so that's how the motivation came along when we still didn't have a solution.
But the thing is, and that's like, you know, for me at least, yeah, the most fulfilling
one is that, you know, I have a brother working in, like he's not only my brother,
he's also a deep learning researcher.
Oh, really?
Yeah, so that was very convenient or is still.
So yeah, so basically I explained that problem to him
and said like, yeah, we need a model
that can capture these correlations between joins, right?
And he, yeah, he basically immediately said, you know, there's a model that had just come out, released at NeurIPS at the time, called DeepSets.
So it's a neural network architecture that basically has set semantics.
And SQL also, you know, has set semantics. So, like, you know, it doesn't matter,
you know, for example, like our query plan, right? Like, it doesn't matter if I first, you know,
join A and B and then C, the output of that subtree, like in terms of how many tuples
it produces, is always the same no matter which plan I use, right?
So the cardinality estimation problem by itself,
you know, has set semantics.
At least that, you know, specific problem that we targeted.
So yeah, so there was the perfect model for it already.
And yeah, one limitation was still
that this original model was built for one set and we actually needed multiple sets.
So we needed one for joins, we needed one for base tables.
And yeah, so my brother Thomas, you know, sat down with me and we adapted or we created a new model,
which would, you know, consider all these features, have different set modules.
I would, you know, set up the benchmark, provide the feature set.
And yeah, and that's how it ended up being created.
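To give a feel for the architecture, here is a rough, simplified sketch in PyTorch of an MSCN-style multi-set model (this is not the authors' released code, and the dimensions are arbitrary): each set of items (tables, joins, predicates) is passed element-wise through a small MLP, the per-set outputs are averaged, and the concatenated averages feed a final MLP that predicts a normalized, log-scale cardinality.

```python
import torch
import torch.nn as nn

class SetModule(nn.Module):
    """Applies an MLP to each element of a set and averages (permutation-invariant)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, x, mask):
        # x: (batch, set_size, in_dim); mask: (batch, set_size, 1) marking real elements
        h = self.mlp(x) * mask
        return h.sum(dim=1) / mask.sum(dim=1).clamp(min=1)

class MSCNLike(nn.Module):
    """Multi-set model: separate set modules for tables, joins, and predicates."""
    def __init__(self, table_dim, join_dim, pred_dim, hidden_dim=64):
        super().__init__()
        self.tables = SetModule(table_dim, hidden_dim)
        self.joins = SetModule(join_dim, hidden_dim)
        self.preds = SetModule(pred_dim, hidden_dim)
        self.out = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, t, t_mask, j, j_mask, p, p_mask):
        z = torch.cat([self.tables(t, t_mask),
                       self.joins(j, j_mask),
                       self.preds(p, p_mask)], dim=1)
        return self.out(z)  # typically trained against a normalized log-cardinality
```

Because each set is averaged, shuffling the order of tables, joins, or predicates leaves the prediction unchanged, which is the permutation-invariance discussed below.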
That's a fascinating origin story.
It was a family affair sort of thing.
That's brilliant.
And I guess that's why it gets its name of multi-set, right? Because it was kind of taking the idea for a single set
and now we've got multi-sets, right? So, yeah, that kind of explains the name of the network.
So let's talk a little bit more kind of about how the multi-set convolutional networks actually
work then. So, like, what's the secret sauce behind them for being really good at solving this problem?
Yeah, so I think the main intuition is really this set semantics,
right? Like, you don't waste any model capacity on remembering, you know, permutations, because
they're just not important. So it's really about the, yeah, the set aspect of it. So you basically get,
you know, basically, yeah, it needs fewer data samples to converge because it doesn't, you
know, need to learn that, you know, it's actually permutation invariant, what we are, you know,
training on. So that is one thing. But, you know, the model is always, like, you know, in the center of a lot of research. And I still think the model is important. But what I think is even more important are the features: we should not only train on SQL features, right?
Like, so let's say, you know, you have your from clause, you have your where clause, your base table and the join predicates that you could use and sort of featurize and input to the model.
So that's one thing. I would call these sort of the static features that you get, yeah, with your SQL query, with your SQL string. But then there are also these dynamic features, or runtime features I think
they're called in the paper, which, yeah, as the name implies, you would only be able to collect at
query runtime or optimization time. And, yeah, I think that was the main finding, that we should also use this
runtime information. And, I think, yeah, so I talked about sampling earlier, right, when I talked
about base tables, or when we discussed base table scans. So, yeah, what a database system, like a
classical one, would do is it would take these samples, right, and then execute predicates on these samples.
And with MSCN, we did not want to really compete against that. But instead, we said, let's use it
as additional feature input. So we still do all the traditional cardinality estimation,
but we input it as additional features into the model. So we basically learn, you know, if the fifth tuple
in my sample qualifies,
how does that affect our output?
Yeah, and we basically show
that the model can capture
this information,
basically the runtime information
that we get.
Yeah, and that would improve
quality significantly.
And that's how it came along.
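To make the runtime-feature idea concrete, here is a rough sketch of how such a sample bitmap could be built (the predicate encoding and the sample are made up, not the paper's exact featurization): the query's base-table predicate is evaluated on a fixed, materialized sample, and the resulting 0/1 vector is fed to the model alongside the static SQL features.

```python
def sample_bitmap(sample_rows, column, op, value):
    """Evaluate a simple predicate on a fixed materialized sample and return a
    0/1 bitmap; this bitmap is appended to the table's feature vector."""
    ops = {"=": lambda a, b: a == b, "<": lambda a, b: a < b, ">": lambda a, b: a > b}
    return [1 if ops[op](row[column], value) else 0 for row in sample_rows]

# Made-up sample of 10 rows from a table with column "x"
sample = [{"x": v} for v in [1, 5, 5, 7, 2, 5, 9, 3, 5, 8]]
print(sample_bitmap(sample, "x", "=", 5))   # [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]
```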
Awesome. Yeah, that leads into the kind
of the question, I mean, I know it was five years ago now, so maybe you don't remember the
exact numbers, but how much better was it than the traditional approach?
Yeah, so, I mean, and that's always the problem, right? It always depends on your train and test
set, I would say. Yeah, so while we didn't test on the training data,
we still, I mean, we tested on a constrained space, right?
So basically all the training and test data was generated
and then we did our train test split,
but the test data would still follow a similar distribution.
For example, the test data would not contain
completely different predicates.
Like if, you know, in our case back then,
we only supported equality and range predicates.
So we wouldn't add, you know, LIKE predicates to the test data.
So yes, it was a constrained setup.
But for that constrained space, you know, on IMDB, I think it was like six tables back then.
It's, yeah, I mean, it was, you know, up to an order of magnitude better than what you would get with, you know, more traditional approaches.
The downside, of course, was that, you know, you had to train it.
So I think that was the main limitation that, you know, it took a while to train.
You needed a lot of training data.
And at the same time, you know, once you trained it and your data would change or your workload
would change, you know, you might need to retrain again.
So we didn't have update support at the time.
So yeah, and that said, I think back then
for this constraint problem space,
I think we did really well.
But then we also discovered a lot of limitations
to make it really practical.
Yeah, I guess, did you ever find the answer
to the question of how many German actors
played in French movies?
Well, actually, I just came up with this one.
So I don't know, but we asked many of these.
Yeah, it feels like, as you said, you were kind of aware
of the various limitations with it, but it was a nice,
well-scoped piece of work and laid the groundwork for a lot of future work then.
So I guess let's talk about the impact that this paper has had
over the past five years since it was published.
And have many of these shortcomings been addressed?
Yeah, so it's crazy how time flies, like you just said, you know.
Yeah, five years. I mean, that's crazy.
But yeah, so there has been a lot of follow-up work, actually, addressing our limitations.
I was involved in some of it, but only very limited.
So there have been many papers, for example, addressing updates, which was one of our main limitations. I still wouldn't say it's solved yet,
just because it's a very hard problem
in machine learning in general, right?
It's not something very specific just to databases.
I mean, of course, we can collect new training data and so on,
but you still got to update your model
and basically make sure that your model sort of scales
to your new feature dimensions
and so on. So yeah, so there's not that much you can do there, I think. So yeah, there's still
some update cost for sure. On the other hand, people showed that, you know, in some cases,
you might actually not need a deep net for it, right? So you can actually, you know, use simple gradient boosting techniques to get pretty close.
You might not be able to outperform it, but you can at least get pretty close.
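As a rough sketch of that simpler alternative (assuming scikit-learn; the features and cardinalities below are random placeholders): you can fit a gradient-boosted regressor on the same kind of query features, typically predicting the cardinality in log space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: query feature vectors (e.g. one-hot tables/joins/predicates plus sample bitmaps)
# y: true cardinalities collected by executing training queries
rng = np.random.default_rng(0)
X = rng.random((5000, 64))
y = np.exp(rng.normal(8, 2, size=5000))     # placeholder cardinalities

model = GradientBoostingRegressor(n_estimators=200, max_depth=6)
model.fit(X, np.log1p(y))                   # train in log space to tame the skew

estimated_cardinality = np.expm1(model.predict(X[:1]))
```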
And then one other limitation was around, you know, low selectivity cases.
Let's say, you know, I select an actor in my internet movie database who only participated
or played in a single movie, right? So I might not have seen that person during training. So
it might be very hard to estimate what that actor, you know, how many movies the actor
participated in. So that's something we cannot really, you know, solve with a supervised approach like MSCN is.
So yeah, so there have been a lot of follow-up works
in the unsupervised space as well,
like NARU and DeepDB, just to name two of them,
which, yeah, address specifically these cases.
So they basically train an autoregressive model
over the entire input.
So they would have seen, you know,
those rare cases as well.
Yeah, and finally, I would say string predicates
is something we didn't handle.
So there have been a few follow-ups.
Uncertainty estimation is something which,
you know, you actually need
before you deploy something like this in production
because otherwise, how do you know
if the model performs well enough to be deployed?
So that is something a few groups worked on.
And I think there still needs to go more research into that
to be able to trust these models fully
because, yeah, to date,
they haven't been adopted by many of the big players.
So there's one example in Microsoft Azure Cosmos DB.
They implemented a similar approach as far as I'm aware of.
And also there are other database companies
that are implementing similar approaches still today.
But yeah, so they're usually very careful with,
and which makes total sense, right?
I mean, you don't want to ship a model
where you're not sure that it performs well.
Yeah, you preempted one of my questions there.
It was going to be, have any of these techniques
found their way into commercial systems yet?
But I guess they're still sort of finding their way
and gaining confidence in these techniques
and the stability of them before they unleash them
into the wild, I guess.
Quickly on the gradient boosting aspect you mentioned as well: you can kind of get a lot of the way with techniques such as that, and I guess it's the benefit of gradient
boosting that they are quicker to train as well compared to, like, a deep network? Because that's kind
of the main advantage, that the training time is a lot smaller.
Yeah, so that is one advantage. It's also smaller, so it might better fit into RAM. But I wouldn't
necessarily say that the training time of the model is the big bottleneck in
learned cardinality estimation. I think it's much more about training data collection, for example.
I mean, of course, like if you're a big player, like one
of the big, you know, cloud data warehousing companies, I mean, you basically all have all
of this training data already, right? Just because you have this like, yeah, large customer base,
and you can potentially, you know, you know, train a model maybe for each customer on its own,
or maybe you can do something,
you know, in an anonymized form across customers or something like that. But typically, if you
don't have that, I mean, you're basically starting from scratch, right? And you've got to sort of
cold start and deploy your model. So the thing, like the approach we followed is,
we basically, I mean, back then we hard-coded the important joins.
But what you could do is you could just get that from the user.
And then for the important ones, you actually collect training data.
So you execute queries in the background with different predicates.
You get all that training data, sample the entire subspace, and then train your model on that.
And I think in that case, model training is really not the bottleneck.
It's much more about training data collection.
I had another question bouncing around my mind.
What was I going to ask?
Oh, yes.
It was about the data and the workloads changing,
and, maybe, the model's accuracy decreasing over time, so you need to do this sort of
collecting-new-data, retraining process. Is that an issue? It's maybe very workload-sensitive or data-
sensitive, but how quickly can things become stale? And what is it a function of,
primarily? How quickly your model's accuracy drops off as the data or workload is changing, or is
it, like I said, workload-dependent?
Yeah, so it really depends on your data and workload changes.
So for example, you know, if someone adds a new column to your database, there's not much you can do about it, right?
I mean, you haven't seen that, like predicates on that column before.
Likewise, you know, if you add, you know, 10% new tuples, which follow a completely different distribution, there's also not much you can do.
But in my experience, I mean, data usually follows the same distribution, right? Like
a sales database won't suddenly contain weather data. Well, maybe it might.
Yeah. But I would say it's usually following a similar distribution, so usually it should be
good for a while. And there have been a few papers studying this exact question. But, yeah, again, it
really depends. So, for example, what the model does is it normalizes column features. So, basically,
when I have a predicate on a given column, that, you know, would say, like, same example as
before, X equals five, and the column domain for X is between one and 100 at training time. But now,
at test time, you know, people ended up inserting values up to a million. You know, the model
wouldn't be able to guess that. So, yeah, it really depends on the kind of changes and whether they're breaking changes, I would say.
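A tiny illustration of the kind of breaking change he means (min-max scaling here is just an assumed encoding for the example, not necessarily the paper's exact scheme):

```python
def normalize_literal(value, col_min, col_max):
    """Scale a predicate literal into [0, 1] using the column domain seen at training time."""
    return (value - col_min) / (col_max - col_min)

# Training-time domain for X was [1, 100]; values up to 1,000,000 appear later.
print(normalize_literal(5, 1, 100))          # ~0.04  -> well inside what the model saw
print(normalize_literal(1_000_000, 1, 100))  # ~10101 -> far outside the trained range
```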
So after the learned cardinalities paper, you then went on to do a postdoc
and then eventually going on to work at AWS in their Learned Systems Group. So tell me about how
that paper then led to that. And then I guess after you were at AWS, you kind of came back to
the University of Technology Nuremberg. So, yeah, tell us about that story a bit more.
Yeah, so basically after I, you know, submitted and published the learned cardinality estimation work, I wrapped up my PhD
at Munich, joined MIT, and, yeah, so I worked, in the beginning, with Ryan Marcus and Tim Kraska, like that was,
you know, how it started on this SageDB prototype, which is, so for those not familiar, it's basically
a system that tries to, you know, implement different learning techniques in a coherent way.
So yeah, so we ended up doing some, you know, storage optimizations.
Ryan worked on some query optimization aspects.
And we created a research prototype.
And, yeah, so that went on for a while.
And then eventually we decided, you know, we could have most impact in industry because we're simply lacking workload.
And that's, I think, still nowadays a big problem.
In academia, you don't really have access to the real problems.
So we decided we could have most impact in industry
with these kinds of techniques. Yeah, so we joined, this group joined Amazon, and we founded the Learned Systems Group
led by Tim Kraska.
And yeah, and over there, it was actually very interesting to see how the real problems
actually look like after we solved the not so real ones in academia.
And yeah, so we get access to workload statistics
and we're able to build sort of new solutions
and for example, in the storage space
and also in terms of workload forecasting at AWS
and specifically at Redshift.
Yeah, and that was really inspiring.
And I think I got a lot out of that time.
So you're at Amazon and things are going well.
You've got this new group you founded
and you've got all this access to all these nice real workloads. You've seen behind the curtain and had access to all this really
cool data and stuff. So then eventually you kind of went back to academia and
you ended up at the University of Technology Nuremberg. So tell us how that happened.
Yeah, so that's a very interesting one. So I've been following, you know, the creation of UTN, as I'll call it for short, for many years now.
So it's, you know, it's a second technical university in Bavaria next to TU Munich.
So it was, you know, it's a big project by the Bavarian government.
And, you know, it's, yeah, it's been many years since the last one was founded.
I think it was 40 years ago, the University of Passau.
So another one in Bavaria.
So it's a big thing, you know, to have a new university.
So you don't get this chance, you know, multiple times in your life, I would say.
Yeah, so I was very excited from the beginning on about the creation of this, you know, new buildings, new campus,
new everything, pretty much. And, yeah, I mean, on the other hand, you know, with every new place
there's a lot of work. So you've got to, you know, build it up first, and there's really more work than
you would have in other places. But, yeah, I see it more as a chance. And, yeah, so it was sort of in the back of my mind
for many years already to, you know,
to get a spot, get a research position at this place.
And it was also one of the reasons
why I decided to do that postdoc at MIT
because I had this idea in mind.
And I was like, yeah, you know,
if this comes around, you know,
I would pursue academia. And so basically, you know, I saw the opening, I applied, and I got
the offer, and, yeah, I just couldn't resist.
Fantastic. Yeah, it must be really nice to have the
blank slate to kind of go and create something. So, yeah, while it's probably a lot of work initially
to kind of get things off the ground, this is like an opportunity you'll probably never get again, right?
Like you say, they don't make universities every day, right?
New universities.
So yeah, that's fantastic.
So I guess kind of, yeah,
what are you working on at the moment at UTN?
Yeah, so some of the learnings at Amazon
were, you know, that, you know,
we are basically working on the wrong problems,
as I said, in academia.
And it's a lot about, I think,
it's a lot about usability of these systems.
And you might know that from your experience at Neo4j.
If people cannot really use your system,
you can optimize performance however you want.
It just doesn't matter, right, at the end of the day.
Doesn't mean that performance
isn't important. So I still would say performance is a very important aspect, especially the cost
component of it. But I would say, you know, one of the most important things is you want to make it
easy to use. And this is actually something which, well, I discovered during my first class that I taught at UTN.
And yeah, so basically I had an exercise with my students where they had to import some log files into a data warehouse.
And yeah, I was basically, you know, helping them out and seeing how they're doing. And I found that, you know, they were basically, you know,
manually creating, you know, table statements and so on.
And yeah, it just took them a very long time.
And they had a, yeah, I mean, they basically just couldn't really import it,
you know, in the given time, right?
It was just, there were just too many edge cases to consider
and too many questions being asked.
So, yeah, so I just thought, you know,
it would be great to automate this process.
And nowadays with generative AI, I think it's a great chance for,
you know, for us to create easier use interfaces
to automate a lot of this.
Yeah, and so we basically started a project
that would ease the data loading aspect.
And yeah, I think my vision for that
is that you would just have,
you just give it an S3 URL
with some Parquet, CSV, JSON files, whatnot in there, and it would
just automatically figure out everything for you, right?
It would basically suggest some schema.
It would, you know, tell you, you know, that's the data you should look at.
That's what, that's how you should import it.
That's the system you want to use and all of that.
And ideally with some, you know, I mean, not ideally, but possibly with some LLM interaction, to basically have the user in the loop. Yeah.
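As a very rough sketch of one small piece of that vision (purely illustrative; pandas is assumed, the file path is hypothetical, and a real tool would need far more edge-case handling than this): sample a file, infer column types, and suggest a table definition.

```python
import pandas as pd

def suggest_create_table(path: str, table_name: str, sample_rows: int = 10_000) -> str:
    """Read a sample of a CSV file and suggest a CREATE TABLE statement from the
    inferred column types. Real log files need far more edge-case handling."""
    df = pd.read_csv(path, nrows=sample_rows)
    type_map = {"int64": "BIGINT", "float64": "DOUBLE", "bool": "BOOLEAN",
                "datetime64[ns]": "TIMESTAMP", "object": "VARCHAR"}
    cols = ",\n  ".join(f"{c} {type_map.get(str(t), 'VARCHAR')}"
                        for c, t in df.dtypes.items())
    return f"CREATE TABLE {table_name} (\n  {cols}\n);"

# Hypothetical usage (requires s3fs for S3 paths):
# print(suggest_create_table("s3://my-bucket/logs/2024-07-01.csv", "web_logs"))
```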
Honestly, you hit the absolute nail on the head there with, like, data import being one of the most horrible experiences in the world.
I mean, so many times I've come across this: I'm like, okay, I'll fire up an instance of this,
and then I spend the next three days trying to get the data into
the damn thing. Yeah, I think a few of these gray hairs I'm getting are because
of that, I think, over the years. But, yeah, you are right as well that I think
often in academia we're very much focused on performance, performance, performance, whereas
this is the other dimension of ease of use, usability, right? And it being easy to use.
you want it to run fast, right?
Anyway, that sounds fantastic.
I look forward to seeing the results of that research.
Cool.
So let's kind of take a little retrospective now and kind of look back over your career to date.
And what are you most proud of in your career?
Yeah, I think I would probably still name the cardinality estimation work.
And not because it's the most difficult or most impactful one out there,
but simply because of the fact that it was done together with
my brother.
And this is just something, you know, very special to me.
And yeah, it was also a process that I really enjoyed.
So yeah, this is the number one.
Yeah, I can definitely see that.
It must have been super nice to work with your brother on that.
Do you still actively collaborate?
Because, I mean, obviously, there's still a reasonable amount of overlap between your two fields of interest. So, yeah, is that something you're kind of going to
do more of in the future?
Yeah, so I wish I could, actually. So back then he was, you know, still
doing a PhD at the University of Amsterdam, but now, yeah, he's at
Google DeepMind, so he's busy doing fancy stuff.
Yeah, okay, cool. Always good
to have those contacts though, right? Yeah. And, cool, we can jump on to sort of the
section on motivation then, and sort of what's motivated you over your career, and what
are, or if there are, any sort of specific papers
and people who have motivated you throughout your career?
Yeah, I would say the learned indexing paper
by Tim Kraska is probably the one
that influenced me the most
or that had the most impact on my career
just because I got inspired by it, right, to do the cardinality estimation
work, to do something in the learning space, and also, which, you know, then led me to
eventually do a postdoc in that space with Tim Kraska. Yeah, so that was, I would say, the most impactful one on me personally as well.
And yeah, otherwise, I mean,
I was fortunate to be part of a very strong group
at TU Munich,
led by Alfons Kemper and Thomas Neumann.
And yeah, so I was also very inspired by their work and especially, you know, Thomas' creation of HyPer, you know, the in-memory database system that got, you know, acquired by Tableau back then.
And, you know, many of his papers were very influential at the time.
So, for example, one that I remember is the
adaptive optimization of very large
join queries, which I found
a very nice read
and I recommend
it to everyone to take
a look. He
basically showed that up to 10 joins
you can get
exact answers in join enumeration,
which I think is a great result. And yeah, so I think I was mostly inspired
by the people around me.
But yeah, I was just fortunate to be at TU Munich.
Cool. Yeah.
So I guess kind of going off motivation
and sort of things like this, what's
the best piece of advice anyone's ever given you?
I think it's really, you know, that you
should not expect your projects to all work out, right? It's really, like, research should be
about, you know, trying out crazy ideas.
That's why we are in research.
If the ideas, you know, weren't crazy,
I mean, we could also, you know, be in industry, which is, you know, not a bad thing. But in industry, you want to make sure that things actually work at the end of the day.
And you have, you know, customers paying for it.
And that's like this main privilege that we have in academia that we can do, can try out crazy things.
So I would say, you know, like the best advice I've been given is really to, you know, to be adventurous and try out things, even if they don't end up working in the end.
Yeah, just try it, and, yeah, I mean, eventually, you know,
something will stick, and it might be a success, it might take a while, but, yeah, being optimistic
and trying out crazy things, I think that's the main advice I've been given.
And I need to do more of that. I think everyone needs to. We are all a bit conservative now and then,
you know, just because we want to get publications out
and get things done.
But yeah, having that moonshot project,
I would say now and then,
I think is great advice.
Nice. Yeah, just keep iterating.
And I guess when you're in academia as well,
you have the freedom to do that, right?
You're not at the mercy of shareholders like you are when you're in industry. So, yeah,
awesome stuff. I like that, that's a good answer. And you kind of touched on a few things
there, which is a nice segue into the next question I want to ask you, and that's about:
it doesn't always work first time, and there are setbacks, and things get rejected. So I want to ask kind of what your process is for dealing with
that. So, I mean, yeah, first of all, I would say, you know, I mean, you, you get better,
like in that aspect over time. And I think I, yeah, I had many setbacks throughout my,
you know, professional, but also personal life as everyone, you know, has it. I
mean, it's, it's just that people don't talk about it. But, you know, the more experience you gather,
the more setbacks you will also gather. And I think the most important thing for me is that,
again, you know, is to stay optimistic and to stay positive. So if your paper gets rejected a couple of times,
and, you know, I had many rejections throughout my PhD and also postdoc.
And yeah, just stay positive.
And one thing I like to say is, you know, every downhill is followed by an uphill.
And I think it's just, you know, it's just
important, you just want to make sure to take enough momentum with you on the downhill,
you know, to have a good uphill. So, yeah, I think that's my approach to it.
Yeah, so, yeah, you've got to ride the roller coaster, right? That's it, there's always another wave coming along, so,
yeah, just make sure you've got enough momentum so you'll be just fine. I like that, great. Cool. Yeah, my next
question, Andreas, this is actually my favorite question that I always ask my guests
on the show, and it's about the creative process. Do you have a systematic approach
to idea generation? And then, once you have generated a set of ideas, how do you
then choose which ones to pursue? Yeah, what's your approach to that?
Yeah, it's also a very good one. So I don't think I have, you know, the approach for it. So I'm
still learning, and so far, you know, I've been, you know, inspired a lot by people
around me. So it's, you know, sometimes it's your own idea. Sometimes it's someone else's idea. So,
but I think, and that already brings me to my answer, is the important thing is, in that process,
at least for me, is to work with people, right? To basically brainstorm with a group of people,
ideally from different fields.
So because if you just talk to people in databases,
they would probably all have a similar opinion.
But if you suddenly talk to people in machine learning,
they might have some new ideas
or there might be something very interesting on that intersection.
And that's actually something really nice now at my new place, UTN.
So I'm basically the only systems professor there at this point
and everyone else is in AI and machine learning.
So it's really easy to work with, you know, AI researchers right now.
They're very approachable and the brainstorming, you know, it really works.
It takes some time to, you know, in such an interdisciplinary setting.
I mean, it's still, you know, computer science, but there are still some differences in how research is done and what is interesting, what not.
And also the background, you know, my background in systems,
their background in machine learning and so on.
But yeah, I think this interdisciplinary approach is the right one for me at least.
And especially, you know, given my current circumstances with the new school.
And yeah, I also like to involve my students in that, in the idea generation, because I think one of the most important things is that you basically have ownership of the idea.
And I think students are most motivated if they, you know, basically contribute in that stage already, right?
I mean, if they were just given some tasks to do, this is not, you know,
really inspiring, right? So you should basically give them the freedom to participate in the
brainstorming, have their own ideas, work on their own ideas, and that's how people will, yeah, be
successful.
Yeah, like get that early-stage buy-in right from them. So, yeah, that's a lovely answer
to that question. So, yeah, my next question
is about the interaction between academia and industry, and obviously you've got a very good
perspective because you've experienced both, and I wanted to get your take on what you think the
current interaction is between these two different groups at the moment and how that can be improved going forward.
Yeah, so this is something which I really care about.
And I think there's a lot that can be improved
just because, you know, as I said in the beginning,
I mean, research is just too decoupled right now.
And, you know, this might be fine for theory research,
but it's not fine, or at least not, you know, not what we want for, you know,
data systems research, which is, you know, something very practical by nature.
And we should be thinking about, you know, technology transfer and so on.
And yeah, technology transfer just wouldn't work if you, you know, work on the wrong problems
in academia. I think it's really
important to have many interactions with industry and to enable such, you know, technology
transfers. And especially in my research, you know, in ML for systems, where you need access to
data and workloads, yeah, you just cannot do the same research in academia without industry.
So yeah, so I would say it's really important
to have these projects.
You know, we as academics, we should do sabbaticals,
you know, go to these companies for a while,
come back, work on their problems.
And yeah, just have more exchange,
however that's being done.
There's obviously, you know, legal restrictions and so on,
but we got to push and we got to, you know,
try to find solutions to that.
And yeah, so that's one aspect in, you know,
like just how to work together.
But I would say there's also something
on the more technical side.
So, you know, it doesn't just, you know, mean that we need, you know,
to basically optimize for their workloads. But if we build something, it should also be something
that at least can be transferable, you know, in the future at some point, right? It shouldn't be
that, you know, we work on some prototype, which is not practical at all or would take forever to integrate into their systems.
And I think the way out there is really to, you know, to invest in open standards.
And I think Apache Iceberg, so the data lake format is a good example.
You know, we should invest in these.
And then, you know, we can sort of, you know, decouple again and work on these problems
independently. But then, you know, when integrating it, it's not a big deal. I mean, it doesn't mean
that it has to be directly transferable and useful, but at least we should make sure that
we don't go in totally different directions. So that's that part.
But at the same time,
and that's something we discussed before, right?
Like we should not just, you know,
work for industry in research, right?
I mean, there should be some interaction,
but it shouldn't be that we only, you know, focus on that.
There should also be enough time in academia to do these sort of moonshot projects, right?
Which might not be very practical today,
but they might be practical in five or 10 years.
And yeah, I mean,
if you just think about large models nowadays, right?
I mean, it's not a commodity, you know,
to train such a one,
but it might be in a couple of years.
So yeah, and in research,
we can already think about, you know, what
would happen then and how we should change things.
So everyone's been told now: we need some time to work on our moonshot projects, we also
need to work closely so we can get access to that real-world data, and, yeah, taking people,
taking sabbaticals, and, yeah, just kind of sharing more between each other, being more open, and people
getting exposed to both different settings is, I think, yeah, definitely the way to go for sure.
Cool. So let's talk about the future as well, while we're on this sort of topic.
And what do you think are the most exciting advancements?
I mean, I can maybe think of one or two already, these large language models and Gen AI and all things in that sort of space.
But, yeah, what are the most exciting advancements for you that you've observed recently?
Yeah, I think there's not much more to add because it's really about generative AI. Like,
everyone's thinking about it, and especially, you know, me being at this new school now with
all the AI researchers, I mean, there's just so much, you know, so much interaction you can do. And
yeah, you can, you know, basically, well, you can go two directions, right? You can think about
integrating them into data systems, making them easier to use, for example. But you can also think
about, you know, speeding up model training and so on. And that's also something we started looking
into, like, how can we help them with our system knowledge
to accelerate or improve their systems?
And I think one example is, you know, in training,
I mean, you basically got to keep all the model weights
in GPU memory, right?
So yeah, there are many quantization techniques out there.
And yeah, so that's like a little project
that we just started, you know,
to look into how can we, you know, compress these weights even further with the knowledge we have in
databases. So that is one thing. But yeah, I mean, especially now at UTN, we also have a lot of
people working on robotics. And that is something I never thought of before, but now, you
know, when I go into the office, I mean, I see those robots, right? So you get
inspired by it. And apparently there are also, you know, foundation models for robotics nowadays.
So, you know, you can basically, you know, train a robot on, you know, data sets and basically simulate its behavior
and then transfer it to a real robot.
And yeah, so they also require a lot of training data,
a lot of video data and so on.
And yeah, I'm also thinking about
how can we help them to basically
speed up their training pipeline
and make these
systems more efficient.
Awesome stuff. Yeah, it feels like we're at a very sort of critical juncture for
Gen AI, and I kind of wonder what the world's going to look like in 20 years. And, I mean, you look back
at kind of how the internet's changed the world, right? I mean, you wonder in what ways, positively
and maybe negatively as well, these Gen AI systems, or Gen AI, is going to change the world. It's going to be interesting to see for sure. But, yeah, we'll
end things there. Andreas, thank you so much for coming to talk to us today. It's been a
fascinating chat, and I'm sure the listener will have thoroughly enjoyed it.
Thanks for having me, Jack. Yeah, it was a pleasure.
Fantastic, we'll end it there then. So thank you very much for
coming on, and, yeah, we'll see you all next time for some more awesome computer science research.