Disseminate: The Computer Science Research Podcast - Immanuel Haffner | mutable: A Modern DBMS for Research and Fast Prototyping | #21
Episode Date: February 6, 2023. Summary: Few to zero DBMSs provide extensibility together with implementations of modern concepts, like query compilation for example. This is an impeding factor in academic research. In this episode, Immanuel Haffner presents mutable, a system that is fitted to academic research and education. mutable features a modular design, where individual components can be composed to form a complete system. Check out the episode to learn more! Links: Paper, Website, Mutable GitHub repo, Bobby Tables xkcd. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
This is the first episode in our CIDR series, and I'm delighted to say I'm joined today by Immanuel Haffner, who will be talking about mutable. His research interests are databases, and specifically he has worked on query processing in the past.
Immanuel, thanks for joining us on the show.
Hello, thanks for having me.
Great stuff. Well, let's jump right in then.
So can you start off by telling the listeners how you became interested in database research?
Okay, yes, it was kind of an accident, perhaps you could say that,
because I graduated in compilers, actually.
So I did my bachelor's and master's in compilers.
I was working on Clang and LLVM.
Sadly, my professor didn't have an open position for me as a PhD student.
And it turned out that in my last semester of my studies,
I was taking the database class of my now boss, Professor Jens Dittrich.
And so this database class was actually really interesting.
And I really liked the way that we connect theory to practice.
And also there was quite a lot of coding
involved and I liked that. So I graduated best in my class, and then Jens actually approached me
and asked whether I would like to work for him as a research assistant or something.
And I was laughing and saying, no, actually, I'm done.
We could talk if you had a PhD position.
I was making a joke.
And then I got a mail a few weeks later from him inviting me over for a chat about that position.
And I wasn't really convinced I would like to do that.
But then we had this chat and it was a really good offer.
And I was also eager to work in a domain where we have compilers and databases collaborating, like query compilation.
So that's how I got into databases.
That's a really nice origin story. I like that. Do you find that there are plenty of things from your experience with compilers that you can apply? I mean, maybe we can touch on this over the course of this conversation, because I feel like that influence shows in this work a little bit. But did you find that a lot of transferable skills from that applied very nicely to databases?
Yeah, definitely.
For one, I can say that at the time when I was starting my PhD,
compiling to LLVM was kind of the hot topic.
And I knew LLVM very well.
So I was able to pick that up quite easily.
So, yeah.
Okay, great. So can you set the scene for your paper?
All right. So the introduction is kind of my own story, rephrased as someone else's. So
when I started my PhD at this group, no one there was working on a complete system.
So people were doing fundamental research under lab conditions, I would call it,
like writing the code just for particular experiments.
And this means if you have a new research idea,
it's kind of hard to give it a quick try
because you don't have a system at hand that you could
just plug your idea into, right? And you also suffer from the fact that there's no one next door, familiar with the system, whom you could just ask for help. This is one problem that I spotted.
And the other problem is that if you're a new PhD student, you have to come up with ideas, right?
But this can be really hard.
And if you are working on a system or you have a system at hand, then it may have flaws.
It may have issues.
And this can be a spark of inspiration and lead you into your PhD.
And I would say that during my time as a PhD student now, I've spoken to some people who shared this experience.
So how does this lead to the introduction of the paper?
In the paper, I'm telling the story of Bobby Tables,
who you could think of as being me,
because we witnessed the same story.
So consider Bobby Tables, who is pursuing a PhD and just started recently.
And Bobby is working or pursuing a PhD in database research.
Now, Bobby has this idea that he or she might want to implement and give it a try, but has no database system at hand.
So what Bobby can do is go to the database of databases, or in short, the dbdb.io.
And there you have a list of, I think it's 877 databases.
And you could just type in search queries for different categories of databases that
you're looking for. So Bobby could just start hacking into the search engine
what kind of system Bobby's looking for
to find a system where Bobby can implement the ideas.
And there are plenty of systems, as I've said before,
so you would guess there will be one that fits your needs.
But as I do this in an introduction, I show that if you have particular interests, then
you're only stuck with very few systems.
And when you just have like two systems available in the end, then, well, you're stuck with
design decisions that are baked into these
systems. And another problem that arises is not only that you have that few systems available
for the particular kind of research that you might be interested in, but also I found that
documentation is lacking. And I have to explain this because when I say documentation is lacking,
I don't mean that these systems usually come without documentation.
Like if you consider, for example, PostgreSQL, of course, there's documentation.
There's plenty of it.
But this documentation is targeted more towards database administrators and database users.
But Bobby is a researcher.
So he or she needs guidance on how to implement the idea in the system, and I had the feeling that this is lacking.
And now, if you're wondering why the main character of the introduction is called Bobby Tables, this is actually a name taken from xkcd, and I also looked it up. It's xkcd number 327, and I really like that xkcd. It's quite funny. It's about SQL injections; you can give it a read. And also, what's quite nice about the name is that Bobby is both a male and a female nickname, so it's also a way of inclusive writing, I would say.
Amazing, thanks. Yeah, I mean, that experience that you described at the very start of your PhD was the exact same as mine.
I mean, I know it's true of most PhD students in databases, right, where you're kind of left with this scenario of building a very small prototype to show your thing.
But then that makes comparison very hard, because you have to reimplement things, and it's a bit of a mess. And if you go down the other route, taking an existing system, documentation on the internal components of that system is pretty much non-existent, right? I mean, there are a few systems where it's okay, but by and large it's really difficult. These systems are so complicated, especially some of the older ones, and they've evolved over such a long period of time, right? So it's very hard to go in there and actually run your experiment, right?
Or do what you want to do with it.
Let me tell you a fun story.
You know, I'm listening sometimes to the Stack Overflow podcast.
And on one of these episodes, they said that the founding idea behind Stack Overflow was that documentation is a myth.
There is no documentation.
Everyone claims to do it, but it's never sufficient.
So the radical new idea was: instead of having documentation, let people discuss the problems in an open forum, and you can just search that forum, and this forum becomes the actual documentation of the software. Genius idea, right? Incredibly successful.
Right, I mean, like self-generating documentation in a way, right? I mean, where the hell would we all be without Stack Overflow? It's a crutch I rely on most days. So yeah, that's really cool.
Great.
So obviously a new paper,
we've set the scene here.
We've gone through the story of Bobby Tables and come to the conclusion
that you're going to design a database system
that allows us to basically solve all of these problems, right?
And it's called mutable.
And can you, first of all,
maybe tell us about like the design goals for this idealistic database management system for Bobby Tables?
Yes. So when we say design goals, let me just clarify that now I won't talk about the software design goals, but more of like the conceptual design goals.
So how would we want to help Bobby Tables and put him or her out of his or her misery? I would say that what we wanted to do is to build a new system that is primarily targeted at researchers and developers. So the system is not designed to be a product in the end, right? It should be designed to be used for research and prototyping.
So that's why the title.
Yeah, it has to serve as a unifying framework for database research.
And I find this is quite an important aspect because if you're doing research, at some
point, you need to evaluate your work and you need to also compare your work to related work.
And this can also take a lot of time because you have to get this related work.
It may reside in some codebase that you have no access to, or the codebase is outdated and the code isn't compiling, or whatnot.
And many things can go wrong.
So just doing this comparison to related prior work is already demanding.
And so, what I want to say is that Mutable should serve as a unifying framework where we can have multiple implementations of various research ideas coexist in one system.
And you can just flip a switch and then you use the alternate algorithm or approach. Then another design goal is that we impose as few
design decisions on the developer as possible, such that the developer is free to pursue any
alternative path for solving a particular problem. And I mean, this is also kind of a limitation that we will probably discuss
later. So not going too much into detail now. We want mutable to be flexible and to be configurable
by the developer. By flexible, it kind of means the same thing as before with design decisions: we are flexible in the sense that you can implement many different kinds of approaches.
And by configurable, I mean that you have this ability to flip a switch and mutate the system behavior.
And that's also why we named it mutable.
And because there's a table in the name.
It's a good name.
Thank you. Lastly, Mutable should provide documentation that targets developers and eases onboarding.
And this should at least be the getting started section of a Markdown document where you know how to download and compile the project.
But it should go beyond that and instruct people on how, if you're interested in providing a new algorithm for a particular part of the system, you would go about it.
Cool, I love the sound of this system. So we mentioned earlier on that there were other systems on dbdb.io. Has someone else tried tackling this problem before? And if that's the case, how far do those systems go towards achieving these design goals? What's missing from them, and what's the gap that mutable fills?
Yeah, that's a tough question. So of course, we did have a look at related work.
And what we found there is that this, I mean, we're not there yet, but this modular design of mutable, this has been explored years back.
And, yeah, we had a look at two systems in particular, named Genesis and Exodus. Genesis explored this modular design of a database system but focused heavily on storage management. The other system, Exodus, is more of a toolbox providing many building blocks for a whole system, and then you have to plug them together to make a whole. I think mutable is not trying to be this toolbox. It's more of this Genesis approach, where we have this modular design, but we would like to extend this to the whole system, so that every part of a database management system can be unplugged and replaced by an alternative implementation.
Cool, cool.
So, following on from this then, let's dive into mutable a bit more. There are five sections in your paper, and each one outlines a different problem, the vision for it, and gives an example of how mutable approaches solving that problem. So let's start off with components: give us the problem, the vision, and the example.
Okay, so as I've said, we want to do a modular design, and we didn't call them modules. We called these things components, perhaps because I didn't like the name clash with the new, upcoming C++ modules.
I don't know.
I just preferred the name components.
Conceptually, it's the same.
What we want to achieve in Mutable is that we kind of dissect a whole database management system into its individual parts, the components.
And now we can provide alternative implementations
for either of those, plug them back into the system,
and they just cooperate and they work together.
And we have four software design goals in Mutable
that we also want to achieve through this component design.
So let me just elaborate the software design goals. First of all, we have extensibility.
Our system should be extensible, meaning that you can always add another algorithm that solves a particular problem. And I would say it's kind of different to a plug-in mechanism. For example, in Postgres you can define your own data types, something like that, through plugins. It's not the same thing. Extensibility here means you have a part of the system for which there might already be an implementation, but now you can provide another one. Then we have, as a software design goal, separation of concerns. Here we mean that if you design one component, you should not need to know about
how any other component is being implemented. Merely the API of the other components might be of interest to you in case you make use of it.
So let me give you an example.
For example, if you do plan enumeration, what we call it, I think a more common term would be join ordering,
where you have to just come up with different join orders.
To do so in an efficient and meaningful way, you would have to know cardinalities of
subproblems. And there will be a component that does exactly this cardinality estimation
in mutable. So now this plan enumeration component will use the given cardinality
estimation component to estimate those cardinalities. However, there can be various
implementations of this cardinality estimation component. You could use histograms, you could use,
I don't know, neural networks, or what we have done actually is some product networks,
which are also machine learning approach, but they are placed in the domain of interpretable ML,
and that's kind of cool. So no matter the implementation of this cardinality estimator component,
the plan enumeration component
should be separated from that.
And that also guarantees us
that we can just replace one thing
and not break the other things in the system.
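To make this separation-of-concerns idea concrete, here is a minimal C++ sketch of how a plan enumerator could depend only on an abstract cardinality estimator interface. The class and method names are hypothetical, not mutable's actual API.

```cpp
#include <vector>

// Hypothetical component interface: any estimator (histograms, SPNs, ...)
// can be plugged in behind it.
struct CardinalityEstimator {
    virtual ~CardinalityEstimator() = default;
    // Estimate the result size of joining the given subproblem (relations by id).
    virtual double estimate(const std::vector<unsigned> &relations) const = 0;
};

struct HistogramEstimator : CardinalityEstimator {
    double estimate(const std::vector<unsigned> &relations) const override {
        /* histogram-based estimate would go here */
        return static_cast<double>(relations.size());
    }
};

// The plan enumerator only sees the interface, never a concrete implementation,
// so estimators can be swapped without touching the enumeration code.
struct PlanEnumerator {
    explicit PlanEnumerator(const CardinalityEstimator &ce) : ce_(ce) {}
    void enumerate(/* query graph */) {
        // uses ce_.estimate(...) to cost candidate join orders
    }
private:
    const CardinalityEstimator &ce_;
};
```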
Okay, so our third design goal
is kind of the software design goal
that you find in every piece of software.
It's called abstraction.
Who would have thunk?
But let me just try to make this a little bit more expressive.
With abstraction, I want to achieve that we define a nomenclature in the system that fits our academic terms,
like what we learn in the lectures,
what we use in teaching,
such that you have a very easy one-to-one correspondence from your lecture materials
to the implementation and the code.
And this, of course, this makes it,
if we would achieve that, if we can achieve that,
it makes it very easy for newcomers
to start working with the system.
And then our fourth design goal is actually a follow-up on the third design goal, because usually abstraction comes at a cost.
So now our fourth design goal is abstraction without regret.
And also here, I think I should elaborate a little bit on what this regret part is.
So if you think of abstraction in an object-oriented programming language,
then you can say you define an interface and this interface defines some method.
And now you can have classes implementing this interface.
And these classes overwrite this method. some method and now you can have classes implementing this interface and these
classes overwrite this method. Now if you have the pointer to this to an
object of this interface and you call a method at runtime you now have to deduce
the runtime type of the object. So it's called a dynamic dispatch and a
common implementation of virtual
methods which are then compiled using V tables for example. And this overhead is the regret part of
abstraction. And it can be a pain if you have functions or methods such as virtual methods
that are being called frequently but doing only little work.
And there is one particular use case in database systems where that occurs, and these are interpreters.
So query interpreters, they suffer from this overhead.
And this has led away from the initial Volcano iterator design
of query execution towards this vectorized interpretation.
So there you compensate for this call overhead by just processing a batch of values.
So yeah, these are these four software design goals that we want to achieve in this component design.
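To give a concrete feel for that regret, here is a small, hypothetical C++ contrast between dynamic dispatch (a virtual call paid once per tuple) and a template-based alternative that the compiler can resolve and inline. The names are illustrative, not mutable's code.

```cpp
#include <cstddef>
#include <vector>

// Dynamic dispatch: one indirect (vtable) call per element -- the "regret".
struct Predicate {
    virtual ~Predicate() = default;
    virtual bool matches(int value) const = 0;
};

std::size_t count_matching(const std::vector<int> &col, const Predicate &p) {
    std::size_t n = 0;
    for (int v : col)
        if (p.matches(v)) ++n;   // virtual call on every element
    return n;
}

// Static dispatch: the predicate is a template parameter, so the call is
// resolved at compile time and typically inlined -- no per-tuple overhead.
template <typename Pred>
std::size_t count_matching_static(const std::vector<int> &col, Pred p) {
    std::size_t n = 0;
    for (int v : col)
        if (p(v)) ++n;
    return n;
}
// usage: count_matching_static(col, [](int v) { return v > 42; });
```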
Cool. I just wanted to roll back a sec to ask: what are the components? Where did you decide the boundaries would be in a database system, and how easy was it to come to that decision? Was it obvious that you were going to split it up into storage engine, plan enumeration, etc.? And were there other ways in which you could have divided the system up?
Okay, that's a very interesting question.
Some parts were kind of obvious, and some parts are not even done yet.
I have to be honest here.
I mean, this is a work in progress. So for many things, we don't have a solution yet.
Yet, that's the key word.
Yes.
Consider, for example, concurrency control.
So modern concurrency control, you would say perhaps multiversion concurrency control.
And then you look at implementations of those, you have these chains of versions of a tuple.
And usually you need to integrate this with your storage engine.
And this would kind of violate the separation of concerns part, right?
So I don't have an answer yet to how we will do this.
But our research agenda is to find a solution
to this problem of designing
a multiversion concurrency control system
without having to bake into the storage engine
that you have these version chains.
So it's an open problem.
But I think this is also quite interesting
to have a look at.
Yeah, it's definitely an interesting problem.
And you've actually preempted one of my questions I had written down to ask you earlier on was about concurrency control
and how you would solve problems associated with it like the one you just said.
But I look forward to seeing how you tackle that.
Oh, actually, on the components, we didn't enumerate what the components were.
So maybe you could give us just a 30-second rundown of the different components.
All right.
So the components that we have implemented in the system, and for which we also provide alternative implementations, are the data layout component.
So data layout describes how you would lay out the data that is given by a table schema in a linear address space.
And so the data layout component is actually not really a component in the sense of the other components, but it's more like the API that we use to communicate between a particular
data layout and the remainder of the system.
So how it works is that we have kind of like a domain-specific language
in which you can express how you go about laying out data
in a linear address space.
For example, for a given schema, you could describe how you would implement a row layout
by computing padding between the attributes and then padding between the rows.
And then you'd give this back to Mutable and now Mutable understands, aha, so this is how I'm going to address the individual attributes and the individual rows of this table. And it's a
recursive way of formulating this layout, so we can also express a PAX layout, or even PAX within PAX.
It's still not feature complete.
So for example,
what we really would like to do is support the Arrow format, but for the Arrow format, we also need variable-length arrays.
And this is something that we don't have yet,
but yeah, that's it about this data layout component.
And when I later come to another component,
we will also see how this is later picked up in the system and used.
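As a rough illustration of what computing padding between the attributes and between the rows means for a row layout, here is a small C++ sketch: each attribute offset is rounded up to its alignment, and the row stride is rounded up to the largest alignment. This is a conceptual example only, not mutable's data-layout DSL.

```cpp
#include <cstddef>
#include <vector>

struct Attribute { std::size_t size, align; };   // e.g. a 4-byte INT: {4, 4}

struct RowLayout {
    std::vector<std::size_t> offsets;  // byte offset of each attribute within a row
    std::size_t stride;                // bytes from one row to the next
};

static std::size_t round_up(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

RowLayout make_row_layout(const std::vector<Attribute> &attrs) {
    RowLayout layout;
    std::size_t offset = 0, max_align = 1;
    for (const auto &a : attrs) {
        offset = round_up(offset, a.align);          // padding between attributes
        layout.offsets.push_back(offset);
        offset += a.size;
        if (a.align > max_align) max_align = a.align;
    }
    layout.stride = round_up(offset, max_align);     // padding between rows
    return layout;
}
// address of attribute i in row r: base + r * layout.stride + layout.offsets[i]
```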
So then the next component that we have implemented is cardinality estimation.
And for cardinality estimation,
we actually have only one implementation at the moment,
which are sum-product networks.
And these are a really interesting way
of gathering statistics of tables.
And these basically work in a fashion
that you take a table
and now you have two ways of modifying it.
You can either partition it
horizontally, so building clusters of rows, or you can partition it vertically, splitting
attributes from other attributes. And so then this algorithm for the sum-product network builds a tree of nodes where in each level you either split rows or you split columns.
And it's done in such a way that it understands correlated columns and eliminates this correlation through this clustering process.
I mean, it's hard to wrap up,
but what's really cool about it is
this model learns correlations between columns
and you can exploit that by actually
attaching the number of foreign keys
that match to a primary key
and then learning on this kind of extended column. Then you can also catch correlations over foreign-key joins, and that's really powerful. And what's cool about the sum-product networks is, I mean, this is not our work, right? We just picked it up and said, let's implement it in this database system. Now, what we did is we used Eigen, the vector and matrix library, to do a kind of efficient
implementation. And there we spotted a neat trick. So we can actually tune
these SPNs with just one configuration to just behave like
regular histograms. And so we just have histograms for free.
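To give a rough picture of how a sum-product network answers an estimate, here is a minimal C++ sketch: leaves hold per-column distributions, product nodes combine vertically split column groups by multiplying, and sum nodes combine horizontally split row clusters as a weighted mixture. This is a conceptual illustration, not mutable's SPN implementation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// A query point: one value per column of interest (simplified to doubles).
using Point = std::vector<double>;

struct SpnNode {
    virtual ~SpnNode() = default;
    virtual double probability(const Point &p) const = 0;
};

// Leaf: a univariate distribution over one column (e.g. a histogram).
struct Leaf : SpnNode {
    unsigned column = 0;
    double probability(const Point &p) const override {
        (void) p; /* look up p[column] in this column's histogram */
        return 0.1;
    }
};

// Product node: children cover disjoint column groups (vertical split),
// treated as independent, so their probabilities multiply.
struct Product : SpnNode {
    std::vector<std::unique_ptr<SpnNode>> children;
    double probability(const Point &p) const override {
        double prob = 1.0;
        for (const auto &c : children) prob *= c->probability(p);
        return prob;
    }
};

// Sum node: children are row clusters (horizontal split), combined as a
// weighted mixture; weights are each cluster's share of the rows.
struct Sum : SpnNode {
    std::vector<std::unique_ptr<SpnNode>> children;
    std::vector<double> weights;   // sum to 1
    double probability(const Point &p) const override {
        double prob = 0.0;
        for (std::size_t i = 0; i < children.size(); ++i)
            prob += weights[i] * children[i]->probability(p);
        return prob;
    }
};
// A cardinality estimate is then roughly probability(point) * table size.
```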
Nice.
Cool.
Then next we have cost functions.
And here we have two kinds of cost functions.
So first we have logical cost functions, or algebraic cost functions. We just take a look at this algebraic tree of operators, how you join the relations together, and then we compute logically how costly a plan is. This is mostly concerned with the cardinality estimates that you get and the logical type of operation; if you have a join, then you consider the input sizes, things like that.
And then later, during the optimization, we also do a physical optimization step, where we already have computed a tree shape, but now we have to decide, for every node in the tree, which physical operator we will use. And here we actually have linear regression models that are automatically trained on the system through benchmarks, such that Mutable learns how efficiently certain operations run on the system. And then it learns these cost functions for the physical operators, and these can be used at physical optimization time.
Then we have plan enumeration.
That's the name that we gave it.
The common term is join ordering.
We just extended it to plan enumeration because there can be other things than joins that need to be considered, but basically it's join ordering. And here we implemented many of the textbook algorithms, like the dynamic programming algorithms DPsize, DPsub, and DPccp, and also more recent top-down, but still dynamic programming, algorithms: MinCutAGaT (advanced generate and test) and MinCutBranch. And then we also have our own algorithm, which is based on heuristic search, and I think we will briefly discuss it at the end of the session.
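The textbook dynamic-programming enumerators mentioned here all share the idea of building optimal plans for subsets of relations bottom-up. As a rough orientation, here is a minimal C++ sketch in that spirit, deliberately simplified: it ignores query-graph connectivity, cross products, and physical properties, and the cardinality and cost functions are hypothetical placeholders, so it is not mutable's implementation.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical placeholders: a real system would query its cardinality
// estimator and cost model here.
static double cardinality(uint32_t subset) {
    double card = 1.0;
    for (; subset; subset &= subset - 1) card *= 1000.0;  // pretend 1000 rows per relation
    return card;
}
static double join_cost(double left, double right) { return left + right; }

// best[S] = cheapest cost to join exactly the relations in bitset S.
std::vector<double> plan_costs(unsigned n) {
    std::vector<double> best(1u << n, std::numeric_limits<double>::infinity());
    for (unsigned i = 0; i < n; ++i)
        best[1u << i] = 0.0;                         // single relations cost nothing
    for (uint32_t S = 1; S < (1u << n); ++S) {
        if ((S & (S - 1)) == 0) continue;            // skip singletons
        // enumerate proper, non-empty subsets S1 of S; S2 is the complement within S
        for (uint32_t S1 = (S - 1) & S; S1; S1 = (S1 - 1) & S) {
            uint32_t S2 = S & ~S1;
            double c = best[S1] + best[S2]
                     + join_cost(cardinality(S1), cardinality(S2));
            if (c < best[S]) best[S] = c;
        }
    }
    return best;                                     // best.back() is the full-query cost
}
```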
Cool. And then there's the query execution component, and this was actually the origin of mutable, so to say. So in the query execution component we have one minor implementation that is deprecated already. It was a stupid interpreter, and I really want to get rid of the thing. And then we have the compilation of SQL to WebAssembly and compiling that to x86 using the WebAssembly VM from Google. And that's actually the initial project that started Mutable.
These are the components.
Fantastic.
So just to summarize there, we've got query execution, storage data layout, plan enumeration (or join ordering, depending on your preference of terms), cardinality estimation, and how you go about doing cost functions. Cool, awesome. So let's move on to the next section in your paper, the one where you talk about, I mentioned it a little bit there a second ago, code generation. So what's the problem? What's the vision? You mentioned slightly how it's tackled, but can we dig into it a little bit more maybe?
Of course, let's do it.
So we do code generation in kind of two fashions. First of all, we use code generation for this abstraction-without-regret part. So we want to get rid of this regret, and one way of doing it, kind of the easy way, is using metaprogramming through your compiler. Mutable is written in C++, so we can use templates, and templates are just a way of metaprogramming, so we can compose different parts of the system at compilation time of the system. That's one way to do it. Of course, this doesn't always work, and there are situations where we have to compose code at runtime of the database system, but at compilation time of the query. And this is where the compilation of the query comes in. I mean, this was already explored in the very first relational database system, right? System R already had compilation of queries to machine code.
They just abandoned it.
I think if I remember correctly,
it was because it wasn't maintainable at the time
because they didn't have a compilation framework like LLVM.
And then Thomas Neumann picked this up.
In 2011, I think it was,
there was this paper,
Efficiently Compiling Efficient Query Plans. And there he uses LLVM,
which is like the most famous open source compilation framework,
compiler framework.
And so he uses it in his system, HyPer, to compile queries.
And it works wonders.
So now you can compile whole pipelines into tight loops.
And you can pass values in registers rather than through memory.
And it's a big gain on efficiency.
However, LLVM is not a JIT compiler, in my opinion, at least.
So, of course, there is a tool in LLVM to do just-in-time compilation, for sure.
But LLVM wasn't designed as a just-in-time compiler as its primary goal.
And if you look at the LLVM code, you quickly realize that this design of LLVM is highly dynamic. There are many pointers in there, many dynamic data structures
in there, because its main purpose is analysis and transformation of code. And to enable this
transformation, you need these dynamic data structures. This is probably not what you would see
in a just-in-time compiler.
And so the just-in-time compilation with LLVM,
you still see that there are high compilation times.
And this is where we actually had a discussion
at the reading group of the compiler's chair
at my university.
And so we had this outrageous idea of just compiling SQL to JavaScript, because JavaScript is a highly dynamic language which is being just-in-time compiled in every browser. I mean, it was a funny idea, but I never gave it much thought until WebAssembly was released. And then I was like, oh yeah, WebAssembly, that's really the game changer here, because now we have a fully typed low-level language, and that could do it. And so I started writing code generation to WebAssembly, and we hacked V8 into Mutable.
So we embedded V8
and we had to actually patch it a little bit.
You can read it in the paper.
And V8 is like a toolbox that works just like magic. You provide it this WebAssembly code, and it compiles it to x86 and executes it.
But it does so much more because V8 actually ships with two compilers.
It's called tiered compilation.
So you have one compiler that does a single pass over the WebAssembly code and compiles it immediately into non-optimized x86. And I mean, it can be slightly optimized,
but the critical part is
you don't have optimal register allocation
because that's really a difficult optimization problem.
And so you have this fast compilation tier,
which does a single pass linear time
over the WebAssembly code,
produces x86,
and you can just run it.
Low overhead, low delay.
Now, V8 in the background runs an optimizing compiler called TurboFan. It does optimizing compilation and produces optimized x86. And then, because it's a VM and runs and instruments your code, it can just, behind the scenes, switch from the slow-running, non-optimized code to the optimized, fast-running code. And you get this for free, and that's really good. And if you look at Umbra, which is a new system from Thomas Neumann, where they actually built their own IR and their own just-in-time compiler, that's a lot of work to get this.
And if you can just
take V8 and
WebAssembly as an IR, you get
basically these features for free without
this development effort.
That sounds a bit like a free
lunch there, really.
Taking V8. That's awesome.
So this is what's going to appear at EDBT, right?
Yes, exactly.
What's the paper called, what's the title of it, so the listener can look out for it?
It's called A Simplified Architecture for Fast, Adaptive Compilation and Execution of SQL Queries.
There you go, listeners, you've been told: go check that paper out when it gets published.
And let me add to that.
Yeah, this paper is already outdated. Okay, so what we did in the meantime, what you see in the paper, the technique is the same, the way we do this is basically the same, but what we changed now is how we implemented it. Earlier, we had to write the code generation kind of directly, like writing a compiler for SQL queries. Now what we did is we wrote a domain-specific language that's embedded into C++, that looks like C++, has almost the same syntax, similar semantics. You can just write it in between your C++ code, but when it's being executed, it actually generates the WebAssembly code in the background. So now this is another way of metaprogramming. It looks like you're writing a regular program, but executing it actually produces the code. And so this is one way that we hope we can make code generation much more accessible to the database community.
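To convey the style of such an embedded DSL, here is a tiny, hypothetical deep-embedding sketch in C++: overloaded operators on expression objects record an expression tree (which a backend could later lower to WebAssembly) instead of computing a value. This is only meant to illustrate the metaprogramming idea; it is not mutable's actual DSL.

```cpp
#include <memory>
#include <string>

// An "Expr" does not hold a value; it holds a recipe for producing one.
struct Node {
    std::string op;                    // e.g. "const 42", "add", "mul", "load price"
    std::shared_ptr<Node> lhs, rhs;
};
struct Expr {
    std::shared_ptr<Node> node;
};

Expr constant(int v)               { return {std::make_shared<Node>(Node{"const " + std::to_string(v), nullptr, nullptr})}; }
Expr load(const std::string &name) { return {std::make_shared<Node>(Node{"load " + name, nullptr, nullptr})}; }
Expr operator+(Expr a, Expr b)     { return {std::make_shared<Node>(Node{"add", a.node, b.node})}; }
Expr operator*(Expr a, Expr b)     { return {std::make_shared<Node>(Node{"mul", a.node, b.node})}; }

// Looks like ordinary arithmetic, but executing it builds an expression tree
// that a code generator could then emit as WebAssembly instructions.
Expr price_with_tax() {
    return load("price") * (constant(100) + load("tax_percent"));
}
```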
Fascinating stuff.
I just wanted to make a real quick point as well.
And you said that they'd already tried this back in System R in the 70s.
And I'm pretty sure that is a rule for all ideas in databases since then.
They already tried it in the system.
All the good ideas were taken straight away in the 70s.
But no, they were probably a bit ahead of the time.
Cool.
I guess the next problem that you mentioned in the paper is physical optimization.
So what is the problem here?
Yeah, so physical optimization, I mean, maybe it's a bit of a vague term. You could think of join ordering, but now consider physical operators. Usually, if you would just do join ordering by the textbook, or how you would do it in a lecture, then you would just say, okay, I have my algebraic join operation, I have some very abstract cost model for this operation, and using that, I now just compute an optimal plan. And doing so is already a complex problem; depending on the cost function, it's already NP-hard. But now, if you change from this cost function, which is just algebraic, to physical implementations, meaning that you have not just one join operation but various implementations, and just considering binary joins, the complexity already explodes exponentially. And this is a problem,
but on the same side,
it's important to consider
these physical implementations
because you can exploit some things.
For example, if you know that a stream of data
after some operation is sorted,
it makes a merge join operation cheaper, right?
So that should be considered
during the optimization process.
So how we tackle the problem
of this highly,
very hard optimization problem in Mutable,
how we tackle that
is that we first do
a purely algebraical optimization step.
So we take this query,
we represent it as a graph,
then we compute a join order, so to say. This is what we call the plan enumeration. And this is done using
an algebraic cost model. And what we compute is a partial order. It's a tree. And the tree
dictates a partial order of the joins. Now with that tree, we go into the physical optimization step.
And in this physical optimization step, now we only determine two things. For the nodes in the
tree, we determine which physical implementation to use. The children of each node can still be
permuted depending on the operation. So if you have a commutative operation like an inner join, then you can just permute them, right? And by permuting them, you decide which one becomes the build and which one becomes the probe relation for a hash join, for example. This is also still not determined by the algebraical optimization, but now we can determine an order using physical optimization, because, for example, a hash join usually wants to build the hash table on the smaller relation. And what's actually quite funny is
that now after we have determined this tree structure during this algebraical optimization
and we are now left with this physical optimization step, the physical optimization step can now actually be done in linear time. Because you could say, okay, but there are so many dimensions of this optimization problem. Yeah, but if you think about them, they're all constant in size. The only thing that remains dynamic, depending on the
problem is the height of the tree.
And so this becomes a linear
time optimization problem, actually.
So, I mean,
I guess building
on this, how
is this...
So this is implemented in mutable at the moment, and all this functionality exists in there at the minute? Or is this still sort of... Are we still in vision territory here?
At the moment, it's not on the main branch.
Okay, right.
Yeah, this physical optimization is currently a master thesis of one of my master's students. Parts are there, parts are missing.
Okay, so very much coming soon then, but one to look forward to.
Yeah.
Great stuff.
So I guess it's moving on now to the next sort of way you've divided this up in your paper.
You've got a section about the way that you've tackled physical data layout independence.
Now, what's the problem here? And as we've done with a few previous sections, what's the vision and the approach? And maybe an example as well.
Yeah, so physical data independence is one of the key selling points or key optimization techniques in databases because the user describes the schema, but how you actually
implement it, how you lay it out on memory or on disk, this can still be decided by the
system.
And this creates this independence between this implementation and the description of
the schema by the user. So now what you can do about this,
you can organize data in different ways on your storage,
be it main memory or on disk.
And I mean, we have seen this in the past
because initially we had these relational database systems.
Naturally, I would say they organized their data in rows of tuples.
Then came the column stores
that just did it the opposite way,
using a columnar layout.
And this layout is actually
much better consumable by your hardware
because now we have these SIMD extensions
which can just batch process
homogeneous chunks of data, and a columnar layout just provides these chunks.
And then we have hybrids like PAX, for example.
And depending on what operation you're doing, one layout may be more suitable,
the other may be less suitable.
So, for example, transactional processing may benefit from
a row layout, whereas analytical processing may benefit from a columnar layout. And we don't want
to impose this design decision on the user of mutable. So we thought of how can we just decouple
this from the system and have this like a tuning knob. And it's actually quite
difficult because if you implement a system with a particular layout, you can just hard code your
whole system for that particular layout. Now, if you want to make this configurable,
then loading the data suddenly becomes slow because at runtime of the database system, you now have to consider what is my configuration and then how do I load the data from my storage?
And this is again an abstraction that we can achieve, but it comes with a regret part.
And the solution that we found to this is that we provide a way of describing the layout.
This is the data layout abstraction.
You describe how you would lay out your data in a linear address space.
Now, this description is picked up by the query execution engine, and we compile this directly into optimized code for loading the data. So we don't have one particular implementation for loading PAX and one particular implementation for loading a row layout and one for a column layout.
But instead, we have this, I mean, it's a big chunk of code, yeah, but it processes
this description and it generates exactly that code that you would probably even handwrite
to load this data layout.
Cool, yeah.
So I just wanted to touch on that.
You said that there's like a tuning knob.
This kind of comes from, I was reading a paper a while back where they had a hybrid layout in their system: some data was stored in row format, some in column format. It was an HTAP system that kind of dynamically configured itself based on what the workload was and changed the layout accordingly. Is it possible to do that, or do you think it is possible, to define your own custom layout where the layout may dynamically change on the fly as well? Or is it very much: you configure it one way, you say I want a columnar layout, and then you're stuck with that, well, not stuck with it, but it has to be fixed for the duration of the experiments or whatever? Or could you be dynamic?
Maybe. So at the moment it's static. You make this decision once for a table, and this table is then locked to this layout.
Cool.
We have thought of this migration from one layout to another.
And what you actually can do at the moment is if you have a table with a particular layout, you can just augment that table by a new column.
So that would be possible.
But just transforming it from one layout into another, that's not there yet.
And what's also not supported, and I don't know whether we really want that,
but perhaps someone wants to try that out,
is like having a change in the physical layout
at some point, like, I don't know,
the first million records on row layout,
and then you'd say, ah, but now from now on,
I would like them to be in columnar layout.
I guess it's doable.
Because if you know that the first 1 million records,
this is a fixed amount,
then you can just actually encode that in our data layout
and say, okay,
there's a block of 1 million records,
which is in row layout, and afterwards there comes a block in columnar layout.
Yeah, that's doable.
Yeah, it's up for debate. Maybe it has no value or anything. I mean, yeah, it depends.
But that's cool. So I guess we're on to the last section that you wrote about, which is automated evaluation. What's the problem here? And again, the vision, the approach, and an example.
Okay, so I think on the vision part, there are kind of two things here.
On the problem part, actually, there are two problems, I would say.
So first of all, when we think back of this introduction with Bobby Tables having to do research and then evaluating the idea,
Bobby has to evaluate related works as well.
And as I've said, this can be a pain.
So what we wanted to do is right from the beginning,
compare ourselves always to the major systems
that are out there.
And I mean, we just, some of them we do. So we compare ourselves to Postgres, Hyper
and DuckDB. And we want to automate this process of comparing our system to the others
such that we don't have to do this by hand every time that we change something.
The other thing, the other problem that I see is that if you build a system, you have to be very careful about regression.
And when I say regression, you might initially think of testing, but there's also performance.
And performance regression is also not to be ignored. So we have to be aware that changes to
the system might deteriorate the performance
and we should do something about
it. Because in the long run,
if we make many small
changes that deteriorate performance,
in the end, we might end up with a system
that is of no use.
So
what we did is actually we stole this idea or we copied
the idea from DuckDB because DuckDB has this continuous benchmarking
website where they show their performance over some benchmarks
over the past days and that's a great idea to do and so we also wanted to
do that and we wanted to go one step further and integrate into this process, not only our system, but other systems as well.
So what we have done now is we have built a script, basically, that runs a set of benchmarks that are written descriptively in YAML files.
And basically, it's just a YAML file telling you which data to load into the database,
which queries to run,
which properties to extract from the system.
What we now have done is we have implemented,
we call it database connectors,
like you could know them from your,
I don't know, preferred programming language
having its own database connector.
But it's not really the same thing.
So these connectors that we implement,
they are meant to run a query in the benchmark setting
and extract benchmark information from that system.
So for example, we have one connector for Postgres,
which then runs a query, does some timing,
and reports back this timing information.
And the same we have for Hyper and DuckDB and our system.
And now this script that runs these benchmarks
picks up these descriptive YAML files,
runs them in all these systems,
and produces output that we just insert into a database.
And so we have a database of all our measurements from all the time since we started doing measurements. And then what we did is we built a REST API on top of the database, such that we can query this information from the web and also immediately get some aggregated data. And then we built a web application in Flutter
to visualize this benchmarking data.
And perhaps we could just have used Grafana.
I don't know.
Anyways, we did it.
We built this web application in Flutter
and you can now visualize the recent experiments
or just of a particular date.
You can just go to a particular date.
You see all the experiments that were run, all the results that were gathered of Mutable and the other systems.
You can go to the continuous benchmarks tab and you can see over the past months or years how the performance has evolved, not only of our system, but also of the other systems.
And now that's really the neat feature here is once you have that data, you can analyze
it.
And so what we do at the moment is we take a look at our performance over the past two
weeks, and we compute the standard deviation of the performance. And then we apply some threshold, some multiplier, to that standard deviation.
And whenever now the performance peaks out of this range,
we say there's a performance anomaly.
And we issue a warning on the webpage,
as well as we send mails to the developers
that have committed to the system
in that range of time since just before the
anomaly. So we can actually find the people responsible for this performance anomaly.
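The anomaly check described here boils down to: over the recent window of measurements (the past two weeks), compute mean and standard deviation, and flag a new measurement that falls outside mean plus or minus k times the standard deviation for some multiplier k. Here is a minimal C++ sketch of that check; the window and the multiplier are illustrative assumptions, not mutable's exact configuration.

```cpp
#include <cmath>
#include <vector>

// Flag a performance anomaly if the new measurement lies outside
// mean +/- k * stddev of the recent window (e.g. the last two weeks).
bool is_anomaly(const std::vector<double> &recent, double new_value, double k = 3.0) {
    if (recent.size() < 2) return false;
    double mean = 0.0;
    for (double x : recent) mean += x;
    mean /= recent.size();
    double var = 0.0;
    for (double x : recent) var += (x - mean) * (x - mean);
    var /= recent.size();
    double stddev = std::sqrt(var);
    return std::fabs(new_value - mean) > k * stddev;
}
```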
And you could blame them. I was going to say, it's like an early warning system to know you've
got a problem on the horizon. You've pushed some dodgy code. But again, I guess if it's a
performance spike and it goes up,
I mean, maybe it's a good thing, right?
As long as it stays up there, right?
Yes, we've thought of that already.
So on the website, when you have this
performance anomaly, you can actually
sign into our website using
our GitLab, to which only we have access at the moment.
But anyways, you can just sign in through GitLab.
And when you do this,
you can actually configure or respond
to such a performance anomaly.
And you can say it's either expected.
For example, you did some improvement,
you ran it locally and saw,
ah, it increased performance.
So now this performance anomaly is expected
because I improved the code.
You could say it's a false positive,
for example, because someone at midnight
was doing updates on this benchmark server.
And then you can say
it's a confirmed performance anomaly.
And when you just click that button on the website,
it creates immediately an issue in our GitLab instance
and assigns all the people that have changed the code
in that time to this issue,
such that these people are immediately made aware of it. And then when you close the issue in the GitLab, it also removes the performance anomaly from the web page.
That's really cool. That's a really nice system. Yeah, I mean, it's one of those where you'd be panicking when you get a notification, like, oh wow, have I done something good or something bad? But no, that's fantastic. So how easy is it to add either new systems to this framework
or new benchmarks to this framework?
And what's the sort of coverage in terms of benchmarks
that are in there at the minute?
Are we talking all the classic ones like YCSB, TPCC, TPCH,
etc., etc., etc.?
Or is it just kind of handcrafted benchmarks at the moment?
It's a mix. It started with handcrafted benchmarks, particularly because when we started doing the code generation part,
so compiling queries, then we had to do benchmarking of individual operators.
Then I did research on join ordering. So I wrote benchmarks for that.
Then we integrated TPC-H
or at least part of it into the system.
But I must say that there are many parts of SQL
that we don't support.
So we have to take a look at the TPC-H queries
and see which parts do we actually support.
Some more features are being added over time, but as this is a research project and it's only driven by research, there's no one really taking care of that, to be honest, you know. But we have TPC-H in there, and you can see the performance on TPC-H, and no one has cared to add more benchmarks, to be honest.
Cool, cool.
Yeah, so that's great. I mean, I guess my next question is, we've touched on all the good stuff, but what are the limitations of mutable, and what areas of it are currently... I know it's sort of a developing project and it's not feature complete.
But yeah, so what are the limitations?
And maybe I've kind of been primed for this a little bit, but maybe if you could touch on the limitations around concurrency control and things such as that.
Yes. Okay. So in its current state, Mutable is a prototype system, and it's therefore lacking much functionality that you would expect from a complete database system.
For example, concurrency control simply isn't there.
And the reason why that is, is because the development has been driven by the research, and we just haven't been researching on all parts of a system yet.
But to make this more concrete, so when you consider, for example, concurrency control,
it's not implemented, but how would you go about implementing it?
And that's really a tough question because one major limitation of Mutable is actually its design.
Because we said we want to do
everything as components, and these components have APIs. This means that any extension or algorithm that you want to add to the system must be squeezed into this API. But I cannot guarantee at the current state that this API is rock
solid and will remain the same for the next decade. The problem really is that we might have to adapt.
So I'm pretty certain we will refactor the API in the future
because at some point we or someone else
wants to use the system, wants to implement something
and says, with the current API, it's simply not possible.
It's not generic enough.
So there's certainly a limitation that we are aware of,
but we have a happy culture of refactoring.
So that's not a problem, I would say.
So if we figure out that there's a problem in the API,
we will rework it.
On the other side, when it comes to limitations,
it's a tough objective to split a complex system like a database system into small, independent, exchangeable components.
And during this step of doing so, one might increase the development effort of the system. Because now if we have to do refactoring of the API,
for example, this would also mean
that we are forced to update all the components that implement this API, because the old one is now deprecated.
So at this point, it might increase the maintenance cost.
On the other side, one could perhaps argue
that it may also reduce the maintenance cost
because
through this componentization,
the individual part of the system
becomes simpler to understand
and so they become more maintainable.
But that's
something that I cannot tell yet.
The future has to show.
Cool.
Going off that, obviously it doesn't have all the features you want, it's not the final product yet, it's still very much in a prototype state. But is it usable from the point of view of: I can develop some new join ordering algorithm and I can use your system tomorrow in my work to generate results for me? I guess the question that I'm getting around to is, how usable is it at the moment?
The honest answer is, it depends on what you're after. If you're after something that we did research on, I would say it's pretty usable and particularly
fitted for research.
So it's really meant to just plug in a new experiment, run it, produce measurements.
That's really the purpose and it does that job quite well in those areas where we did
research ourselves and have evolved this system.
In other parts, at the moment you might be left alone.
If you say, now I want to do concurrency control, and you go to mutable, yeah, there's nothing there. So I have a master's student now starting a master thesis on this particular topic of concurrency control, but at the moment it's not there.
Cool, cool. But in the future I'm sure it will be. So I mean,
obviously I'm imagining the code is
publicly available right, we can go and find that I guess
in the GitLab repo and we can
link to that in the show notes and stuff
Yes, so we just recently open-sourced Mutable, because we wanted it to be open source for the CIDR conference.
So, yeah, it's open-source and it's under the AGPL license.
Cool. So have you had any feedback yet, any initial feedback from anybody outside of the creators?
Have you had any?
Yeah.
Yeah, that's good. That's good. Feedback's good.
Yeah, one guy came and he said, benchmarks aren't working.
And I talked to him and it turns out like when you want to run the benchmarks,
you first have to get the benchmark data and there's a script that pulls the data,
but it's nowhere written that you have to do that.
Ah, okay.
So it's just a documentation problem.
Yes, fine.
That's fine.
That's no big deal.
But that's good, right?
That's good.
Getting feedback is always good. So this kind of leads on to my next question, in that this podcast is obviously aimed at anyone who's basically interested in computer science, right? But I always like to try and bring the research around to how people who work in industry can maybe leverage whatever it is we're talking about as well. So is there a way a software developer, maybe someone working on databases for some database company or whatever, can leverage mutable in their day-to-day workflow? Or do you think, since it was designed for academia, it kind of makes sense for it to be primarily used in that domain? Or does it have any use outside of that domain?
Perhaps, I would say, for a particular use case. So as I've said, my main interest in research was query processing, and so these parts are the most evolved. So if you are after weird, complex queries which you need to run fast, then mutable might be your thing, because this thing is a heavy number cruncher, you know. But apart from that, if you're looking for features or transactions, no.
Cool, cool. This next question is, I
imagine you're gonna have some interesting things to say on
this one. So, what has been, probably, I ask this to everyone as well, it's a pretty standard question, what is the most interesting and perhaps unexpected lesson that you learned while working on mutable?
The first one is systems complexity, which was something that I didn't expect.
On the one side, systems complexity is a beautiful thing, but it's also very frightening.
And it's frightening because one can get the impression of never being done, never being finished.
There's always more to do.
There are always issues around.
It's really demanding, I would say.
One really has to keep the spirits up
to keep working on the project and not be,
how would you say?
Demoralized.
Take it one day at a time, right?
Yeah, exactly.
On the other side, it's very pleasing to see
when the system components start interacting and cooperating together and forming the whole system.
And when that works, it's really enjoyable.
Then what I have learned during Mutable is that third parties can be a pain in the ass.
And mostly for the reason of automating the build process. So, I mean, I wanted to build mutable such that it can also be easily deployed, but then again we have to use third parties, like, for example, V8 from Google, and you have to build that thing. And if you build it for the first time, your computer will be consumed for one hour.
It will be compiling one hour straight.
And this itself, okay, you have to live with that.
But getting this build process started the right way, that's really tough.
Because then you look at another project like V8 and it has these millions of tuning knobs
for the compilation process.
And you have to configure it just in the right way
that you want to use that project.
Then it comes in an update and suddenly things break.
And that's really annoying.
Yeah, yeah.
And I've also learned that CMake can be so useful, but when it's misconfigured, particularly by a third party, it basically renders the third party unusable in your project. So that is what I have seen. And also, in conjunction with that, dependency management in such a project with third parties is also annoying.
I mean, we have mighty tools
to do that,
like Git submodules,
which we don't use,
or CMake external projects.
This is the way that we go.
But still,
then some things change
and it works on one machine, you commit it, it works on one build bot, it works.
And suddenly on someone else's machine, it's not working.
And then you have to think about, do you have to roll back the third party?
But then you're missing on a feature or something.
Yeah, that was really annoying.
Yeah, yeah.
I can imagine.
One thing that I have learned, though, which I'm very happy about, is building a team.
So I had many people contributing to Mutable as part of being research assistant or bachelor or master student.
And it was really a joy to work with those people on the project.
And I hope we can continue working on the project. And so what I think I started learning, and I'm not done yet, is identifying people's strengths and their interests, and then fostering them, coordinating their work, and sparking their intrinsic motivation to work on this project. That's a skill that I didn't have before, and I wouldn't say I have it yet, but I'm on the way to learning it.
Work in progress.
But no, that's great.
I mean, like, this is sort of, it's a very different answer to anything I've had before for that question.
One of those sort of soft skills about how to, like, build a team and how to correctly allocate people to the right sort of tasks and keep everyone motivated.
Especially when, like you were saying, systems development can be quite, what's the word? Demoralizing, like we said before, right? So trying to kind of keep everyone's spirits high and stuff. Yeah, no, that's a really nice answer to that question. Kind of going off that then, another question that I ask quite frequently: progressing research is non-linear, right? There's ups and downs. So from the initial conception of the idea for mutable to the CIDR paper that we've been speaking about today, what were the sort of war stories that you could share along the way, and the things that failed that maybe other people could benefit from knowing about?
Yeah, I have a funny answer to that: there are none. Because we actually had the
submission for the SIGMOD paper on July 15th, and the CIDR deadline was just one month ahead. And my boss was like, huh, there's CIDR, let's write a paper. And we were like, one month? Oh, that's going to be painful. And so this really was just a sprint, just writing down everything that we had done so far on mutable. There wasn't really any time for problems.
Okay, so it was just a sprint then, yeah. There can't be problems, just get it down, get it written, and fingers crossed. Okay, cool, cool.
It was only possible, though, because there was this prior work that's now published in EDBT and SIGMOD.
Only because of that, we were able to rush this out in one month.
Okay, yeah.
That kind of makes sense.
You've done most of the work anyway, and it's just a case of bringing it all together and writing it up.
Cool.
Just one thing.
The major obstacle here, though, was the six-page CIDR limit.
Is it six pages? I'm sure some papers are longer than six.
The mechanism
there is funny.
For submission, it's six pages.
When the paper is being
accepted, you're unlimited.
Oh, okay. Right.
I guess that
makes the reviewer's life a little bit easier, or does
it get reviewed again after you write
like an extended version
No, it's not being reviewed again. So basically my boss was like, let's just paste in the code of mutable.
So where do you go next from here then? I mean, obviously we've spoken about this across the course of the podcast and there's a million and one directions for you to take mutable in. What are the initial next steps?
Yeah, so if I were offered a postdoc position where I can continue my work on mutable, I would say I'd take it, even though I hadn't planned a career in academia.
But I don't know, the project somehow has grown to my heart and also the people that I've been working with.
So I would take the opportunity. And actually, we have started thinking about founding a company.
But that's actually really hard because it's such an academic project with a research goal.
Who are the customers?
That's the question there, right?
Yeah, exactly.
So what's the selling point here?
Yeah, so that really was a frustrating thing to realize, that you have to come up with this idea of how to make money with it. And yeah, we don't have one yet. So that's why I think proceeding with development in academia first, and taking this time to think about how to turn it into a company, that would be one way. But regarding research, what I'm currently looking at is, on the one side, this physical optimization stuff.
So how can we, what can we do during a physical optimization step?
And I think we have some quite cool ideas here.
Because, so we do this query compilation.
And we have kind of like patterns that we apply.
So we know how to compile a hash join.
We know how to compile a selection, and whatnot. So now what we are currently looking at is, how can we leverage more information that we have on the data to steer this compilation process? And the idea came originally from looking at compilers. And if you look at a compiler like Clang or GCC,
you have to realize that these are general purpose compilers
for any kinds of programs.
And what's really the most important distinction
between a compiler and the database system to me
is that the compiler,
the general purpose compiler,
only gets the program as an input,
whereas the database system gets the query as an input and will compile it, but it already knows the data that the program is being run on. So you already know the inputs to the program that you're compiling. And if you do know that, then you can perhaps use that information to steer the compilation process.
So what information do we have on the data?
And as I've said before, we have these fancy sum product networks where we can get statistics on the data in the tables.
What we can do now is, for example, if we get a selection where x < 42, we can actually look at the model and see, aha, this predicate is highly selective, it's like 0.01% of tuples will qualify. Now this makes sense to compile into a conditional branch. But then again, there might be a selection predicate where x < 42 has a selectivity of only around 50 percent, and here a conditional branch is actually not a good choice, because the branch misprediction penalty might be very high. And so this can now flow into this physical optimization process. And then, on the one side, having trained these regression models on the particular hardware, we know how costly a selection using a conditional branch is and how costly a selection using predication is. And then we can steer this compilation process to compile just the right code for the particular machine that we're running on.
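For illustration, here is a minimal sketch of that trade-off in plain C++. It is not mutable's code: the two selection variants are standard textbook implementations, and the cost model with its constants is entirely made up, standing in for the hardware-calibrated regression models mentioned above. The selectivity estimate is assumed to be given (in mutable it would come from something like the sum-product networks).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Branching variant of a selection (x < bound). Cheap when the predicate is
    // very selective or very unselective, because the branch predictor is almost
    // always right; expensive around 50% selectivity due to mispredictions.
    std::size_t select_branching(const std::vector<int32_t> &column,
                                 std::vector<std::size_t> &out, int32_t bound)
    {
        out.clear();
        for (std::size_t i = 0; i < column.size(); ++i)
            if (column[i] < bound)              // conditional branch per tuple
                out.push_back(i);
        return out.size();
    }

    // Predicated (branch-free) variant: cost is roughly independent of selectivity.
    std::size_t select_predicated(const std::vector<int32_t> &column,
                                  std::vector<std::size_t> &out, int32_t bound)
    {
        out.resize(column.size());
        std::size_t n = 0;
        for (std::size_t i = 0; i < column.size(); ++i) {
            out[n] = i;                                          // unconditional write
            n += static_cast<std::size_t>(column[i] < bound);    // predicated advance
        }
        out.resize(n);
        return n;
    }

    // Hypothetical per-tuple cost model: the constants are invented here, but in a
    // real system they would be fitted to the machine, e.g. by regression.
    enum class SelectionVariant { Branching, Predicated };

    SelectionVariant choose_selection_variant(double selectivity)
    {
        const double misprediction_penalty = 8.0;
        const double branch_cost = 1.0 + misprediction_penalty * 4.0 * selectivity * (1.0 - selectivity);
        const double predicated_cost = 2.5;
        return branch_cost <= predicated_cost ? SelectionVariant::Branching
                                              : SelectionVariant::Predicated;
    }

With these made-up constants, a selectivity of 0.01% picks the branching variant and a selectivity of around 50% picks predication, which is exactly the behaviour described above. In an actual query compiler the decision would be taken at compile time and only the chosen variant would be emitted as code, rather than dispatched at runtime as in this sketch.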
I mean, that is a really interesting direction. Yeah. So my next question is, we've touched a
little bit on your other research, but if you want to elaborate on any more, please go ahead now.
If you want to touch on maybe the upcoming EDBT paper,
I'll obviously mention the SIGMOD one that was last year. Or was it this year? Is it SIGMOD 2022?
It's this year, 2023.
I thought it was 2022. Ah, sorry.
No, no. We submitted in 2022. I mean, the cycle is quite long for SIGMOD.
Yes. Yeah.
Okay. So for the EDBT paper, which is about this compilation of SQL queries to WebAssembly, I think I have stated most of the key takeaway messages already.
So if any of your listeners are interested in how we actually compile this to WebAssembly, have a look at the paper, but bear in mind that the way we implemented it in mutable has already been overhauled,
in the sense that we don't write
the code generation directly now,
but we have this DSL in between.
It looks like you're writing C code,
but actually you're doing code generation.
Makes it a whole lot more accessible.
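As a flavour of what such a DSL can look like, here is a tiny made-up example in C++. This is not mutable's actual DSL; it only illustrates the trick that overloaded operators on "query variables" do not compute values but build up code, rendered here as WebAssembly text for readability.

    #include <string>

    // A generated code fragment (textual stand-in for a real IR / Wasm builder).
    struct Expr {
        std::string code;
        explicit Expr(std::string c) : code(std::move(c)) {}
    };

    // A "variable" in the DSL. Using it in expressions generates code.
    struct Var {
        std::string name;
    };

    // Overloaded operators perform code generation, not evaluation.
    inline Expr operator<(const Var &lhs, int rhs)
    {
        return Expr("(i32.lt_s (local.get $" + lhs.name + ") (i32.const "
                    + std::to_string(rhs) + "))");
    }

    inline Expr operator+(const Var &lhs, const Var &rhs)
    {
        return Expr("(i32.add (local.get $" + lhs.name + ") (local.get $" + rhs.name + "))");
    }

    // Looks like ordinary C code...
    // Var x{"x"}, y{"y"};
    // Expr pred = x < 42;   // ...but this line emits a Wasm comparison
    // Expr sum  = x + y;    // ...and this one emits a Wasm addition

The point is that "x < 42" reads like ordinary C, while what actually happens is code generation; mutable's real DSL is of course far richer than this toy.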
Yes, and then the other research paper is the SIGMOD paper, and we haven't spoken of
that yet, I think.
So here it's a whole different topic.
So it's about join ordering. And actually, what's interesting is how this idea got sparked.
So you have to know that our chair, our database chair is co-located on the same floor
with the foundations of artificial intelligence chair
and their domain of expertise is automated planning.
And so at some point we were sitting in the kitchen
and I had a chat with a colleague from that chair
and I was describing this problem of join ordering to them.
And then he was thinking about this problem from his point of view of a planning task.
And it turned out that this can be seen as a planning task, of course, but it's not one that you would usually find in their domain, because from the perspective of planning it has a particular property: the state space changes with every action. Anyway, you could argue that it fits into this lifted planning domain, which, from a quick search backwards, I think only started around 2019, so it's recent research.
So what we then did is we decided that we perhaps should collaborate on this problem,
and we had a bachelor student looking for a thesis.
So we said, how about you implement join ordering as a planning task in PDDL, the planning domain definition language that they use,
and then you just feed it into a solver.
It's named FastDownward.
And then we just have a look what's coming out. So how well does a regular general purpose automated planning solver work on our problem?
And the results were quite disappointing in several ways. Because on the one side, the solver has to translate from this description of the problem into an intermediate representation, and thereby it is already doing an exponential amount of work. So that's already not a good idea. But then, once that was done and you could run the solver on this description, well, the description is exponentially large in size, so there's not much to gain. But what we saw is that the heuristics for planning that were provided in this off-the-shelf planner, they are general purpose, but they don't really help in our problem. They're not informative. And also another problem, and that's an implementation detail: this solver is implemented for all kinds of planning tasks, and so it's not a domain-specific solution, and therefore it is slow, there is some overhead in it. And so after the bachelor thesis, I decided to implement my own software in mutable that is domain specific
for our problem. And basically I implemented many variants of the A-star search algorithm.
And then I did some memory and compute optimizations for the representation of this
problem, because now we can do that, we are now domain-specific. And also I got some statistics of this planning process to understand what's actually going on in the search,
because we didn't really have that much insight
in the beginning.
And then we started just doing experiments
and writing this up.
And then at some point I spoke again to my colleague
from the artificial intelligence group,
whose name is Daniel Gnade. And I'm very thankful for his advice during this time. So I discussed with him this
problem, and then I spotted an optimization opportunity that already has a name in this
planning community. It's called a partial order reduction. And basically what it means is you can
figure out that you have a search problem, and there are several paths that go from the same start to the same goal.
And these paths are actually identical up to permutation of the actions.
And you don't want to consider them because they're of no value.
So consider, for example, you have a bushy join plan and you have A join B on the one side, C join D on the other side,
the order in which you do those
doesn't really matter, right?
It doesn't matter.
And so you don't want to consider those.
And then we figured out
that there's an optimization potential
and we also implemented that.
And it turns out that now
the heuristic search works wonders.
It's really fast and particularly fast
on tough optimization problems.
Like if you have dense queries, like, I don't know, like 20 relations in a clique shape or
something similar to a clique, these are really tough problems, and heuristic search gets there much faster. Like, we had experiments where it's a thousand times faster than the state of the art.
And this number is just, it's coming from the fact that we're just asymptotically faster now in solving this problem.
And the crucial part is to understand this.
When you look at these textbook algorithms like dynamic programming and top-down dynamic programming, they are all exhaustive.
They enumerate all these existing plans.
They have to enumerate all of those that exist and pick the best.
Now, what you can do with the heuristic search
is you can prune some away
and still prove that you remain optimal.
And that's really a game changer, I would say.
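To give a rough picture of what such a search can look like, below is a drastically simplified toy in C++, not the algorithm from the SIGMOD paper. A state is just the set of subplans built so far, an action joins two of them, and the action cost is a hypothetical cardinality estimate. Because the state forgets the order in which earlier joins happened, plans that only permute independent joins, like the A join B / C join D example above, collapse into one state, which is the effect the partial order reduction achieves in the real, richer formulation. The heuristic is left at zero here for brevity (so this is effectively Dijkstra); an informative heuristic is what lets the real search prune work while staying optimal.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <map>
    #include <queue>
    #include <set>
    #include <utility>
    #include <vector>

    using Subplan = std::uint32_t;          // bitmask of base relations in a subplan
    using State   = std::set<Subplan>;      // the set of subplans built so far

    // Hypothetical cardinality estimate: product of base-table sizes (no real
    // selectivities, no query graph -- purely for illustration).
    double estimate_card(Subplan s, const std::vector<double> &sizes)
    {
        double card = 1.0;
        for (std::size_t i = 0; i < sizes.size(); ++i)
            if (s & (Subplan(1) << i))
                card *= sizes[i];
        return card;
    }

    // Best-first search over join states; returns the minimal sum of
    // intermediate-result cardinalities over all bushy join trees.
    double cheapest_plan_cost(std::size_t num_relations, const std::vector<double> &sizes)
    {
        State start;
        for (std::size_t i = 0; i < num_relations; ++i)
            start.insert(Subplan(1) << i);

        using Entry = std::pair<double, State>;   // (cost so far, state)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> open;
        std::map<State, double> best;             // collapses permutation-equivalent paths

        open.push({0.0, start});
        best[start] = 0.0;

        while (!open.empty()) {
            auto [g, state] = open.top();
            open.pop();
            if (g > best[state]) continue;        // stale queue entry
            if (state.size() == 1) return g;      // one subplan covers all relations

            std::vector<Subplan> subplans(state.begin(), state.end());
            for (std::size_t i = 0; i < subplans.size(); ++i) {
                for (std::size_t j = i + 1; j < subplans.size(); ++j) {
                    Subplan joined = subplans[i] | subplans[j];
                    State next = state;
                    next.erase(subplans[i]);
                    next.erase(subplans[j]);
                    next.insert(joined);
                    double g2 = g + estimate_card(joined, sizes);
                    auto it = best.find(next);
                    if (it == best.end() || g2 < it->second) {
                        best[next] = g2;
                        open.push({g2, next});
                    }
                }
            }
        }
        return -1.0;   // not reached for num_relations >= 1
    }

Under these toy assumptions, cheapest_plan_cost(4, {1000, 10, 500, 20}) returns the cheapest sum of intermediate-result sizes over all bushy trees of the four relations, whereas the exhaustive dynamic-programming algorithms mentioned above would enumerate every plan before picking the best.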
Yeah, wow, that sounds amazing. The number itself, a thousand x, sounds...
Yeah, but again, let me put this into perspective. It's on the tough nuts, you know. Like, if you really have a hard case where you have many relations and many edges between those relations, then yeah, it excels. But if you just have, like, a chain of relations, then your standard dynamic programming will be better.
Cool. But I mean, obviously it has some utility, right? If you can integrate that into a system somehow, one that can tell whether to use this approach or that approach based on what the query is like, then it feels like a big win as well. Cool. And so, kind of just going off how you do your research and whatnot, how do you go about generating these ideas?
I mean, a lot of it seems to be sort of just being in an environment where you're having conversations with people from other domains, maybe, and kind of that spark kind of happens.
But do you have a systematic way of approaching generating ideas and then selecting ones to work on?
Honestly, I think I don't have really much advice to give here. I don't consider myself a visionary person, to be honest. I think of myself more as a problem solver, an engineer. And so, as you've put it, being in a situation where you're confronted with a problem, that sparks my interest and gets me thinking.
Also, I should say that considering how slowly
I progressed during my PhD, I was really slow.
I should add the disclaimer that what I'm about to say
is to be taken with a grain of salt.
So it's my personal experience.
And please don't take this as advice.
But what I think, if someone's listening and is an ongoing PhD student, then perhaps what I could say is do what you do best.
Play to your strengths and don't blindly follow the next trend.
You will excel at what you do best. And if you don't know your strengths yet
because you just started your PhD,
then follow your personal interests
and don't let your PhD advisor get in your way.
Find a way to convince him or her of your ideas.
And looking back at the SIGMOD paper,
what really has set this off
was looking beyond the horizon, like speaking to people that were outside of my domain, and also where the conversation sometimes put me out of my comfort zone. Because, you know, I have this nomenclature, the terms and this way of thinking about databases, but now we look at the problem from a very different angle. And when discussing it with someone from a completely different community, they use different terms and describe problems differently. It can be uncomfortable to assume their point of view, but it may be very worthwhile.
Sure. I think, like you say, putting yourself in an environment where you are going to be out of your comfort zone instantly gives a more fertile environment for new ideas to generate, right? Because if you stay within your comfort zone all the time, I mean, I don't know, you need that spark. And I think sometimes stepping out of your comfort zone, speaking to somebody else about a problem who's from a different area, can, yeah, I mean, it might end up with no ideas, right? But you never know, you might end up with a really good one as well. So I think that's good, solid advice.
It's trial and error, exactly.
Right, yeah, that's just the scientific way, right? So, last question now. What do you think is the biggest challenge in your research area? And I don't
know how you'd go about actually categorizing your research area, because it's quite a novel sort of topic in itself, right, in the sense of building an academic database system that people can actually use. But yeah, what do you think is the biggest challenge in the area, and maybe even in the wider space of databases?
That's really a tough question. I mean, what I felt during my time as a PhD student is that when talking to other PhD students, they were always interested in communication and sharing ideas and perhaps even collaborating.
And I think what would be cool is if we could
motivate our community to work more collaboratively
on software.
And I think generally, working on software is currently a problem for research in our domain, in databases, you know. And I must say I'm a bit copying, or paraphrasing, what Peter Boncz and others have said at this memorial for Martin Kersten at CIDR. They said that Martin Kersten was always thinking of how to defend his employees or PhD students that were working on software and how to get them credit for their work in the academic domain. And you may know this, because you have been, or you still are, a PhD student,
no, you have handed in, right? I'm kind of in the hinterland between sort of submitted,
I haven't defended yet, but yeah, I'm still kind of technically one, even though I have a full-time job now.
But yeah.
But I guess then you had the same experience.
So you have to implement something in a system
to do your evaluation.
And this takes a lot of time, time in which you're not writing a paper and you're not publishing.
And this is a problem of systems research, right?
And I think that we could perhaps at least help people
that are considering a PhD in this domain
if we could somehow better balance this programming effort and
systems development with the actual part of writing papers.
And perhaps a system like Mutable that tries to be a system for researchers could help
them just lift the burden off their shoulders.
Totally agree.
If you can minimize the overhead of the implementation effort, I mean, because I guess in systems it can be anything between, like, three months to two years, right, depending on what the system is you're building, if you're going to build it from scratch yourself, which, like you say, is a huge amount of time of a PhD. It's not dead time, but it's time in which you are not producing papers, the things that you eventually get judged by, right, in terms of what contributes towards your thesis, or if you want to stay in academia and whatnot, the things you need if you want to progress in that career. So totally agree, if you can minimize that overhead somewhere and make it a more streamlined experience, or less time consuming, I think that is a massive win. So yeah, totally agree with that. And I mean, we can end it on that note. Thanks so much, Manuel, for coming on the show. It's been a fascinating chat, and if the listener is interested in knowing more about Manuel's work, I'll put links to all of the relevant materials in the show notes, and we will see you next time for some more awesome computer science research.
Thank you, Jack. Thank you.