Disseminate: The Computer Science Research Podcast - Immanuel Haffner | mutable: A Modern DBMS for Research and Fast Prototyping | #21

Episode Date: February 6, 2023

Summary: Few to zero DBMSs provide extensibility together with implementations of modern concepts, like query compilation. This is an impeding factor in academic research. In this episode, Immanuel Haffner presents mutable, a system that is fitted to academic research and education. mutable features a modular design, where individual components can be composed to form a complete system. Check out the episode to learn more!

Links: Paper | Website | mutable GitHub repo | Bobby Tables xkcd

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. This is the first episode in our CIDR series, and I'm delighted to say I'm joined today by Immanuel Haffner, who will be talking about his research. His interests are databases, and specifically he has looked at query processing in the past. Immanuel, thanks for joining us on the show. Hello, thanks for having me. Great stuff. Well, let's jump right in then. So can you start off by telling the listeners how you became interested in database research? Okay. Yes, it was kind of an accident, perhaps you could say that, because I graduated in compilers, actually. So I did my bachelor's and master's in compilers.
Starting point is 00:01:14 I was working on Clang and LLVM. Sadly, my professor didn't have an open position for me as a PhD student. And it turned out that in my last semester of my studies, I was taking the database class of my now boss, Professor Jens Dittrich. And this database class was actually really interesting. I really liked the way that we connected theory to practice. And also there was quite a lot of coding
Starting point is 00:01:46 involved, and I liked that. So I graduated best in my class, and then Jens actually approached me and asked whether I would like to work for him as a research assistant or something. And I was laughing and saying, no, actually, I'm done. We could talk if you had a PhD position. I was making a joke. And then I got a mail a few weeks later from him inviting me over for a chat about that position. And I wasn't really convinced I would like to do that. But then we had this chat and it was a really good offer.
Starting point is 00:02:27 And I was also eager to work in a domain where we have compilers and databases collaborating, like query compilation. So that's how I got into databases. That's a really nice origin story. I like that. Do you find that there are plenty of things from your experience with compilers that you can apply? I mean, maybe we can touch on this over the course of this conversation, because I kind of feel that influence in this work a little bit. But did you find that a lot of transferable skills from that applied very nicely to databases? Yeah, definitely.
Starting point is 00:03:09 For one, I can say that at the time when I was starting my PhD, compiling to LLVM was kind of the hot topic. And I knew LLVM very well, so I was able to pick that up quite easily. Okay, great. So can you set the scene for your paper? All right. So the introduction is kind of my own story, rephrased as someone else's. When I started my PhD at this group, no one there was working on a complete system.
Starting point is 00:03:45 So people were doing fundamental research under lab conditions, I would call it, like writing code just for particular experiments. And this means that if you have a new research idea, it's kind of hard to give it a quick try, because you don't have a system at hand that you could just plug your idea into, right? And you also suffer from there being no one next door who is familiar with the system that you could just ask for help. This is one problem that I spotted.
Starting point is 00:04:19 And the other problem is that if you're a new PhD student, you have to come up with ideas, right? But this can be really hard. And if you are working on a system, or you have a system at hand, then it may have flaws, it may have issues, and this can be a spark of inspiration and lead you into your PhD. And I would say that during my time as a PhD student, I've spoken to some people who shared this experience. So how does this lead to the introduction of the paper? In the paper, I'm telling the story of Bobby Tables,
Starting point is 00:04:55 who you could think of as me, because we witnessed the same story. So consider Bobby Tables, who just recently started pursuing a PhD in database research. Now, Bobby has this idea that he or she might want to implement and give a try, but has no database system at hand. So what Bobby can do is go to the database of databases, or in short, dbdb.io. And there you have a list of, I think it's 877 databases. And you can just type in search queries for different categories of databases that
Starting point is 00:05:42 you're looking for. So Bobby could just start typing into the search engine what kind of system Bobby is looking for, to find a system where Bobby can implement the ideas. And there are plenty of systems, as I've said before, so you could guess there will be one that fits Bobby's needs. But as I show in the introduction, if you have particular interests, then you're left with only very few systems. And when you just have, like, two systems available in the end, then, well, you're stuck with
Starting point is 00:06:23 the design decisions that are baked into these systems. And another problem that arises is not only that you have so few systems available for the particular kind of research that you might be interested in, but also I found that documentation is lacking. And I have to explain this, because when I say documentation is lacking, I don't mean that these systems usually come without documentation. If you consider, for example, PostgreSQL, of course there's documentation. There's plenty of it. But this documentation is targeted more towards database administrators and database users.
Starting point is 00:06:59 But Bobby is a researcher. So he or she needs guidance on how to implement the idea in the system, and I had the feeling that this is lacking. And now, if you're wondering why the main character of the introduction is called Bobby Tables: this is actually a name taken from xkcd, and I also looked it up. It's xkcd number 327, and I really like that xkcd. It's quite funny; it's about SQL injections. You can give it a read. And also, what's quite nice about the name is that Bobby is both a male and a female nickname, so it's also a way of inclusive writing, I would say. Amazing, thanks. Yeah, I mean, that experience you described at the very start of your PhD was exactly the same as mine. I know it's true of most PhD students in databases, right, where you're kind of left with this scenario of building a very small prototype to show your thing.
Starting point is 00:07:57 But then that makes comparison very hard, because you have to reimplement things, and it's a bit of a mess. And if you go down the other route, taking an existing system, then documentation on the internal components of that system is pretty much non-existent, right? I mean, there are a few systems where it's okay, but by and large it's really difficult. And these systems are so complicated, especially some of the older ones that have evolved over such a long period of time, right? It's very hard to go in there and actually run your experiment, or do what you want to do with it. Let me tell you a fun story. You know, I sometimes listen to the Stack Overflow podcast.
Starting point is 00:08:38 And on one of these episodes, they said that the founding idea behind Stack Overflow was that documentation is a myth. There is no documentation. Everyone claims to do it, but it's never sufficient. So the radical new idea was: instead of having documentation, let people discuss the problems in an open forum, and you can just search that forum, and this forum becomes the actual documentation of the software. Genius idea, right? Incredibly successful, right? I mean, self-generating documentation, in a way. I mean, where the hell would we
Starting point is 00:09:16 all be without Stack Overflow? So yes, it's a crutch I rely on most days. That's really cool. Great. So obviously it's a new paper; we've set the scene here. We've gone through the story of Bobby Tables and come to the conclusion that you're going to design a database system that allows us to basically solve all of these problems, right?
Starting point is 00:09:40 And it's called mutable. So can you, first of all, maybe tell us about the design goals for this idealized database management system for Bobby Tables? Yes. So when we say design goals, let me just clarify that I won't talk about the software design goals now, but more about the conceptual design goals. So how would we want to help Bobby Tables and put him or her out of their misery? I would say that what we wanted to do is build a new system that is primarily targeted at researchers and developers. So the system is not designed to be a product in the end, right? It should be designed to be used for research and prototyping.
Starting point is 00:10:27 So that's why the title. Yeah, it has to serve as a unifying framework for database research. And I find this is quite an important aspect, because if you're doing research, at some point you need to evaluate your work, and you need to compare your work to related work. And this can take a lot of time, because you have to get this related work. It may reside in some code base to which you have no access, or the code base is outdated and the code isn't compiling, or whatnot. Many things can go wrong. So just doing this comparison to related prior work is already demanding.
Starting point is 00:11:06 And so what I want to say is that mutable should serve as a unifying framework where multiple implementations of various research ideas can coexist in one system. And you can just flip a switch and use the alternative algorithm or approach. Another design goal is that we impose as few design decisions on the developer as possible, such that the developer is free to pursue any alternative path for solving a particular problem. And I mean, this is also kind of a limitation that we will probably discuss later, so I'm not going too much into detail now. We want mutable to be flexible and configurable by the developer. By flexible, I mean much the same thing as before: if you impose few design decisions, you are flexible in the sense that you can implement many different kinds of approaches. And by configurable, I mean that you have this ability to flip a switch and mutate the system behavior.
Starting point is 00:12:14 And that's also why we named it mutable. And because there's a table in the name. It's a good name. Thank you. Lastly, mutable should provide documentation that targets developers and eases onboarding. And this should at least be the getting-started section of a Markdown document where you learn how to download and compile the project. But it should go beyond that and instruct people on how, if they're interested in providing a new algorithm for a particular part of the system, they would go about it. Cool. I love the sound of this system. So we mentioned earlier on that there were other systems. If I go on dbdb.io,
Starting point is 00:13:02 has someone else tried tackling this problem before? And if that's the case, how far do those systems go towards achieving these design goals? What's missing from them, and what's the gap that mutable fills? Yeah, that's a tough question. So of course, we did have a look at related work. And what we found there is that, I mean, we're not there yet, but this modular design of mutable has been explored years back. So we had a look at two systems in particular, named Genesis and Exodus. Genesis explored this modular design of a database system but focused heavily on storage management. The other system, Exodus, is more of
Starting point is 00:14:01 a toolbox providing many building blocks for a whole system, and then you have to plug them together to make a whole. I think mutable is not trying to be this toolbox. It's more of this Genesis approach, where we have this modular design, but we would like to extend it to the whole system, so that every part of a database management system can be unplugged and replaced by an alternative implementation.
Starting point is 00:14:37 Cool, cool. So, following on from this, let's dive into mutable a bit more. There are five sections in your paper, and each one outlines a different problem, the vision for it in mutable, and gives an example of how mutable approaches solving this problem. So let's start off with components: give us the problem, the vision, and the example. Okay, so as I've said, we want to do a modular design, and we didn't call them modules. We called these things components, perhaps because I didn't like the name clash with the new upcoming C++ modules. I don't know. I just preferred the name components. Conceptually, it's the same.
Starting point is 00:15:17 What we want to achieve in mutable is that we kind of dissect a whole database management system into its individual parts, the components. And now we can provide alternative implementations for any of those, plug them back into the system, and they just cooperate and work together. And we have four software design goals in mutable that we also want to achieve through this component design. So let me just elaborate on the software design goals. First of all, we have extensibility. Our system should be extensible, meaning that you can always add another algorithm that solves a particular problem. And I would say it's kind of different to
Starting point is 00:16:07 a plug-in mechanism. For example, in Postgres you can define your own data types through plugins, something like that. It's not the same thing. Extensibility here means you have a part of the system for which there might already be an implementation, but now you can provide another one. Then we have, as a software design goal, separation of concerns. Here we mean that if you design one component, you should not need to know how any other component is implemented. Merely the API of the other components might be of interest to you, in case you make use of it. So let me give you an example. For example, if you do plan enumeration, as we call it (I think a more common term would be join ordering),
Starting point is 00:16:57 where you have to come up with different join orders. To do so in an efficient and meaningful way, you need to know the cardinalities of subproblems. And there is a component in mutable that does exactly this cardinality estimation. So now this plan enumeration component will use the given cardinality estimation component to estimate those cardinalities. However, there can be various implementations of this cardinality estimation component. You could use histograms, you could use, I don't know, neural networks. Or, what we have actually done, is sum-product networks, which are also a machine learning approach, but they are placed in the domain of interpretable ML,
Starting point is 00:17:40 and that's kind of cool. So no matter the implementation of this cardinality estimator component, the plan enumeration component should be separated from it. And that also guarantees that we can just replace one thing without breaking the other things in the system.
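[Editor's note: to make the separation of concerns concrete, here is a minimal C++ sketch. The interface and class names are invented for illustration; this is not mutable's actual API.]

```cpp
#include <memory>

// Invented names, for illustration only -- not mutable's actual API.
struct Subproblem { /* e.g. the set of relations joined so far */ };

struct CardinalityEstimator {
    virtual ~CardinalityEstimator() = default;
    // Estimate the number of tuples produced by a subproblem.
    virtual double estimate(const Subproblem &P) const = 0;
};

// One possible implementation among many (histograms, SPNs, ML models, ...).
struct HistogramEstimator : CardinalityEstimator {
    double estimate(const Subproblem &P) const override { /* ... */ return 42.0; }
};

// The plan enumerator sees only the abstract interface; it neither knows
// nor cares which concrete estimator answers the question.
class PlanEnumerator {
    const CardinalityEstimator &CE_;
public:
    explicit PlanEnumerator(const CardinalityEstimator &CE) : CE_(CE) { }
    void enumerate(/* query graph */) {
        // ... ranks candidate join orders using CE_.estimate(...) ...
    }
};
```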
Starting point is 00:18:03 Okay, so our third design goal is the kind of software design goal that you find in every piece of software. It's called abstraction. Who would have thunk? But let me just try to make this a little bit more expressive. With abstraction, I want to achieve that we define a nomenclature in the system that fits our academic terms, what we learn in the lectures, what we use in teaching, such that you have a very easy one-to-one correspondence from your lecture materials
Starting point is 00:18:33 to the implementation and the code. And this, of course, if we can achieve it, makes it very easy for newcomers to start working with the system. And then our fourth design goal is actually a follow-up on the third design goal, because usually abstraction comes at a cost. So our fourth design goal is abstraction without regret. And also here, I think I should elaborate a little bit on what this regret part is.
Starting point is 00:19:10 So if you think of abstraction in an object-oriented programming language, you can say you define an interface, and this interface defines some method. Now you can have classes implementing this interface, and these classes override this method. If you have a pointer to an object of this interface type and you call the method, at runtime you have to deduce the runtime type of the object. This is called dynamic dispatch, and a common implementation is virtual
Starting point is 00:19:46 methods, which are then compiled using vtables, for example. And this overhead is the regret part of abstraction. It can be a pain if you have functions or methods, such as virtual methods, that are called frequently but do only little work. And there is one particular use case in database systems where that occurs, and that is interpreters. Query interpreters suffer from this overhead. And this has led away from the initial Volcano iterator design of query execution towards vectorized interpretation, where you accommodate for the call overhead by processing a batch of values per call.
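[Editor's note: a minimal illustration of that regret and of the vectorized workaround, in generic C++ rather than mutable's code.]

```cpp
#include <cstddef>

struct Tuple { /* attribute values */ };

// Volcano-style: one virtual call per tuple. The dynamic dispatch
// (vtable lookup, no inlining) dominates when each call does tiny work.
struct Operator {
    virtual ~Operator() = default;
    virtual const Tuple *next() = 0;             // called once per tuple
};

// Vectorized: one virtual call per batch, so the same dispatch overhead
// is amortized over, say, 1024 tuples at a time.
struct VectorizedOperator {
    virtual ~VectorizedOperator() = default;
    virtual std::size_t next(Tuple *out, std::size_t capacity) = 0;
};
```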
Starting point is 00:20:34 So yeah, these are the four software design goals that we want to achieve with this component design. Cool. I just wanted to roll back a sec to highlight what the components are. What did you decide the boundaries would be in a database system, and how easy was it to come to that decision? Was it obvious that you were going to split it up into storage engine, plan enumeration, etc.? And were there other ways in which you could have divided the system up? Okay, that's a very interesting question. Some parts were kind of obvious, and some parts are not even done yet. I have to be honest here.
Starting point is 00:21:20 I mean, this is a work in progress. So for many things, we don't have a solution yet. Yet, that's the key word. Yes. Consider, for example, concurrency control. So modern concurrency control, you would say perhaps multiversion concurrency control. And then you look at implementations of those, you have these chains of versions of a tuple. And usually you need to integrate this with your storage engine. And this would kind of violate the separation of concerns part, right?
Starting point is 00:21:56 So I don't have an answer yet to how we will do this. But our research agenda is to find a solution to this problem of designing a multiversion concurrency control system without having to bake into the storage engine that you have these version chains. So it's an open problem. But I think this is also quite interesting
Starting point is 00:22:22 to have a look at. Yeah, it's definitely an interesting problem. And you've actually preempted one of the questions I had written down, to ask you about concurrency control and how you would solve problems associated with it, like the one you just said. But I look forward to seeing how you tackle that. Oh, actually, on the components: we didn't enumerate what the components were. So maybe you could give us just a 30-second rundown of the different components. All right.
Starting point is 00:22:55 So the components that we have implemented in the system, and for which we also provide alternative implementations, are, first, the data layout component. Data layout describes how you lay out the data that is given by a table schema into a linear address space. The data layout component is actually not really a component in the sense of the other components; it's more like the API that we use to communicate between a particular data layout and the remainder of the system. How it works is that we have kind of a domain-specific language in which you can express how you go about laying out data in a linear address space. For example, you could describe, for a given schema,
Starting point is 00:23:41 how you would implement a row layout by computing padding between the attributes and then padding between the rows. Then you give this back to mutable, and now mutable understands: aha, so this is how I'm going to address the individual attributes and the individual rows of this table. And it's a recursive way of formulating this layout,
Starting point is 00:24:15 so we can also express a PAX layout, or even PAX within PAX. It's still not feature complete. For example, what we would really like to do is support the Arrow format, but for the Arrow format we also need variable-length arrays, and this is something that we don't have yet. But yeah, that's it about this data layout component, and when I later come to another component, we will also see how this is picked up and used in the system.
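[Editor's note: as a flavour of the idea, a recursive layout description could look roughly like this. This is a made-up sketch, not mutable's actual data-layout API.]

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// A made-up, simplified layout tree: a leaf addresses one attribute, an
// inner node repeats its children with a fixed stride (distances in bits,
// so padding is simply part of the stride).
struct Node {
    virtual ~Node() = default;
};
struct Leaf : Node {
    unsigned attribute_id;           // which column of the table's schema
    std::uint64_t offset_in_bits;    // offset within one repetition
};
struct Inner : Node {
    std::uint64_t num_repetitions;   // how often the children repeat
    std::uint64_t stride_in_bits;    // distance between repetitions
    std::vector<std::unique_ptr<Node>> children;
};

// A row layout is one Inner whose children are one Leaf per attribute and
// whose stride is the padded row size; a columnar layout is one Inner per
// column; PAX nests an Inner for the block around per-column Inners.
```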
Starting point is 00:24:45 So then the next component that we have implemented is cardinality estimation. For cardinality estimation, we actually have only one implementation at the moment, which is sum-product networks (SPNs). These are a really interesting way of gathering statistics about tables. They basically work in a fashion where you take a table, and now you have two ways of modifying it. You can either partition it
Starting point is 00:25:06 horizontally, so building clusters of rows, or you can partition it vertically, splitting attributes from other attributes. And so the algorithm for the sum-product network builds a tree of nodes where at each level you either split rows or you split columns. And it's done in such a way that it understands correlated columns and eliminates this correlation through the clustering process. I mean, it's hard to wrap up, but what's really cool about it is that this model learns correlations between columns, and you can exploit that by actually attaching the number of foreign keys
Starting point is 00:26:00 that match a primary key, and then learning on this kind of extended column. Then you can also catch correlations across foreign-key joins, and that's really powerful. And what's cool about the sum-product networks, I mean, this is not our work, right? We just picked it up and said, let's implement it in this database system. What we did is we used Eigen, the vector and matrix library, to do a kind of efficient implementation. And there we spotted a neat trick: we can actually tune these SPNs, with just one configuration, to behave just like regular histograms.
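[Editor's note: a toy sketch of the sum-product network recursion described above, illustrative only and not mutable's implementation.]

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Predicate { /* e.g. a conjunction of per-attribute range filters */ };

// Toy sum-product network: sum nodes average over row clusters, product
// nodes multiply over (nearly) independent column groups, and the leaves
// (omitted here) would hold per-column distributions such as histograms.
struct SpnNode {
    virtual ~SpnNode() = default;
    // Estimated selectivity of the predicate parts this node covers, in [0,1]
    // (a node returns 1.0 for parts over attributes it does not cover).
    virtual double selectivity(const Predicate &p) const = 0;
};

struct SumNode : SpnNode {                    // horizontal split: row clusters
    std::vector<double> weights;              // normalized cluster sizes
    std::vector<std::unique_ptr<SpnNode>> children;
    double selectivity(const Predicate &p) const override {
        double s = 0.0;
        for (std::size_t i = 0; i != children.size(); ++i)
            s += weights[i] * children[i]->selectivity(p);
        return s;
    }
};

struct ProductNode : SpnNode {                // vertical split: column groups
    std::vector<std::unique_ptr<SpnNode>> children;
    double selectivity(const Predicate &p) const override {
        double s = 1.0;
        for (auto &c : children)     // children cover disjoint attribute sets
            s *= c->selectivity(p);  // independence across groups by construction
        return s;
    }
};
```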
Starting point is 00:26:39 And so we just have histograms for free. Nice. Cool. Then next we have cost functions, and here we have two kinds of cost functions. First we have logical, or algebraic, cost functions: we just take a look at the algebraic tree of operators, at how you join the relations together.
Starting point is 00:27:03 And then we compute logically how costly a plan is. This is mostly concerned with the cardinality estimates that you get and the logical type of operation; if you have a join, then you consider the input sizes, things like that. Then later, during the optimization, we also do a physical optimization step, where we have already computed a tree shape, but now we have to decide, for every node in the tree, which physical operator we will use. And here we actually have linear regression models that are automatically trained on the system through benchmarks, such that mutable learns how efficiently certain operations run on the system.
Starting point is 00:27:49 And then it learns these cost functions for the physical operators, and these can be used at physical optimization time. Then we have plan enumeration. That's the name that we gave it; the common term is join ordering. We just extended it to plan generation because there can be other things than joins that need to be considered, but basically it's join ordering.
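[Editor's note: to make the join ordering he is about to list concrete, here is a toy subset dynamic program in the spirit of DPsub; this is an illustration, not mutable's code.]

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Toy DPsub-style join ordering over n relations (n small): the cost of a
// subset S is the cheapest split of S into disjoint non-empty S1 and S2,
// plus the estimated cardinality of S. `card` maps each subset bitmask
// (vector of size 2^n) to its estimated cardinality. Real enumerators also
// restrict splits to connected subgraphs (that is what DPccp is about) and
// record the actual plan, not just its cost.
double cheapest_plan(unsigned n, const std::vector<double> &card) {
    const std::uint64_t FULL = (std::uint64_t{1} << n) - 1;
    std::vector<double> cost(FULL + 1, std::numeric_limits<double>::infinity());
    for (unsigned i = 0; i != n; ++i)
        cost[std::uint64_t{1} << i] = 0.0;     // single relations: no join yet
    for (std::uint64_t S = 1; S <= FULL; ++S) {
        if ((S & (S - 1)) == 0) continue;      // skip singletons
        // enumerate proper, non-empty subsets S1 of S
        for (std::uint64_t S1 = (S - 1) & S; S1 != 0; S1 = (S1 - 1) & S) {
            const std::uint64_t S2 = S & ~S1;  // the complementary split
            const double c = cost[S1] + cost[S2] + card[S];
            if (c < cost[S]) cost[S] = c;
        }
    }
    return cost[FULL];   // cost of the best plan joining all n relations
}
```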
Starting point is 00:28:17 And here we implemented many of the textbook algorithms, like the dynamic programming algorithms DPsize, DPsub, and DPccp, and also more recent, top-down, but still dynamic programming, algorithms: MinCutAGaT (min-cut, advanced generate and test) and MinCutBranch. And then we also have our own algorithm, based on heuristic search; I think we will briefly discuss it at the end of the session. Cool. And then there's the query execution component, and this was actually the origin of mutable, so to say. In the query execution component, we have one minor implementation that is deprecated already. It was a stupid interpreter, and I really want to get rid of the thing. And then we have the compilation of SQL to WebAssembly, and compiling that to x86 using the WebAssembly VM from Google. And that's actually the initial project that started mutable.
Starting point is 00:29:20 These are the components. Fantastic. So just to summarize, we've got query execution, the storage data layout, plan enumeration (or join ordering, depending on your preference of terms), cardinality estimation, and how you go about doing cost functions. Cool, awesome. So let's move on to the next section in your paper, the one where you talk about, I mentioned it a little bit there a second ago, code generation. So what's the problem? What's the vision?
Starting point is 00:29:55 And you mentioned slightly how it's tackled a little bit, but can we dig into it a little bit more, maybe? Of course, let's do it. So, code generation we do in two fashions. First of all, we use it for this abstraction-without-regret part. We want to get rid of the regret, and one way of doing it, kind of the easy way, is using metaprogramming through your compiler. mutable is written in C++, so we can use templates, and templates are just a way of metaprogramming, so we can compose different parts of the system at compilation time of the system.
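[Editor's note: a generic illustration of compile-time composition via templates, invented for this transcript and not taken from mutable.]

```cpp
// The estimator is selected when the *system* is compiled, so calls can be
// inlined: abstraction without the runtime dispatch cost.
struct Subproblem { /* set of relations */ };

struct Histograms { static double estimate(const Subproblem&) { return 1.0; } };
struct Spn        { static double estimate(const Subproblem&) { return 1.0; } };

template<typename Estimator>
struct PlanEnumerator {
    double rank(const Subproblem &P) const { return Estimator::estimate(P); }
};

PlanEnumerator<Histograms> with_histograms;   // composed at compile time
PlanEnumerator<Spn>        with_spn;          // no vtable, fully inlinable
```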
Starting point is 00:30:36 And that's one way to do it. Of course, this doesn't always work, and there are situations where we have to compose code at runtime of the database system, but at compilation time of the query. And this is where the compilation of the query comes in. I mean, this was already explored in the very first relational database system, right? System R already had compilation of queries to machine code. They just abandoned it; I think, if I remember correctly, it was because it wasn't maintainable at the time,
Starting point is 00:31:12 because they didn't have a compilation framework like LLVM. And then Thomas Neumann picked this up. In 2011, I think it was, there was this paper, Efficiently Compiling Efficient Query Plans. And there he uses LLVM, which is like the most famous open-source compiler framework.
Starting point is 00:31:38 And so he uses it in his system HyPer to compile queries. And it works wonders. So now you can compile whole pipelines into tight loops, and you can pass values in registers rather than through memory. And it's a big gain in efficiency. However, LLVM is not a JIT compiler, in my opinion at least. Of course, there is a tool in LLVM to do just-in-time compilation, for sure. But LLVM wasn't designed with just-in-time compilation as its primary goal.
Starting point is 00:32:21 And if you look at the LLVM code, you quickly realize that this design of LLVM is highly dynamic. It's many pointers in there, many dynamic data structures in there, because its main purpose is analysis and transformation of code. And to enable this transformation, you need these dynamic data structures. This is probably not what you would see in a just-in-time compiler. And so the just-in-time compilation with LLVM, you still see that there are high compilation times. And this is where we actually had a discussion at the reading group of the compiler's chair
Starting point is 00:33:00 at my university. And so we had this outrageous idea of just compiling SQL to JavaScript, because JavaScript is a highly dynamic language which is being just-in-time compiled in every browser. I mean, it was a funny idea, but I never gave it much thought until WebAssembly was released. And then I was like, oh yeah, WebAssembly, that's really the game changer here, because now we have a fully typed low-level language, and that could do it. And so I started writing code generation to WebAssembly, and we hacked V8 into mutable.
Starting point is 00:33:46 So we embedded V8 and we had to actually patch it a little bit. You can read it in the paper. And V8 is like, it's a toolbox that works just magic. Like you provided this WebAssembly code and it compiles it to x86 and executes it. But it does so much more because V8 actually ships with two compilers. It's called tiered compilation.
Starting point is 00:34:13 So you have one compiler that does a single pass over the WebAssembly code and compiles it immediately into non-optimized x86. And I mean, it can be slightly optimized, but the critical part is you don't have optimal register allocation because that's really a difficult optimization problem. And so you have this fast compilation tier, which does a single pass linear time over the WebAssembly code, produces x86,
Starting point is 00:34:40 and you can just run it. Low overhead, low delay. Now V8 in the background runs an optimizing compiler called TurboFan, which does optimizing compilation and produces optimized x86. And then, because it's a VM that runs and instruments your code, it can just switch behind the scenes from the slow-running non-optimized code to the optimized fast-running code, and you get this for free. That's really good. And if you look at Umbra, which is a new system from Thomas Neumann, where they actually built their own IR and their own just-in-time compiler, that's a lot of work to get this.
Starting point is 00:35:26 And if you can just take V8 and WebAssembly as an IR, you get basically these features for free without this development effort. That sounds a bit like a free lunch there, really. Taking V8. That's awesome.
Starting point is 00:35:41 So this is where that's going to appear, at EDBT, right? Yes, exactly. And what's the title of that paper, so the listeners can look out for it? It's called A Simplified Architecture for Fast, Adaptive Compilation and Execution of SQL Queries. There you go, listeners, you've been told: go check that paper out when it gets published. And let me add to that: this paper is already outdated. Okay! So, what we did in the meantime is, what you see in the paper, the technique is the same; the way we do this is basically the same,
Starting point is 00:36:21 but what we changed is how we implemented it. Earlier, we had to write the code generation kind of directly, like writing a compiler for SQL queries. Now what we did is we wrote a domain-specific language that's embedded into C++, that looks like C++, has almost the same syntax, similar semantics. You can just write it in between your C++ code, but when it's being executed, it actually generates the WebAssembly code in the background. So this is another way of metaprogramming: it looks like you're writing a regular program, but executing it actually produces the code.
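[Editor's note: a toy, invented sketch of such an embedded code-generating DSL; the actual DSL is described in the EDBT paper and will differ.]

```cpp
#include <string>
#include <vector>

// Toy embedded DSL: operator overloading makes ordinary-looking C++ append
// WebAssembly-style instructions to a buffer instead of computing values.
std::vector<std::string> emitted;
static int next_tmp = 0;

struct Var {
    std::string name;
    explicit Var(std::string n) : name(std::move(n)) {
        emitted.push_back("(local $" + name + " i32)");
    }
};

Var operator+(const Var &a, const Var &b) {
    Var tmp("t" + std::to_string(next_tmp++));
    emitted.push_back("(local.set $" + tmp.name + " (i32.add (local.get $" +
                      a.name + ") (local.get $" + b.name + ")))");
    return tmp;
}

// Looks like a normal computation, but *running* it generates code:
//   Var x("x"), y("y");
//   Var z = x + y;   // appends an i32.add instruction to `emitted`
```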
Starting point is 00:37:07 And so this is one way that we hope we can make code generation much more accessible to the database community. Fascinating stuff. I just wanted to make a real quick point as well. You said that they'd already tried this back in System R in the 70s, and I'm pretty sure that is a rule for all ideas in databases since then: they already tried it in System R. All the good ideas were taken straight away in the 70s. But no, they were probably a bit ahead of their time.
Starting point is 00:37:34 Cool. I guess the next problem that you mentioned in the paper is physical optimization. So what is the problem here? Yeah, so physical optimization, I mean, maybe it's a bit of a vague term. You could think of join ordering, but now consider physical operators. Usually, if you would just do join ordering by the textbook, or how you would do it in a lecture, then you would just say, okay, I have my algebraic join operation, I have some very abstract cost model for this operation, and using that, I now just compute an optimal plan. And doing so is already a complex problem;
Starting point is 00:38:21 depending on the cost function, it's already NP-hard. But now, if you change from this cost function, which is just algebraic, to physical implementations, meaning that you have not just one join operation but various implementations, and just considering binary joins, the complexity already explodes exponentially. And this is a problem, but at the same time, it's important to consider these physical implementations,
Starting point is 00:38:51 because you can exploit some things. For example, if you know that a stream of data after some operation is sorted, it makes a merge join operation cheaper, right? So that should be considered during the optimization process. So, how do we tackle this very hard optimization problem in mutable?
Starting point is 00:39:12 How we tackle it is that we first do a purely algebraic optimization step. So we take the query, we represent it as a graph, and then we compute a join order, so to say. This is what we call plan enumeration, and it is done using an algebraic cost model. And what we compute is a partial order. It's a tree. And the tree
Starting point is 00:39:39 dictates a partial order of the joins. Now, with that tree, we go into the physical optimization step, and in this step we only determine two things. For the nodes in the tree, we determine which physical implementation to use. And the children of each node can still be permuted, depending on the operation. If you have a commutative operation like an inner join, then you can just permute them, right? By permuting them, you decide which one becomes the build and which one becomes the probe relation for a hash join, for example. So this is also not determined by the algebraic optimization, but now we can determine an order using physical optimization, because, for example, for a hash join you usually want to build the hash table on the smaller relation. And what's actually quite funny is
Starting point is 00:40:34 that after we have determined this tree structure during the algebraic optimization, and we are now left with the physical optimization step, the physical optimization step can now be done in linear time, actually. Because you could say, okay, but there are so many dimensions to this optimization problem. Yeah, but if you think about them, they're all constant in size. The only thing that remains dynamic, depending on the problem, is the height of the tree. And so this becomes a linear-time optimization problem, actually.
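[Editor's note: an invented sketch of why that pass is linear: one bottom-up walk over the already-fixed join tree, with a constant number of alternatives per node. Names are hypothetical, not mutable's.]

```cpp
#include <memory>

enum class PhysOp { HashJoin, SortMergeJoin, NestedLoops };

struct JoinNode {
    std::unique_ptr<JoinNode> left, right;   // null children = base relation
    double cardinality = 0.0;                // estimate from the first phase
    PhysOp chosen = PhysOp::HashJoin;
    bool   build_left = true;                // which side feeds the hash table
};

// One visit per node and constant work per node, so the whole pass is
// linear in the size of the fixed tree.
void optimize_physical(JoinNode &n) {
    if (n.left)  optimize_physical(*n.left);
    if (n.right) optimize_physical(*n.right);
    if (n.left && n.right) {
        // e.g. pick a hash join and build on the smaller input; a real pass
        // would compare (learned) cost models of all physical operators and
        // exploit properties such as sortedness for a sort-merge join.
        n.chosen = PhysOp::HashJoin;
        n.build_left = n.left->cardinality <= n.right->cardinality;
    }
}
```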
Starting point is 00:41:10 So, I mean, I guess building on this: is this implemented in mutable at the moment, and all this functionality exists in there at the minute? Or are we still in vision sort of territory here? At the moment, not on the main
Starting point is 00:41:28 branch. Okay, right. Yeah, this physical optimization is currently a master's thesis of one of my master's students. Parts are there, parts are missing. Okay, so very much coming soon then, but one to look forward to. Yeah. Great stuff. So I guess we're moving on now to the next way you've divided this up in your paper. You've got a section about the way that you've tackled physical data layout independence. Now, what's the problem here? And as we've done with a few previous sections, what's the vision and the approach? And maybe an example as well.
Starting point is 00:42:11 Yeah, so physical data independence is one of the key selling points or key optimization techniques in databases because the user describes the schema, but how you actually implement it, how you lay it out on memory or on disk, this can still be decided by the system. And this creates this independence between this implementation and the description of the schema by the user. So now what you can do about this, you can organize data in different ways on your storage, be it main memory or on disk. And I mean, we have seen this in the past
Starting point is 00:42:58 because initially we had these relational database systems. Naturally, I would say they organized their data in rows of tuples. Then came the column stores that just did it the opposite way, using a columnar layout. And this layout is actually much better consumable by your hardware because now we have these SIMD extensions
Starting point is 00:43:22 which can just batch-process homogeneous chunks of data, and a columnar layout provides exactly these chunks. And then we have hybrids, like PAX, for example. And depending on what operation you're doing, one layout may be more suitable, the other less suitable. So, for example, transactional processing may benefit from a row layout, whereas analytical processing may benefit from a columnar layout. And we don't want to impose this design decision on the user of mutable. So we thought about how we can just decouple
Starting point is 00:44:00 this from the system and have it like a tuning knob. And it's actually quite difficult, because if you implement a system with a particular layout, you can just hard-code your whole system for that particular layout. Now, if you want to make this configurable, then loading the data suddenly becomes slow, because at runtime of the database system, you now have to consider: what is my configuration, and how do I load the data from my storage? And this is again an abstraction that we can achieve, but it comes with a regret part. And the solution that we found is that we provide a way of describing the layout. This is the data layout abstraction. You describe how you would lay out your data in a linear address space.
Starting point is 00:44:53 Now, this description is picked up by the query execution engine, and we compile it directly into optimized code for loading the data. So we don't have one particular implementation for loading PAX, and one particular implementation for loading a row layout, and one for a column layout. Instead, we have, I mean, it's a big chunk of code, yeah, but it processes this description, and it generates exactly the code that you would probably even handwrite to load this data layout.
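[Editor's note: the effect is roughly as if, for each configured layout, access code like the following had been written by hand; illustrative only, simplified to byte-aligned, fixed-size 32-bit attributes.]

```cpp
#include <cstddef>
#include <cstdint>

// Row layout: the attributes of one tuple are adjacent in memory.
inline std::int32_t load_row(const unsigned char *base, std::size_t row,
                             std::size_t row_stride, std::size_t attr_offset) {
    return *reinterpret_cast<const std::int32_t*>(base + row * row_stride + attr_offset);
}

// Columnar layout: each attribute lives in its own dense array.
inline std::int32_t load_col(const unsigned char *column_base, std::size_t row) {
    return reinterpret_cast<const std::int32_t*>(column_base)[row];
}
```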
Starting point is 00:45:30 Cool, yeah. So I just wanted to touch on that. You said that there's like a tuning knob. This kind of comes from, I was reading a paper a while back where they had like a hybrid layout in their system: some data was stored in row format, some was stored in column format. It was like an HTAP system that dynamically configured itself based on what the workload was, and changed the layout based on that. Is it possible to do that, to define your own custom layout where stuff may dynamically change on the fly as well? Or is it very much: you
Starting point is 00:46:01 configure it one way, you're going to say, I want a columnar layout, and then you're stuck with that; well, not stuck with it, but it has to be fixed for the duration of the experiments or whatever. Or could you be dynamic, maybe? So, at the moment it's static. You make this decision once for a table, and this table is then locked to this layout. Cool. We have thought of this migration from one layout to another. And what you actually can do at the moment is, if you have a table with a particular layout, you can just augment that table with a new column. So that would be possible.
Starting point is 00:46:39 But just transforming it from one layout into another, that's not there yet. And what's also not supported, and I don't know whether we really want that, but perhaps someone wants to try that out, is like having a change in the physical layout at some point, like, I don't know, the first million records on row layout, and then you'd say, ah, but now from now on, I would like them to be in columnar layout.
Starting point is 00:47:05 I guess it's doable. Because if you know that the first 1 million records, this is a fixed amount, then you can actually encode that in our data layout and say, okay, there's a block of 1 million records which is in row layout, and afterwards there comes a block in columnar layout.
Starting point is 00:47:20 Yeah, that's doable. Yeah, it's up for debate. Maybe it has no value or anything. I mean, yeah, it depends. But that's cool, that's cool. So I guess we're on to the kind of the last section that you wrote about, which is automated evaluation. What's the problem here, and again, the vision, the approach, and an example? Okay, so I think on the vision part, there are kind of two things here. On the problem part, actually, there are two problems, I would say. So first of all, when we think back to this introduction, with Bobby Tables having to do research and then evaluating the idea, Bobby has to evaluate related works as well.
Starting point is 00:48:06 And as I've said, this can be a pain. So what we wanted to do is right from the beginning, compare ourselves always to the major systems that are out there. And I mean, we just, some of them we do. So we compare ourselves to Postgres, Hyper and DuckDB. And we want to automate this process of comparing our system to the others such that we don't have to do this by hand every time that we change something. The other thing, the other problem that I see is that if you build a system, you have to be very careful about regression.
Starting point is 00:48:48 And when I say regression, you might initially think of testing, but there's also performance. And performance regression is also not to be ignored. So we have to be aware that changes to the system might deteriorate the performance, and we should do something about it. Because in the long run, if we make many small changes that deteriorate performance, in the end we might end up with a system
Starting point is 00:49:20 that is of no use. So what we did, actually, is we stole this idea, or we copied the idea, from DuckDB, because DuckDB has this continuous benchmarking website where they show their performance on some benchmarks over the past days, and that's a great idea. So we also wanted to do that, and we wanted to go one step further and integrate into this process not only our system but other systems as well. So what we have done is we have built a script, basically, that runs a set of benchmarks that are written descriptively in YAML files.
Starting point is 00:50:01 And basically, it's just a YAML file telling you which data to load into the database, which queries to run, which properties to extract from the system. What we now have done is we have implemented, we call it database connectors, like you could know them from your, I don't know, preferred programming language having its own database connector.
Starting point is 00:50:23 But it's not really the same thing. So these connectors that we implement, they are meant to run a query in the benchmark setting and extract benchmark information from that system. So for example, we have one connector for Postgres, which then runs a query, does some timing, and reports back this timing information. And the same we have for Hyper and DuckDB and our system.
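[Editor's note: conceptually, such a connector is a small interface. This is a made-up sketch, not the real benchmark framework's API.]

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical connector interface, invented for illustration: run one
// benchmark query against one system and report the measured timings.
struct Connector {
    virtual ~Connector() = default;
    virtual std::vector<std::chrono::milliseconds>
    run(const std::string &setup, const std::string &query,
        unsigned repetitions) = 0;
};
// One implementation per system (Postgres, HyPer, DuckDB, mutable); the
// driver script that reads the YAML files treats them all alike.
```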
Starting point is 00:50:51 And now this script that runs these benchmarks picks up these descriptive YAML files, runs them on all these systems, and produces output that we just insert into a database. And so we have a database of all our measurements since we started doing measurements. And then what we did is we built a REST API on top of the database, such that we can query this information from the web and also immediately get some aggregated data. And then we built a web application in Flutter to visualize this benchmarking data.
Starting point is 00:51:29 And perhaps we could just have used Grafana. I don't know. Anyways, we did it. We built this web application in Flutter and you can now visualize the recent experiments or just of a particular date. You can just go to a particular date. You see all the experiments that were run, all the results that were gathered of Mutable and the other systems.
Starting point is 00:51:51 You can go to the continuous benchmarks tab, and you can see, over the past months or years, how the performance has evolved, not only of our system but also of the other systems. And that's really the neat feature here: once you have that data, you can analyze it. So what we do at the moment is we take a look at our performance over the past two weeks, and we compute the standard deviation of the performance. Then we apply some threshold, some multiplier, to that standard deviation. And whenever the performance now peaks out of this range,
Starting point is 00:52:31 we say there's a performance anomaly. And we issue a warning on the webpage, as well as send mails to the developers that have committed to the system in the range of time just before the anomaly. So we can actually find the people responsible for this performance anomaly. And you could blame them.
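[Editor's note: the check itself is simple; a minimal sketch assuming a fixed window and multiplier, which are configuration choices in the real system.]

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Flag `latest` as an anomaly if it leaves the band mean +/- k * stddev
// computed over the recent window of measurements.
bool is_performance_anomaly(const std::vector<double> &window,
                            double latest, double k = 3.0) {
    if (window.size() < 2) return false;
    const double mean =
        std::accumulate(window.begin(), window.end(), 0.0) / window.size();
    double var = 0.0;
    for (double x : window) var += (x - mean) * (x - mean);
    const double stddev = std::sqrt(var / (window.size() - 1));
    return std::abs(latest - mean) > k * stddev;
}
```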
Starting point is 00:53:03 I was going to say, it's like an early warning system to know you've got a problem on the horizon, that you've pushed some dodgy code. But again, I guess if it's a performance spike and it goes up, I mean, maybe it's a good thing, right? As long as it stays up there, right? Yes, we've thought of that already. So on the website, when you have this performance anomaly, you can actually sign in to our website using our GitLab, for which
Starting point is 00:53:20 only we now at the moment have access. But anyways, you can just sign in through GitLab. And when you do this, you can actually configure or respond to such a performance anomaly. And you can say it's either expected. For example, you did some improvement, you ran it locally and saw,
Starting point is 00:53:36 ah, it increased performance. So now this performance anomaly is expected because I improved the code. You could say it's a false positive, for example, because someone at midnight was doing updates on this benchmark server. And then you can say it's a confirmed performance anomaly.
Starting point is 00:53:52 And when you just click that button on the website, it immediately creates an issue in our GitLab instance and assigns all the people that have changed the code in that time to this issue, such that these people are immediately made aware of it. And then, when you close the issue in GitLab, it also removes the performance anomaly from the web page. That's really cool. That's a really nice system there. Yeah, I mean, one of those where you'd be panicking when you get a notification, like, oh wow, have I done something good or something bad? But no, that's fantastic. So how easy is it to add either new systems to this framework or new benchmarks to this framework?
Starting point is 00:54:34 And what's the sort of coverage in terms of benchmarks that are in there at the minute? Are we talking all the classic ones, like YCSB, TPC-C, TPC-H, etc.? Or is it just kind of handcrafted benchmarks at the moment? It's a mix. It started with handcrafted benchmarks, particularly because when we started doing the code generation part, so compiling queries, we had to do benchmarking of individual operators. Then I did research on join ordering, so I wrote benchmarks for that.
Starting point is 00:55:06 Then we integrated TPC-H or at least part of it into the system. But I must say that there are many parts of SQL that we don't support. So we have to take a look at the TPC-H queries and see which parts do we actually support. Some more features are being added over time but as this is a research project and it's only driven by research
Starting point is 00:55:31 But as this is a research project, and it's only driven by research, there's no one really taking care of that, to be honest, you know. But we have TPC-H in there, and you can see the performance on TPC-H; no one has cared to add more benchmarks, to be honest. Cool, cool. Yeah, so that's great. I mean, I guess my next kind of question is: we've touched on all the good stuff, but what are the limitations of mutable, and what areas of it are currently, I know it's sort of a developing project and it's not feature complete.
Starting point is 00:56:32 For example, concurrency control simply isn't there. And the reason why that is, is because the development has been driven by the research, and we just haven't been researching on all parts of a system yet. But to make this more concrete, so when you consider, for example, concurrency control, it's not implemented, but how would you go about implementing it? And that's really a tough question because one major limitation of Mutable is actually its design. Because we said we want to do everything as components and these components have api's this means that any extension or algorithm that you want to add to the system must be squeezed
Starting point is 00:57:15 into this API but I cannot guarantee at the current state that this API is rock solid and will remain the same for the next decade. The problem really is that we might have to adapt. So I'm pretty certain we will refactor the API in the future because at some point we or someone else wants to use the system, wants to implement something and says, with the current API, it's simply not possible. It's not generic enough. So there's certainly a limitation that we are aware of,
Starting point is 00:57:48 but we have a happy culture of refactoring. So that's not a problem, I would say. So if we figure out that there's a problem in the API, we will rework it. On the other side, when it comes to limitations, it's a tough objective to split a complex system like a database system into small, independent, exchangeable components. And during this step of doing so, one might increase the development effort of the system. Because now if we have to do refactoring of the API, for example, this would also mean
Starting point is 00:58:28 that we have to force updating all the components that implement this API because now it's deprecated, the old one. So at this point, it might increase the maintenance cost. On the other side, one could perhaps argue that it may also reduce the maintenance cost because through this componentization,
Starting point is 00:58:50 the individual part of the system becomes simpler to understand and so they become more maintainable. But that's something that I cannot tell yet. The future has to show. Cool. Going
Starting point is 00:59:04 off that, obviously it's it's not like not all the features you want it's not like kind of the the final sort of product yet it's very still in a prototype state but is it usable from the point of view of um i can i can develop some new joint order algorithm and i can use your system tomorrow in my in my in my work to generate results for me like how i guess the question that i'm gonna get around to is how usable is it at the moment um the honest question is depends on what you're after if you're after something that we did research on i would say it's pretty usable and particularly fitted for research.
Starting point is 00:59:46 off that, obviously it's not like it has all the features you want; it's not the final sort of product yet, it's still very much in a prototype state. But is it usable from the point of view of: I can develop some new join order algorithm, and I can use your system tomorrow in my work to generate results for me? I guess the question that I'm going to get around to is: how usable is it at the moment? The honest answer is, it depends on what you're after. If you're after something that we did research on, I would say it's pretty usable and particularly fitted for research.
Starting point is 01:00:27 obviously I'm imagining the code is publicly available right, we can go and find that I guess in the GitLab repo and we can link to that in the show notes and stuff Yes, so we made it just recently open sourced Mutable because we wanted it to
Starting point is 01:00:43 be open source for the CIDR conference. So, yeah, it's open-source and it's under the AGPL license. Cool. So have you had any feedback yet, any initial feedback from anybody outside of the creators? Have you had any? Yeah. Yeah, that's good. That's good. Feedback's good. Yeah, one guy came and he said, benchmarks aren't working. And I talked to him and it turns out like when you want to run the benchmarks,
Starting point is 01:01:12 you first have to get the benchmark data and there's a script that pulls the data, but it's nowhere written that you have to do that. Ah, okay. So it's just a documentation problem. Yes, fine. That's fine. That's no big deal. But that's good, right?
Starting point is 01:01:24 That's good. Getting feedback is always always good so this kind of leads on to my my next my next question in that obviously this this this podcast is obviously aimed at a whole why anyone who's basically interested in computer science right but we're like i always like to try and bring the research around to how people who work in industry can maybe leverage whatever it is we're talking about as well so is there a way as a software developer maybe someone working on databases for some database company or whatever can can leverage to kind of mutable in their day-to-day sort of workflow or do you think it's primarily obviously it was designed for academia so this it kind of makes sense for it just to be primarily used used in that domain but does it have any out um in like use uh
Starting point is 01:02:05 use outside of that that domain perhaps i would say for for a particular use case so as i've said my main interest in research was query processing and so these parts were more and more evolved so if you are now after weird complex queries which you need to run fast then mutable might be your thing because this thing is a heavy number cruncher you know but apart from that if you're looking for features or transactions no cool cool um this this next question is i imagine you're gonna have some interesting things to say on this one so what what has been probably i asked this to everyone as well it's a pretty standard question so what is the the most interesting and maybe perhaps like unexpected lesson that you
Starting point is 01:02:54 that you learned while working on mutable? The first one is systems complexity; that was something that I didn't expect. On the one side, systems complexity is a beautiful thing, but it's also very frightening. And it's frightening because one can get the impression of never being done, never being finished. There's always more to do; there are always issues around. It's really demanding, I would say. One really has to keep the spirits up to keep working on the project and not be,
Starting point is 01:03:34 how would you say? Demoralized. Take it one day at a time, right? Yeah, exactly. On the other side, it's very pleasing to see when the system components start interacting and cooperating and forming the whole system. And when that works, it's really enjoyable. Then, what I have learned during mutable is that third parties can be a pain in the ass.
Starting point is 01:04:01 And mostly for the reason of automating the build process. I mean, I wanted to build mutable such that it can also be easily deployed, but then again, we have to use third parties, like, for example, V8 from Google. And you have to build that thing, and if you build it for the first time, your computer will be consumed for one hour; it will be compiling for one hour straight. And that itself, okay, you have to live with. But getting this build process set up the right way, that's really tough. Because then you look at another project like V8, and it has these millions of tuning knobs
Starting point is 01:04:45 for the compilation process. And you have to configure it in just the right way for how you want to use that project. Then an update comes in and suddenly things break. And that's really annoying. Yeah, yeah. And I've also learned that CMake can be so useful, but when it's
Starting point is 01:05:07 misconfigured, particularly by a third party, it basically renders the third party unusable in your project. That is what I have seen. And also, in conjunction with that, dependency management
Starting point is 01:05:22 in such a project with third parties is also annoying. I mean, we have mighty tools to do that, like git submodules, which we don't use, or CMake external projects, which is the way that we go.
Starting point is 01:05:40 But still, then some things change, and it works on one machine. You commit it. It works on one build bot. It works. And suddenly, on someone else's machine, it's not working. And then you have to think about: do you have to roll back the third party?
Starting point is 01:05:54 But then you're missing out on a feature or something. Yeah, that was really annoying. Yeah, yeah, I can imagine. One thing that I have learned, though, which I'm very happy about, is building a team. I had many people contributing to mutable as research assistants or bachelor's or master's students, and it was really a joy to work with those people on the project. And I hope we can continue working on it. And so what I think I started learning,
Starting point is 01:06:27 and I'm not done yet, is identifying people's strengths and their interests, and then fostering them, coordinating their work, and sparking their intrinsic motivation to work on this project. That's a skill that I didn't have before, and I wouldn't say I have it yet, but I'm on the way to learning it. Work in progress. But no, that's great. I mean, this is a very different answer to anything I've had before for that question: one of those soft skills about how to build a team, how to correctly allocate people to the right tasks, and keep everyone motivated. Especially when, as you were saying, systems development can be quite, what's the word?
Starting point is 01:07:03 Demoralizing, as we said earlier, right? So trying to keep everyone's spirits high and stuff. Yeah, no, that's a really nice answer to that question. Kind of going off that, then: another question that I ask quite frequently is about how progress in research is non-linear, right? There are ups and downs. So from the initial conception of the idea for mutable to the CIDR paper that we've been speaking about today, what were the war stories along the way, and the things that failed, that maybe other people could benefit from knowing about? Yeah, I have a funny answer to that: there are none. Because we actually had the
Starting point is 01:07:48 submission for the SIGMOD paper on July 15th. Okay. And the CIDR deadline was just one month ahead, and my boss was like, huh, there's CIDR, let's write a paper. And we were like, one month? Oh, that's going to be painful. And so this really was just a sprint, just writing down everything that we had done so far on mutable. There wasn't really any time for problems. Okay, you just power through it, yeah; there can't be problems. Just get it down, get it written, and fingers crossed. Okay, cool, cool. It was only possible, though, because there was this prior work that's now published in EDBT and SIGMOD. Only because of that were we able to rush this out in one month.
Starting point is 01:08:31 Okay, yeah, that kind of makes sense. You'd done most of the work already anyway, and it was just a case of bringing it all together and writing it up. Cool. Just one thing: the major obstacle here, though, was the six-page CIDR limit. Is it six pages? I'm sure some papers are longer than six.
Starting point is 01:08:49 The mechanism there is funny. For submission, it's six pages. When the paper is accepted, you're unlimited. Oh, okay. Right. I guess that makes the reviewer's life a little bit easier. Or does
Starting point is 01:09:04 it get reviewed again after you write an extended version? No, it's not reviewed again. So basically my boss was like, let's just paste in the code of mutable. So where do you
Starting point is 01:09:19 go next from here, then? I mean, obviously we've spoken about this across the course of the podcast, and there are a million and one directions for you to take mutable in. What are the initial next steps? Yeah, so if I were offered a postdoc position where I can continue my work on mutable, I would say I'd take it, even though I hadn't planned a career in academia. But I don't know, the project has somehow grown close to my heart, and so have the people that I've been working with. So I would take the opportunity. And actually, we have started thinking about founding a company. But that's really hard, because it's such an academic project with a research goal.
Starting point is 01:10:06 Who are the customers? That's the question there, right? Yeah, exactly. So what's the selling point? Yeah, so that really was a frustrating thing to realize: you have to come up with an idea of how to make money with it, and we don't have one yet. So that's why I think proceeding with development in academia first, and taking this time to think about how to turn it into a company, would be one way. But regarding research, what I'm currently looking
Starting point is 01:10:42 at is, on the one side, this physical optimization stuff. So what can we do during a physical optimization step? And I think we have some quite cool ideas here. Because we do this query compilation, and we have patterns, kind of, that we apply: we know how to compile a hash join, we know how to compile a selection, and whatnot. So what we are currently looking at is how we can leverage more information that we have about the data to steer this compilation process. And the idea came originally from
Starting point is 01:11:21 looking at compilers. If you look at a compiler like Clang or GCC, you have to realize that these are general-purpose compilers for any kind of program. And what's really the most important distinction between a compiler and a database system, to me, is that the general-purpose compiler only gets the program as an input,
Starting point is 01:11:44 whereas the database system gets the query as an input and will compile it, but it already knows the data that the program will be run on. So you already know the inputs to the program that you're compiling. And if you know that, then you can perhaps use that information to steer the compilation process. So what information do we have about the data? As I've said before, we have these fancy sum-product networks, where we can get statistics about the data in the tables. What we can do now is, for example, if we get a selection WHERE x < 42, we can actually look at the model and see: aha, this predicate is highly selective.
Starting point is 01:12:28 It's like 0.01% of tuples will qualify. Now it makes sense to compile this into a conditional branch. But then again, there might be a selection predicate WHERE x < 42 with a selectivity of around 50 percent, and there a conditional branch is actually not a good choice, because the branch misprediction penalty might be very high. And so this can now flow into the physical optimization process. Having trained these regression models on the particular hardware, we know how
Starting point is 01:13:02 costly a selection using a conditional branch is, and how costly a selection using predication is. And then we can steer this compilation process to compile just the right code for the particular machine that we're running on.
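To make that trade-off concrete, here is a minimal sketch of the two code shapes such a compiler might choose between for a selection like WHERE x < 42. This is illustrative C++, not mutable's actual generated code; the cost model described above would pick between the two variants based on the estimated selectivity.

```cpp
#include <cstddef>
#include <cstdint>

// Variant 1: conditional branch. Cheap when the predicate is highly
// selective (e.g. 0.01% of tuples qualify), because the branch
// predictor is almost always right; expensive near 50% selectivity,
// where it mispredicts roughly every other tuple.
std::size_t select_branching(const int32_t *x, std::size_t n, int32_t *out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] < 42)              // data-dependent branch
            out[count++] = x[i];
    }
    return count;
}

// Variant 2: predication. Every tuple is written unconditionally and
// the output cursor advances by the predicate's truth value (0 or 1).
// Pays a small constant cost per tuple, but never mispredicts, which
// makes it the better choice for unselective predicates.
std::size_t select_predicated(const int32_t *x, std::size_t n, int32_t *out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[count] = x[i];          // unconditional store
        count += (x[i] < 42);       // branch-free increment
    }
    return count;
}
```

Both functions compute the same result; only their performance profiles differ, which is exactly why a selectivity estimate is valuable at compile time.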
Starting point is 01:13:38 I mean, that is a really interesting direction. Yeah. So my next question is: we've touched a little bit on your other research, but if you want to elaborate on any of it, please go ahead now. Maybe touch on the upcoming EDBT paper, and I'll obviously mention the SIGMOD one that was last year. Or was it this year? Is it SIGMOD 2022? It's this year, 2023. I thought it was 2022. Ah, sorry. No, no. We submitted in 2022; the publication cycle is quite long for SIGMOD. Yes. Yeah. Okay. So for the EDBT paper, which is about this compilation of SQL queries to WebAssembly, I think I have stated most of the key takeaway messages already. So if any of your listeners are interested in how we actually compile to WebAssembly, have a look at the paper, but bear in mind that the way we implemented it in mutable has already been overhauled, in the sense that we don't write
Starting point is 01:14:10 the code generation directly anymore; we have this DSL in between. It looks like you're writing C code, but actually you're doing code generation.
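As a rough illustration of how such a DSL can work, here is a hypothetical sketch using operator overloading; this is not mutable's actual API. The overloaded operators build code fragments instead of computing values, so writing an ordinary-looking expression is in fact writing the code generator.

```cpp
#include <iostream>
#include <string>

// Hypothetical expression DSL: arithmetic on Expr values constructs
// a code fragment instead of evaluating anything.
struct Expr {
    std::string code;   // generated code for this (sub)expression
};

Expr operator+(const Expr &l, const Expr &r) {
    return Expr{"(" + l.code + " + " + r.code + ")"};
}
Expr operator<(const Expr &l, const Expr &r) {
    return Expr{"(" + l.code + " < " + r.code + ")"};
}
Expr lit(int v)                { return Expr{std::to_string(v)}; }
Expr var(const std::string &n) { return Expr{n}; }

int main() {
    // Reads like plain C...
    Expr predicate = var("x") + lit(1) < lit(42);
    // ...but actually built a code fragment at runtime.
    std::cout << predicate.code << '\n';   // prints ((x + 1) < 42)
}
```

In mutable's case the emitted target is WebAssembly rather than a string, but the principle, expressing code generation in code that reads like the code being generated, is the same.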
Starting point is 01:14:33 Makes it a whole lot more accessible. Yes. And then the other research paper is the SIGMOD paper, and we haven't spoken of that yet, I think. So here it's a whole different topic: it's about join ordering. And actually, the interesting part is how this sparked. You have to know that our database chair is co-located on the same floor as the foundations of artificial intelligence chair, and their domain of expertise is automated planning. And so at some point we were sitting in the kitchen, and I had a chat with a colleague from that chair, and I was describing this problem of join ordering to them.
Starting point is 01:15:06 And then he was thinking about this problem from his point of view, as a planning task. And it turned out that it can be seen as a planning task, of course, but it's not one that you would usually find in their domain, because, from the perspective of planning, it has a particular property: the state space changes with every action. Anyway, you could argue that it fits into this lifted planning domain, which, after a quick search backwards, I think only started around 2019, so it's recent research. So what we then did is we decided that we perhaps should collaborate on this problem, and we had a bachelor student looking for a thesis. So we said: how about you implement join ordering as a planning task in PDDL,
Starting point is 01:16:03 which is the planning domain definition language that they use, and then you just feed it into a solver; it's named Fast Downward. And then we just have a look at what comes out. So how well does a regular, general-purpose automated planning solver work on our problem? And the results were quite disappointing in several ways. Because, on the one side, the solver has to translate from this description of the problem into an intermediate representation, thereby already doing an exponential amount of work. So that's already not a good idea. But then, once that was done and you could run the solver on this description, well, the description is exponentially large in size, so there's not much to gain.
Starting point is 01:16:47 But what we saw is that the planning heuristics provided in this off-the-shelf planner are general purpose, and they don't really help on our problem; they're not informative. And there's another problem, an implementation detail: this solver is implemented for all kinds of planning tasks, so it's not a domain-specific solution, and therefore it's slow; there's some overhead in it. And so after the bachelor thesis, I decided to implement my own software in mutable that is domain-specific for our problem. And basically I implemented many variants of the A* search algorithm.
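For readers unfamiliar with framing join ordering as a search problem, here is a compact sketch. It is hypothetical C++ under simplifying assumptions, restricted to left-deep plans for brevity, and not mutable's implementation: a state is the set of relations already joined, an action adds one more relation, and A* expands states in order of cost-so-far plus an admissible estimate of the remaining cost.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <unordered_map>
#include <vector>

// Toy A* for (left-deep) join ordering. A state is a bitmask of the
// relations already joined; the search runs from the empty plan to
// the plan containing all relations.
using State = std::uint32_t;

struct Node { State s; double f; };                       // f = g + h
struct ByF  { bool operator()(const Node &a, const Node &b) const { return a.f > b.f; } };

double cheapest_plan_cost(
    unsigned n_relations,
    const std::function<double(State, unsigned)> &step_cost, // cost of joining relation r into state s
    const std::function<double(State)> &heuristic)           // admissible estimate of remaining cost
{
    const State goal = (State{1} << n_relations) - 1;
    std::unordered_map<State, double> g{{State{0}, 0.0}};     // best known cost per state
    std::priority_queue<Node, std::vector<Node>, ByF> open;
    open.push({State{0}, heuristic(State{0})});
    while (!open.empty()) {
        Node cur = open.top(); open.pop();
        if (cur.s == goal) return g[cur.s];                   // first goal pop is optimal
        for (unsigned r = 0; r < n_relations; ++r) {
            if (cur.s & (State{1} << r)) continue;            // relation already joined
            State next = cur.s | (State{1} << r);
            double g_next = g[cur.s] + step_cost(cur.s, r);
            auto it = g.find(next);
            if (it == g.end() || g_next < it->second) {
                g[next] = g_next;
                open.push({next, g_next + heuristic(next)});  // stale heap entries are re-expanded, harmless here
            }
        }
    }
    return std::numeric_limits<double>::infinity();
}
```

With a heuristic that always returns zero, this degenerates into Dijkstra's algorithm; informative, domain-specific heuristics are what allow the search to skip large parts of the plan space.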
Starting point is 01:17:32 And then I did some memory and compute optimizations for the representation of this problem, because now we can do that; we are domain-specific now. And I also collected some statistics on this planning process, to understand what's actually going on in the search, because we didn't really have that much insight in the beginning. And then we started just doing experiments and writing this up. And then at some point I spoke again to my colleague
Starting point is 01:18:01 from the artificial intelligence group, whose name is Daniel Gnad, and I'm very thankful for his advice during this time. So I discussed this problem with him, and then I spotted an optimization opportunity that already has a name in the planning community: it's called partial order reduction. And basically what it means is that in a search problem there can be several paths that go from the same start to the same goal, and these paths are actually identical up to a permutation of the actions. And you don't want to consider them all, because the duplicates are of no value. So consider, for example, a bushy join plan where you have A join B on the one side and C join D on the other side;
Starting point is 01:18:45 the order in which you do those doesn't really matter, right? It doesn't matter. And so you don't want to consider both orders. And then we figured out that there's optimization potential there, and we implemented that as well. And it turns out that now
Starting point is 01:18:59 the heuristic search works wonders. It's really fast, and particularly fast on tough optimization problems. Like, if you have dense queries, say 20 relations in a clique shape or something similar to a clique, these are really tough problems, and heuristic search gets through them much faster. We had experiments where it's a thousand times faster than the state of the art. And this number comes from the fact that we're now asymptotically faster at solving this problem. And the crucial part is to understand this.
Starting point is 01:19:42 When you look at these textbook algorithms, like dynamic programming and top-down dynamic programming, they are all exhaustive: they have to enumerate all the plans that exist and pick the best. Now, what you can do with heuristic search is prune some of them away and still prove that you remain optimal. And that's really a game changer, I would say.
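To make the pruning idea tangible, here is a toy sketch of partial order reduction in a join-order search; this is illustrative C++ under simplifying assumptions, not mutable's implementation. An action merges two subplans, two actions that touch disjoint subplans commute, and the search therefore only expands one canonical interleaving of them.

```cpp
#include <utility>
#include <vector>

// An action joins the two subplans identified by these indices.
using Action = std::pair<int, int>;

// Two join actions are independent if they touch disjoint subplans;
// applying them in either order reaches the same state, e.g.
// (A JOIN B, then C JOIN D) and (C JOIN D, then A JOIN B).
bool independent(const Action &a, const Action &b) {
    return a.first != b.first && a.first != b.second &&
           a.second != b.first && a.second != b.second;
}

// Successor generation with partial order reduction: after taking
// `last`, skip any applicable action that is independent of `last`
// but ordered before it. The skipped interleaving is still reachable
// via the canonical (sorted) permutation, so optimality is preserved
// while whole families of duplicate paths disappear.
std::vector<Action> successors(const std::vector<Action> &applicable,
                               const Action &last) {
    std::vector<Action> result;
    for (const Action &a : applicable) {
        if (independent(a, last) && a < last)
            continue;   // pruned: a permutation-equivalent path exists
        result.push_back(a);
    }
    return result;
}
```

The pruning never discards the last remaining path to a state; it only collapses permutation-equivalent duplicates, which is why the search stays optimal.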
Starting point is 01:20:03 Yeah, wow, it sounds amazing. The number itself, a thousand x! Yeah, but again, let me put this into perspective: it's on the tough nuts, you know. Like, if you really have a hard case, where you have many relations and many edges between those relations, then yeah, it excels. But if you just have, like, a chain of relations, then your standard dynamic programming will be better. Cool. But I mean, obviously it has some utility, right? If you can integrate that into a system that can somehow tell whether to use this approach or that approach based on what the query is like, then it feels like a win, a big win as well. Cool. And so, kind of going off
Starting point is 01:20:41 how your research came about and whatnot: how do you go about generating these ideas? I mean, a lot of it seems to be just being in an environment where you're having conversations with people from other domains, maybe, and that spark kind of happens. But do you have a systematic way of generating ideas and then selecting the ones to work on? Honestly, I don't think I have much advice to give here. I don't consider myself a visionary person, to be honest; I think of myself more as a problem solver, an engineer. And so, as you've put it, being in a situation where you're confronted with a problem, that sparks my interest and gets me thinking. Also, I should say that, considering how slowly I progressed during my PhD, I was really slow.
Starting point is 01:21:34 I should add the disclaimer that what I'm about to say is to be taken with a grain of salt. It's my personal experience, and please don't take it as general advice. But if someone listening is an ongoing PhD student, then perhaps what I could say is: do what you do best. Play to your strengths and don't blindly follow the next trend. You will excel at what you do best. And if you don't know your strengths yet, because you just started
Starting point is 01:22:07 your PhD, then follow your personal interests and don't let your PhD advisor get in your way; find a way to convince him or her of your ideas. And looking back at the SIGMOD paper, what really set it off was looking beyond the horizon: speaking to people outside of my domain, even when the conversation sometimes
Starting point is 01:22:31 put me out of my comfort zone. Because, you know, I have this nomenclature, the terms, this way of thinking about databases, but now we look at the problem from a very different angle. And in discussing it with someone from a completely different community, who uses different terms and describes problems differently, it can be uncomfortable to assume their point of view, but it may be very worthwhile. Sure. I think, like you say, putting yourself in an environment where you are going to be out of your comfort zone instantly gives you a more fertile environment for new ideas to generate, right? Because if you stay within your comfort zone all the time, I mean, I don't know, you need that spark. And I think sometimes going out,
Starting point is 01:23:21 stepping out of your comfort zone, speaking to somebody from a different area about a problem, can, yeah, I mean, it might end up with no ideas, right? But you never know, you might end up with a really good one as well. So I think that's good, solid advice. It's trial and error. Exactly, right? That's just the scientific way. So, last question now. What do you think is the biggest challenge in your research area? And I don't know how you'd actually go about categorizing your research area, because it's quite a novel topic in itself, right, in the sense of building an academic database system that people can actually use. I mean, yeah, what do you think is the biggest
Starting point is 01:24:04 challenge in the area, and maybe even in the wider space of databases in general? That's really a tough question. I mean, what I felt during my time as a PhD student is that, when talking to other PhD students, they were always interested in communication and sharing ideas and perhaps even collaborating. And I think what would be cool is if we could motivate our community to work more collaboratively on software. And I think that, generally, working on software
Starting point is 01:24:52 is currently a problem for research in our domain, databases, you know. Because, and here I must say I'm a bit copying, or paraphrasing, what Peter Boncz and others said at the memorial for Martin Kersten at CIDR: they said that Martin Kersten was always thinking about how to defend his employees and PhD students who were working on software, and how to get them credit for their work in the academic domain. And you may know this, because you have been, or you still are, a PhD student; no, you have handed in, right? I'm kind of in the hinterland: I've submitted but haven't defended yet, so yeah, I'm still technically one, even though I have a full-time job now. But yeah.
Starting point is 01:25:47 But I guess then you had the same experience. So you have to implement something in a system to do your evaluation, and that takes a lot of time; time in which you're not writing a paper and you're not publishing. And this is a problem of systems research, right? And I think that we could perhaps at least help people
Starting point is 01:26:18 who are considering a PhD in this domain if we could somehow better balance this programming effort and systems development with the actual part of writing papers. And perhaps a system like mutable, which tries to be a system for researchers, could help lift that burden off their shoulders. Totally agree. If you can minimize the overhead of the implementation effort, I mean, because I guess in systems it can be anything from three months to two years, right, depending on what the system is that you're building, if you're going to
Starting point is 01:26:53 build it from scratch yourself, which, like you say, is a huge amount of a PhD's time. It's not dead time, but it is time in which you are not producing papers and the things that you eventually get judged by, right, in terms of everything that contributes towards your thesis, or, if you want to stay in academia, the things you need to progress in that career. So, totally agree: if you can minimize that overhead somewhere and make it a more streamlined, less time-consuming experience, I think that is a massive win. So yeah, totally agree with that. And I mean, we can end it on that note. Thanks so much,
Starting point is 01:27:37 Manuel, for coming on the show. It's been a fascinating chat. If the listener is interested in knowing more about Manuel's work, I'll put links to all of the relevant materials in the show notes, and we will see you next time for some more awesome computer science research. Thank you, Jack. Thank you.
