Disseminate: The Computer Science Research Podcast - Till Döhmen | DuckDQ: A Python library for data quality checks in ML pipelines

Episode Date: March 13, 2025

In this episode we kick off our DuckDB in Research series with Till Döhmen, a software engineer at MotherDuck, where he leads AI efforts. Till shares insights into DuckDQ, a Python library designed for efficient data quality validation in machine learning pipelines, leveraging DuckDB's high-performance querying capabilities. We discuss the challenges of ensuring data integrity in ML workflows, the inefficiencies of existing solutions, and how DuckDQ provides a lightweight, drop-in replacement that seamlessly integrates with scikit-learn. Till also reflects on his research journey, the impact of DuckDB's optimizations, and the future potential of data quality tooling. Plus, we explore how AI tools like ChatGPT are reshaping research and productivity. Tune in for a deep dive into the intersection of databases, machine learning, and data validation!

Resources: GitHub, Paper, Slides, Till's Homepage, datasketches extension (released by a DuckDB community member 2 weeks after we recorded!)

Transcript
Starting point is 00:00:00 Disseminate, the Computer Science Research Podcast, the DuckDB in Research series. Welcome to the DuckDB in Research series, the series that focuses on interviews with people who have used DuckDB somewhere as part of their research, for example, maybe adding a new extension or something like that. I'm sure most of the listeners are already familiar with what DuckDB is if you're listening to this podcast, but for those who are new to the show, new to the concept of what DuckDB is, it is an open source, in-process SQL database that is designed for fast and efficient
Starting point is 00:00:31 analytical query processing. It's really simple to install and deploy, has zero external dependencies, it's extensible, integrates really easily with all your data science tools, having APIs for R and Python, and it can directly read from file formats like Parquet and CSV.
Starting point is 00:00:48 And you may be thinking, why Disseminate and DuckDB? What's happening here? So, well, Disseminate as a podcast, we're all about impact and helping bridge that gap between research and industry. And DuckDB is a brilliant example of these two communities working together, in the sense that the ideas from decades of research are at the core of the system and it's now influencing the research community itself as a platform for others to build on. And that is a nice segue into
Starting point is 00:01:15 welcoming our guest today, Till Döhmen, who is a software engineer at MotherDuck, where he leads the AI department. Welcome to the show, Till. Thanks, Jack. Today, we're going to be talking about DuckDQ, which is a Python library for data quality checks in machine learning pipelines. But before we do that, Till, tell us a little bit more about yourself and how you became interested in data management, machine learning, and this really cool space that we both work in. Okay.
Starting point is 00:01:42 Yeah. I guess my journey into data analytics started like 12 years ago, maybe. I was working as a software engineer in a team that was developing business intelligence solutions. At some point, I went to a Java conference and there was a new track, data analytics, so I went there and saw the talks and I thought, oh, wow, that sounds exciting. So I decided to start my master's in Amsterdam and then ended up doing my thesis with Hannes Mühleisen on CSV parsing, which ended up inspiring the existing CSV sniffer in DuckDB. I also contributed that at some
Starting point is 00:02:28 point. Since then, I spent time in different companies, always working at the intersection of data management and machine learning, never really committing to one of them, just doing what feels right. Since September last year, I've been at MotherDuck, and I kind of started the AI efforts at MotherDuck. I'm also still doing a PhD on the side, so I'm still involved in research, though it's more wishing to be actively doing research. I think it's really, really nice, really helpful to still be involved in research and to go to conferences, to occasionally really take the time to write a paper. It really helps to sharpen the mind on certain problems. It's super exciting, and you get invited to interesting podcasts. Yeah, but doing a PhD on the side there, it sounds like you'd like to keep
Starting point is 00:03:34 yourself nice and busy, put it that way. Yeah, for sure. I don't like the weekends. I really want to work Saturday, Sunday night. I really want to work on those papers. Yeah, that's it. Just tweaking that LaTeX a little bit, just to get that figure in the exact right place. Right. Yeah. No, but actually there was a point where I thought, well, I'm not sure if I still want to do this, because I really don't like the writing part of it. I really like the ideation, thinking about stuff, trying stuff out, but actually getting it on paper is super, super hard for me. But when ChatGPT came around, and I'm not saying that I'm using it for everything, it's just that sometimes, with the writer's block, it's so awesome to
Starting point is 00:04:19 have these kinds of tools available in daily life, daily work, whether it's writing a paper, thinking about ideas, or for coding itself. I feel like that really took a little bit of the burden of doing both things off of me, because it kind of shaves off some of these things that are really, really annoying. Yeah, yeah, I totally agree. Often the hardest thing with something like writing can be actually starting, right? Just sitting down and doing it, but having something like ChatGPT that you can almost use to sort of seed that process a little bit and get the ball rolling, and then have something to iterate with.
Starting point is 00:05:01 Yeah, I mean, it's great. I use it more and more in my day-to-day life. So yeah, it's really great to see this sort of productivity enhancement you can get from it, which is great, right? Cool. But yeah, it's also really nice to hear your story there as well. Everyone's got a different story, and it's always nice to hear how you ended up where you are.
Starting point is 00:05:17 So that was really nice. Cool. So let's talk about DuckDQ then. I guess before we do that, give us a little bit more context on the problem space: what is the story at the moment with data validation and data quality in the machine learning space, in the machine learning world? Yeah, sure.
Starting point is 00:05:37 I mean, I'd have to double-check, but I think I wrote the paper about four years ago now, so things might have changed a little bit. Even though I think the last two, three years, probably most AI teams or ML teams were mostly focused on generative AI and kind of leveraging that new exciting technology in some way. So I hope what I'm saying still applies today. But back then, the motivation was that the development process of machine learning models doesn't stop after training. Often, the hard part starts when you actually put the machine learning model into production and you expose the model to live data where you
Starting point is 00:06:26 actually want to make the predictions. What maybe worked great on the evaluation dataset might at some point not work that great anymore in production. Often that goes back to garbage in, garbage out: if the inputs that the model consumes are not good, the model will also not produce great outputs. I come from a software engineering background where I'm used to things like unit testing and so on. And the equivalent for that, like an engineering approach to deploying machine learning models, was something I was interested in. And data quality is really the most obvious thing to go to. Yeah. It's funny you say that, because actually when I was reading your paper,
Starting point is 00:07:20 I don't think the phrase exists anywhere in the paper, but I actually did highlight a section and write "garbage in, garbage out" next to it. So yeah, that's that kind of classic phrase there. Cool. So I guess, given that, and rolling the clock back to 2021 when you wrote this paper, and I know things may have changed slightly in the intervening years, but back then, and probably still today in some places, what were the state-of-the-art approaches for trying to ensure this data quality in machine learning and getting a little bit closer to what you would be used to in the software engineering world?
Starting point is 00:07:56 And why were they terrible? So I guess there were some quite interesting solutions in the big data space. My PhD advisor used to work for Amazon, for AWS, and they were developing a framework based on Spark, Deequ, to do data quality evaluation at scale and to do it efficiently. It's something that was used, I believe, internally at AWS, but it's nowadays also available as an external service there,
Starting point is 00:08:36 one of the many offerings of AWS. For the data scientists that don't work with Spark clusters, that work on maybe smaller, let's say moderately sized data, there was Great Expectations, for example. That's one of the libraries that was quite popular back then. In terms of features, I think Deequ, which was the Spark-based system, and Great Expectations offer very similar things. They help you with determining distributions of your input data, of your training data, and give you a way to monitor or to profile your production data against these baseline distributions. So you can see a shift in distribution that might break the assumptions you were making
Starting point is 00:09:32 during training. And it can also highlight other more basic failures, like one column suddenly being null in my data because some downstream connector just broke, something is temporarily broken. And if we get this data into the model, then either the inference pipeline will break because it doesn't account for nulls in columns, or it will maybe impute the nulls with zeros and then you will get really unexpected results. So these libraries help with monitoring these kinds of data properties and with
Starting point is 00:10:08 then raising alerts, in a similar way to how you would trigger an alert for any other service degradation in the company architecture. And there are other tools, things like whylogs or Deepchecks, Soda. I Googled a little bit just before, to make sure I'm up to date. If we focus on Great Expectations, which was the thing that was around at the time, it was based on pandas. It was using pandas as the engine to determine all the summary statistics we mentioned: to get a sense of the distribution of the data, to get min, mean, max, whatever, standard deviations, these kinds of things, number of missing values.
Starting point is 00:11:00 It was using pandas as the compute engine, and it was also determining each metric one at a time. So if I want to get the min, the mean, and the max of a column, I run three individual pandas queries to determine those. The idea was to look at how Deequ does it, and Deequ does things very, very differently. It tries to determine which of those operators can be grouped into a single query. Then it only needs to scan the data once, which is much, much faster than scanning it multiple times for each metric.
Starting point is 00:11:45 Well, there happens to be an efficient database engine, and I guess a database engine instead of a data frame engine felt more natural for this type of problem, or for this type of optimization. And that happened to be DuckDB, which was kind of fitting this user group in terms of workloads, right? There was already a solution for Spark that scales indefinitely, but there was a missing solution to do data quality validation efficiently for small to medium sized data. Yeah. I guess that is also a significantly large portion of the market as well,
Starting point is 00:12:30 these people with small and medium sized data, right? Not everyone has big data. I know it's a bit of a catchphrase, a saying that gets said a lot, but not everyone has big data, right? So we need to build solutions for the other part of the data space as well. And with that, you mentioned DuckDQ briefly, and I can kind of see where the name comes from, how DuckDB and DQ are sort of merged into DuckDQ. So yeah, give us the TLDR then on DuckDQ.
Starting point is 00:13:02 It depends on which angle we want to focus on. I can talk about the kind of optimization we were doing, or about the integration with scikit-learn. I guess on a very high level, the goal was to provide something that is super easy to use, has very few dependencies, and just works. You can just plug it into your existing scikit-learn pipeline locally, and it tests all the data that goes into that pipeline and gives you a warning when it doesn't adhere to the quality metrics you specified. That was the high-level idea.
Starting point is 00:13:41 Yeah. Cool. So off the back of that, let's dig into the whole design, I guess, right? Maybe start off by telling us how it integrates with scikit-learn pipelines, and then we can focus on the specific things you needed from DuckDB and why DuckDB was a good fit for the problem you were trying to solve.
Starting point is 00:14:00 Right. So give us the high level overview of the integration with scikit-learn, and then we can get stuck into how DuckDB is actually used in the library. The scikit-learn integration was quite straightforward. scikit-learn has this notion of pipeline steps, and we basically added two pipeline steps to a scikit-learn pipeline. One of the steps is the input validation step, and the other is the output validation step.
Starting point is 00:14:33 The input validation step makes sure that the data that goes in adheres to the metrics you specified, and the output step makes sure that the output of the model adheres to those. We wrapped it up into a Python package. Instead of importing a scikit-learn pipeline, you can import a DuckDQ pipeline and use it in the exact same way you would use a scikit-learn pipeline. Then you can specify these data quality metrics on the pipeline object. And when it runs into an error, there are different levels of errors. I think it can be
Starting point is 00:15:16 info, warning, or error. When you hit a warning, it will just output a log message, same with info, and when it's an error, it will actually raise an exception. And what is maybe quite neat is that you can also serialize the entire pipeline, so you can really just pick up the entire thing and copy it into your deployment system.
Starting point is 00:15:44 And you will have the pipeline, including the checks, everything, within this serialized package. I know Pickle probably isn't the right standard to do that nowadays; there are some security issues with Pickle. Okay. I really like the design philosophy of it being a drop-in replacement as well. Right. So the user basically has no extra dependency or anything like that.
Starting point is 00:16:13 You just unplug one and plug in the other, and you get all the benefits of having these data quality checks for free, which reduces the friction for use, right? Which is always really nice. I just thought it fits so nicely to the second question you asked about DuckDB. It is a drop-in replacement, and the input to this pipeline is still a pandas data frame. Whatever happens under the hood doesn't really matter for the user.
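To make the drop-in idea concrete, here is a minimal sketch of the pattern described above. The class and method names are illustrative assumptions for this transcript, not DuckDQ's exact API:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# from sklearn.pipeline import Pipeline   # the usual import...
from duckdq import Pipeline, Check         # ...swapped for the DuckDQ drop-in (names assumed)

X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40_000, 52_000, 61_000, 58_000]})
y = [0, 0, 1, 1]

# Same constructor shape as sklearn's Pipeline, so existing code keeps working.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression())])

# Hypothetical check API: declare input-quality constraints on the pipeline object.
pipeline.with_input_check(
    Check(level="error").is_complete("age").is_non_negative("income")
)

pipeline.fit(X, y)  # per the episode: info/warning levels log, error level raises
```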
Starting point is 00:16:49 But a really cool thing that DuckDB enabled at the time, and it was a feature that was added basically right around the time when we wrote the paper, so it was really, really great timing, was zero-copy reading from pandas data frames. What that enables is that you can write a DuckDB query, like a select star from a table, and the table is actually a pandas data frame that lives in memory in your Python process. The costs or overheads of doing that are very, very minimal. It doesn't add memory overhead, because DuckDB essentially reads from the address space of your pandas data frame. The compute overhead
Starting point is 00:17:39 compared to querying it in the native DuckDB database format is also not that high, because the memory layout of pandas data frames is not so different from DuckDB's memory layout. So that was really an enabler for making it so nice and easy to use while also being very efficient. So that actually happened around the same time? Was that just good fortune, or did you tackle this problem partly because you knew this new feature was coming and would really be a boon for such a solution, that you have this zero-copy functionality?
Starting point is 00:18:21 I guess it was actually the other way around: this scikit-learn integration, or this pandas integration, was an idea that developed in the process of bringing Deequ-style checks to DuckDB. What I said earlier about the motivation is actually something that evolved while working on the problem. We thought, oh, well, this would actually be really cool to develop. Around that time, we were then using that just-merged feature, basically. I think we were also a good test ground for this functionality. I think there were some performance issues that we found around strings or something.
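For anyone who wants to try the zero-copy scan Till describes, a small example with DuckDB's Python API (the replacement scan resolves the table name df to the pandas DataFrame in local scope, without copying the data):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.5, None, 4.0]})

# "df" in the SQL below is the in-memory DataFrame above; DuckDB scans its
# columns directly rather than importing them into its own storage first.
print(duckdb.sql("""
    SELECT min(x), max(x), avg(x),
           count(*) FILTER (WHERE x IS NULL) AS null_count
    FROM df
""").fetchall())
```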
Starting point is 00:19:13 I guess it was helpful for DuckDB as well as for us to work a little bit on real-world workloads for this specific use case. And it was great to have a short wire to the developers. Yeah, of course. Having that sort of feedback loop and that communication link must be really helpful. And I guess, with this being DuckDB focused, what other features from DuckDB did you specifically rely on as part
Starting point is 00:19:44 of the DuckDQ solution? Yeah, I think the features that we are using are relatively basic, because we do basic aggregations: min, max, mean, and so on and so forth. One thing that we... I guess it's not really a user-facing feature, but it's something implicit to the database which I think has really been beneficial, and that's the execution model of DuckDB, which is this vectorized execution. It kind of operates almost like a streaming engine. Every operator, whether it's mean, min, max, or standard deviation, is implemented in a way that it has intermediate state: it will determine the max of those 2048 values and then remember that max as the state of the operator.
Starting point is 00:20:50 Then the next batch comes, and so on and so forth, and it will always update this state. That of course gets a bit more complicated with multi-threading and so on, but in principle that's what happens. And that happens to be something that DuckDQ also uses and exploits. For one particular function, we had to extend DuckDB to expose this internal state to us so that we can actually also use it and persist it, because Deequ's computations are also stateful. I can explain why. That comes from this notion of incremental loads into a data warehouse or data lake, where every day you get a new small additional batch of data that gets appended to your existing
Starting point is 00:21:43 table. Now, when you're interested in the mean or max of the table, you wouldn't want to scan the entire table again. You would want to have some intermediate result of everything you have known so far and then only add the states of the new rows that were added, and, yeah, exactly, add those states up. We persist those states in DuckDQ, so that saves a lot of recomputation. For max, that's easy. For standard deviation, I think we had to expose some internal values from the DuckDB operators.
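As a toy illustration of the stateful, mergeable aggregates Till is describing (this is neither DuckDQ nor DuckDB code, just the principle): a persisted state from an earlier load can be merged with the state of a new batch without rescanning old rows.

```python
from dataclasses import dataclass

@dataclass
class MeanMaxState:
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def update(self, values):
        # consume one batch of values, updating the intermediate state
        for v in values:
            self.count += 1
            self.total += v
            self.maximum = max(self.maximum, v)

    def merge(self, other):
        # combine two states; no access to the original rows is needed
        self.count += other.count
        self.total += other.total
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self):
        return self.total / self.count

old = MeanMaxState(); old.update([1.0, 5.0, 3.0])   # yesterday's persisted state
new = MeanMaxState(); new.update([7.0, 2.0])        # today's incremental batch
old.merge(new)
print(old.mean, old.maximum)  # metrics over all rows; old rows never rescanned
```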
Starting point is 00:22:26 And that's maybe also something that was nice about DuckDB, that we could do that. I mean, it's open source, right? You can't do that with every database system. And it's also kind of easy to follow the code; it's well structured. Now it has this community extension mechanism, where it's actually really, really well documented how you add custom functionality to DuckDB. It has the Discord community, and also GitHub. I think back then there was also a lot happening on GitHub.
Starting point is 00:23:08 Still is. I'm always impressed by the amount of care that DuckDB people put into responding to questions, to bug reports, to everything. So when you want to know how something works, you will likely find an answer from someone at DuckDB Labs who will tell you, okay, you should do this or that. Yeah. It makes for a nice developer experience, right? When you have that, you can ask a question and you know you're going to get a very thoughtful, insightful answer back, which is really helpful. It's interesting that you said, as part of the process,
Starting point is 00:23:48 you ended up needing a feature from DuckDB that it didn't necessarily have, and so you were able to expose this intermediate state. And it's interesting because I like to ask, how did your research then influence the core product? That's a really nice thing, that you needed this. And I guess, do you have any insight as to how that has played out in the long term? Have other people then relied on this ability to have access to this intermediate state, or not? You encountered it, and it'd be nice to
Starting point is 00:24:15 know how many other people have then gone, oh, given I have this, I can now do the thing I wanted to do as well. Yeah. I mean, we never upstreamed that, actually. I think that's a bit of a shame with that project: it never really got into a phase where I would consider it ready or mature enough to actually use in practice. Unfortunately, it's one of those research artifacts that were created for a paper, which didn't even get as much attention as I was hoping it would get. It was actually very hard to get it accepted, even
Starting point is 00:24:56 though it was quite a lot of engineering effort to build it. At the same time, the main purpose was to get a research paper out, and afterwards I had to move on to other things. It brought me a lot of experience personally that I could use later on. I happened to work in a role later on where we were working with Deequ, the Spark-based library, and I could help the company I was working for with making it more efficient. Later on, we also introduced DuckDB in their stack to get rid of these overheads that Spark introduces. So it was helpful for me, but I wish it had been helpful to more people. And just when you reached out, also, like a week later, I talked with my advisor and said, I really want to pick this up again.
Starting point is 00:25:58 We should at least get it released on PyPI and figure out some small things, like, as I said, this custom extension, this kind of fork we did of DuckDB. We should do that. There's also a dependency on Apache DataSketches, which is an amazing library. I don't know if you want me to talk about data sketches, but it's instrumental for a couple of things that we do in DuckDQ, and it's quite an interesting way of doing approximate query execution, basically. So that's also a dependency that doesn't feel really nice. Now, with the extension mechanism of DuckDB, I really feel it's time to build a community extension that just incorporates
Starting point is 00:26:45 all of this into one package. Yeah, it'd be really cool to see it get picked up again. It's maybe had a couple of years just sat there slow cooking, but now it's time to maybe turn up the heat a little bit and turn it into a product people can use. That'd be really, really nice. I mean, also with the kind of saturation in AI, or at least with, let's say, the growth curve
Starting point is 00:27:12 flattening a little bit. I mean, there are still new models every day, still new product announcements every day, but maybe we're not going to see such fundamental improvements, at least in language model capabilities. The general purpose foundation models will probably go more into specific application domains and so on and so forth, where we see improvements, or just building products based on this amazing technology. I hope it frees up some time for teams to focus back on classical machine learning, because I read this number somewhere, unfortunately I don't remember where, but 90% of the machine learning workloads that are actually running in companies are still classical machine
Starting point is 00:28:10 learning. So they actually make the largest amount of money and impact. So yeah, I still think there's a lot of room to provide better tooling there. Yeah, I agree on that. I think you find that a lot, even within data management, within databases. As I've said, certain concurrency control protocols, for example, that are more complicated and have been
Starting point is 00:28:39 demonstrated in the research literature can actually get you better throughput, lower latency, whatever. But because they're a lot more complicated, intricate solutions, lots of systems still fall back to maybe using a lock-based system, which is very simple to reason about and put out there. That's probably the reason why it's put into practice, right? Because people can sort of reason about it, it's a lot more explainable. And that's probably true as well of some of these more advanced machine learning techniques compared to some of the classical approaches, which are more understood, I guess, and you have more confidence putting
Starting point is 00:29:10 them out into production, into the wild. And it's interesting, you mentioned a second ago about implementation, so let's talk on that a little bit more. Cast your mind back to when you were working on this; you said it was quite a significant implementation effort. So yeah, tell us, what was that like? How long did it take? What sort of things did you encounter that were a bit like, oh, this is difficult, maybe
Starting point is 00:29:30 scoped towards DuckDB in particular? So, I mean, the first challenge was to actually understand exactly the way in which Deequ is built, the original system that inspired this work, and then to come up with an architecture that resembles what Deequ is doing but makes it possible to do that in SQL, because Deequ was actually built on PySpark, the declarative kind of wrapper of Spark where you could write kind of pandas-like code.
Starting point is 00:30:14 And we also wanted to build a solution that is flexible, where you can plug in other SQL databases, not necessarily only DuckDB. Yeah, I guess that required some thinking about how to architect it. We ended up, and I'm looking at the paper now, at this system architecture that I have here, we ended up having a user-facing API that is pretty much the same as Deequ's original API, which means you have a declarative API to specify certain quality constraints. Then we have a second layer of abstraction where we handle all this verification logic, the state management
Starting point is 00:31:06 of the intermediate states that I mentioned earlier, and where we translate these different quality constraints into concrete operators that we want to use during the profiling. And there are two distinct sets of operators: one of them are scan sharing operators and the other one are group scan sharing operators. For the user, it is not important which quality metric maps to which of those operators. The user basically only says, okay, this column should never be null, the standard deviation of this column should never be larger than this, the histogram here should look like this and that. And we map those to a certain set of operators, and types of operators, that we need to execute.
Starting point is 00:32:05 In the end, we essentially need to generate SQL queries that execute those operators. The nice thing about the separation into these two different kinds of operators is that, in the end, you only need to run two queries: one for the scan sharing operators, one for the group scan sharing. Compare that to pandas: imagine you have a dataset with 500 columns and 50 different constraints. In pandas, that would be a lot of individual operations. In DuckDQ, that would be only one, or maybe two queries if you happen to have one that needs a group scan sharing operator.
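As a rough illustration of the scan-sharing idea (plain DuckDB here, with made-up metrics, rather than DuckDQ's actual generated query), many quality metrics collapse into one pass over the data:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [0.5, 0.7, 0.9]})

# One query, one scan, many metrics; metric-at-a-time pandas code
# would make a separate pass over the data for each of these.
duckdb.sql("""
    SELECT count(*)                          AS row_count,
           count(*) FILTER (WHERE a IS NULL) AS a_null_count,
           min(b), max(b), avg(b), stddev(b)
    FROM df
""").show()
```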
Starting point is 00:32:52 I mean, I can talk about the distinction, because it's maybe interesting to understand. The main assumption is that scanning data is expensive: even though the data might be in memory with pandas, reading every single data point will bring it into the CPU cache and do something with it. If it does that 20 times at different points in time, that's much worse than just keeping it in the cache and doing all the different operations on it while it's there. There's still a big difference between reading from CPU cache and reading from memory. Nice.
Starting point is 00:33:43 I guess, leading on from that, it feels like there's going to be a really big speed-up here. So let's talk about some evaluation results from the experiments you ran when you were evaluating DuckDQ's performance. Yeah. How much faster is it? How much more performant is it over the state of the art, compared with having to basically do a full table scan over
Starting point is 00:34:05 500 columns every time I want to perform some check? Yeah. So we compared it, obviously, to the original Deequ on workloads that are small or moderately sized, which we defined as a few million rows. I think the first experiment is on three million rows and eight quality constraints. When we run this quality evaluation on the Spark-based Deequ version, it takes around seven seconds. If you think about a production machine learning pipeline, where you want to do predictions, that's quite a big amount of time.
Starting point is 00:34:48 Well, you don't make three million predictions at once, maybe, but that's why it's important to also look at the static overhead of running a Spark-based pipeline. In the experiments here, it was 47 seconds just for starting the Spark process and then reading the data from CSV into Spark. But usually, no matter how much data you read, it will take at least like 8, 9, 10 seconds to start up the Spark cluster. DuckDB, compared to that, only took 0.6 seconds or so to do the evaluation. So even if you are in a production setting where you want low latency predictions, for very decent batch sizes you still have quite a minimal overhead to do this quality validation.
Starting point is 00:35:54 This is cranking up Apache Spark to run every time, right? Like eight seconds just to get the damn thing up to actually be able to do it. So yeah, it's a massive improvement over that, for sure. I guess you mentioned that obviously the project has been on the back burner for the last couple of years and that you would maybe now think about picking it up again and seeing if you can make it into a tool that people could actually use in a production setting. So I guess the question then is, what do you think would be needed to actually get to that point from where we are today
Starting point is 00:36:25 with it? I think what it would really need is a C++ developer in the first place. That's always a bit of a rare find among machine learning people, or people that are very interested in machine learning and data science, because Python is obviously the most common language there. DuckDB has SQL-only extensions as well, where you need only a very minimal amount of C++ knowledge to implement new functionality, but that only really makes sense if the functionality can in some way be expressed with the existing SQL functions. I
Starting point is 00:37:12 think in that case, we would need some custom functions. I would really wish that Apache DataSketches would be wrapped up into a DuckDB extension; that would be super useful, because DuckDB has an approximate count distinct operator, so you can determine the number of distinct values in a column by only scanning once. That's an optimization that Deequ uses, but Deequ also uses an optimization for quantiles, to determine basically column distributions, a thing called KLL sketches. And having this as a DuckDB function, to be able to compute a KLL sketch over a column, would be one of the enablers for DuckDQ. And I think it would be super, super useful beyond that.
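For reference, DuckDB's built-in approximate aggregates already cover part of this; the KLL sketches Till mentions are what the datasketches community extension from the show notes adds. A quick example of the built-ins:

```python
import duckdb

# Approximate distinct count (HyperLogLog-based) and an approximate
# quantile, each computed in a single scan of the data.
duckdb.sql("""
    SELECT approx_count_distinct(range) AS distinct_estimate,
           approx_quantile(range, 0.5)  AS median_estimate
    FROM range(1000000)
""").show()
```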
Starting point is 00:38:05 Cool. There we go. Those are the steps then, there's your roadmap. So 2025 Q2, we're going to have it done, Till. Yeah, exactly. Let's have a call then. Let's check in again then. I would just ask ChatGPT to write it. There you go, there's your C++ developer. I guess the next question I want to ask, reflecting, obviously, it's been quite a few years now since you worked on this, is to gauge what your overall experience was like working with DuckDB. Obviously it's probably a very different sort of experience back then than it would be now, maybe because things like the community extensions have come around
Starting point is 00:38:50 and the documentation has improved a lot over the intervening years. But yeah, what was that experience like when you reflect on it? And also, what would be your advice to would-be extenders of DuckDB, or people who would want to use it in their research or in their side projects? Yeah. So that's an interesting question. I guess you asked multiple questions. Sorry, yeah, I have a habit of doing that. What was your experience like working with DuckDB, reflecting on it? That was the first question. Then we'll do the next one. So there's two for the price of one.
Starting point is 00:39:21 My short-term memory is corrupted already. I think the experience was great at the time, so I expect it to be similar today. Of course, now working at MotherDuck, we have a professional relationship with DuckDB Labs; we have a DuckDB Labs partnership that gives us access to actual development resources. Back then it was more informal. It was just like, okay, yeah, that sounds great, let's do it. Or maybe not. Or, please make a PR. I think that's something that still works great today. If something is missing in DuckDB and you feel it should be there, why not open a PR to add it?
Starting point is 00:40:10 So that's always the way, I think; that still works today. Yeah, I mean, there's so much to learn by looking at the DuckDB source code as well. Especially the way DuckDB does testing; I think that's a very, very good example of how software should be tested. It's just so, so extensive. Every single feature has a test suite; every edge case is tested. When I merged the CSV sniffer, I guess I spent more time developing tests and finding my own bugs in the process and
Starting point is 00:40:53 so on than I spent on the initial implementation. That's good. That's the way they say it should be, right? You should be spending more time writing tests; the actual writing of the production code is probably the smaller portion of the time. Yeah. I guess the second question I asked a second ago was, what would your advice be
Starting point is 00:41:12 to anybody thinking about using DuckDB, or extending it, or using it in their research? Yeah, I guess it really depends on where you're coming from. I think it's interesting for people working on systems, like data management systems. It's kind of a breeding ground for master's students and PhD students coming to CWI, who develop various extensions, optimizations, and things for DuckDB or on top of DuckDB. But then on the other hand, there are also researchers in the geo space or in chemistry or in bioinformatics that just have the need to process relatively large amounts of data in an efficient way locally, and DuckDB just offers a great solution for this.
Starting point is 00:42:07 And I mean, there is a great geo extension, but I've not seen a chemistry extension, for example, for DuckDB, even though I'm sure there are lots of possibilities. I had a bioinformatics course at my university during my studies; it was a super old Java program that we used to do some sequence alignment or something. I'm sure a tremendous amount of domain knowledge went into it, and for being what it is, it was probably really, really well thought through and optimized. But maybe there's still huge potential to do these kinds of analyses much more efficiently by putting on the database management glasses and thinking, well, could we maybe actually put this in a DuckDB extension?
Starting point is 00:43:11 Can we fit this to DuckDB's execution model and maybe just use that? I hope DuckDB will develop a bit more into this open ecosystem, a bit like R, maybe: people just develop new R packages, so why not develop a new DuckDB plugin? Yeah, it's interesting that you mention that, because I've experienced a similar thing with a friend who was doing her PhD in bioinformatics, or genomics, somewhere in that space, I can't recall the exact specific domain, and the same sort of
Starting point is 00:43:43 thing: she said, can you help me with this program? And I was like, okay, this is a janky old Java-based thing. Obviously it was very focused on doing this one specific task, I can't remember what it was, but I was like, this feels like such a bad user experience. It wasn't a nice API or anything, but obviously it was the state of the art in that specific field.
Starting point is 00:44:03 But yeah, there are probably loads of opportunities there to apply all the really awesome techniques from DuckDB and from data management to those sorts of spaces as well. Yeah. And it's not only about making it fast, right? It's cool if it's faster, but sometimes I think what is much worse is: oh, that's a type of analysis that I wish I could do, but my program always
Starting point is 00:44:26 crashes when I do it. And then it really enhances the possibilities there. Cool. I've just got a couple more questions before we wrap things up. The next high-level section, shall we call it, is about impact. Obviously this paper has been around for three or four years now. What do you think the impact of the paper has been, and have you had any feedback from people over the years on it? Yeah, so I've heard from people from time to time saying, yeah, okay, well, this looks interesting.
Starting point is 00:45:07 I would want to use it. In which state is it? Is there a PyPI package for it? That's kind of where my future plans were coming from. Actually, I hadn't checked the citation counts on it; that would maybe be an interesting thing. There have been two citations, but they're not really all too related. I guess one problem with that is that it was basically a development,
Starting point is 00:45:49 like an engineering problem in the end. I did not really solve a research problem. I made some benchmarks around it and so on, but it was not something revolutionary, novel. And I guess that was something that was not appreciated that much by the research community. But then, on the other end, to make it useful, to actually achieve real-world impact with the work, it really shows that you need to put in more effort than you
Starting point is 00:46:28 would put into a research paper. I think DuckDB is also an example of that, where the focus is really on the engineering excellence, on making it a great tool for users, thinking about the user problems and trying to solve them, and not necessarily only about innovation. Sometimes it's great to pick up existing things that people might have neglected for a long time, but that in this context are suddenly useful. That's kind of what I think I've learned from it. Sometimes it's worth putting in the extra effort to focus on the users and think about what we can do for them, how we can make it easy for them to
Starting point is 00:47:19 adopt things like that. There are also great companies nowadays. I think whylogs, which comes from another company that works on data quality, actually also looked at Deequ to some extent; their library also got inspired to some extent by how Deequ works. There's a company, a business, around that. And as, you know, a PhD student developing a project, it really takes a lot of dedication. I guess open source project maintainers are sometimes really not appreciated enough for the amount of work that they're doing to give us these free open source projects that are well maintained.
Starting point is 00:48:07 Yeah, I completely agree with you on that. Another question that maybe leads on from this one a little bit, about lessons you learned whilst having this experience: what was the most surprising thing that you encountered while working around data quality and with DuckDB in this space? What's been the biggest sort of, I did not expect that? That's a good question.
Starting point is 00:48:33 I guess what I didn't expect... I mean, it's not something I necessarily learned in this particular project, but a little bit earlier, leading up to it. I thought Spark was this tool that is for big data and the most efficient thing you could think of. But then when I started using it, I realized, oh, wow, there's something that is so much faster. It kind of opened a new world, thinking, well, okay, sometimes maybe it's really good to focus on a specific slice of the problem and say, okay, we don't solve everything.
Starting point is 00:49:18 We focus on small to moderate-sized data and try to make a great experience for these types of users, instead of saying, oh, we want to solve everything, we want to scale indefinitely, because there are trade-offs when you do that. Cool. So yeah, lastly, I'd like to touch on the other work that you've got going on as well, and then I want to have a little recommendation feature, where you recommend an extension, a plugin, or a feature related to DuckDB that you like
Starting point is 00:49:49 from the other things out there. But before we do that, you obviously do a lot of other really cool work as well, and two of those things are related heavily to DuckDB: a feature store and text-to-SQL. So give us the quick rundown on those two. So the work on DuckDQ brought me to the topic of feature stores. Feature stores are basically data management systems for machine learning, for data scientists and machine learning engineers, that manage machine learning features. And one integral part of a feature store system is also data quality validation. I think feature stores are a great solution for the social aspects of developing machine learning models: in software development you maybe have code repositories.
Starting point is 00:50:40 Then the feature store feels like the equivalent for machine learning engineers, data engineers, and data scientists collaborating together on building and productionizing machine learning models, enabling reusing features and so on, and discovering features that others have prepared. Then, on text-to-SQL, which is a completely orthogonal topic in a way: it is something that I kind of started working on when the large language model thing took off one and a half, two years ago. It's really hard to make a connection to data quality, even though I've worked so much more with language models since. In the end, as I said, a lot of the teams that were building the machine learning models are now maybe building LLM-based pipelines, and the dynamics there, in principle, are similar problems. You have this kind of thing that is behaving
Starting point is 00:51:54 in a weird way, which is on one end maybe the language model, or on the other end the machine learning model. You give something in and you get something out, and somehow it's a little bit hard to verify whether that's correct or not. So the MLOps, or now LLMOps, space is evolving and trying to provide tools for these GenAI or AI engineers to harness these large language models in production. It's a super interesting problem also, I think, from the data management perspective, from the research perspective there. And I guess with text-to-SQL I've come full circle from my interest in databases and machine learning; it just brings everything together. Nice.
Starting point is 00:52:47 Cool. In terms of where the listener can maybe go and play with these two tools: where are they available? Where can we go and find the feature store and the text-to-SQL? How can I go and play with those two? Before joining MotherDuck, I worked for Hopsworks, which is a company that builds an open source feature store. So if you go to Hopsworks.ai, you will probably find it.
Starting point is 00:53:15 It also partly runs on DuckDB. And for text-to-SQL, I guess you could sign up to MotherDuck and try it out there. There's tons of research about text-to-SQL; it's really a hot research topic. I don't think the best method has been found yet, and it's still super hard to produce verifiably correct results, in the sense that not only does the query that comes out of a natural-language-to-SQL system parse, but it also semantically makes sense.
Starting point is 00:53:52 It's really a tricky problem. Nice. I guess the penultimate question then is the recommendation, after obviously all the cool work you've done over the years: what's your favorite DuckDB plugin, extension, or just a feature that it has? So, not so long ago we added a function to MotherDuck which allows you to call language models from SQL.
Starting point is 00:54:21 So if you have a text column, you can write a SQL function call, select prompt, and say, summarize this column, as summary, from my table. And then for each row in your table, it will call OpenAI's GPT-4o mini, which is a low-cost language model, and it will produce, for example, a summary. It also works with structured outputs, so you can actually convert text to structured JSON, and DuckDB happens to support JSON, so you can unwrap it, and suddenly you have converted unstructured text into a new set of columns in the database. I think that's quite interesting. There's still lots of ways this can be optimized and so on.
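A sketch of the pattern Till describes, assuming a MotherDuck connection (the prompt() function runs on MotherDuck, not in plain local DuckDB); the table and column names here are illustrative:

```python
import duckdb

con = duckdb.connect("md:my_db")  # connect to MotherDuck (requires an account/token)

# One LLM call per row, straight from SQL.
con.sql("""
    SELECT prompt('Summarize in one sentence: ' || review_text) AS summary
    FROM reviews
    LIMIT 10
""").show()
```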
Starting point is 00:55:14 I really like the fact that one day after we released that, someone published an open version of that function as a DuckDB community extension, called Open Prompt. So yeah, I was pretty stunned by how fast that happened. Yeah, that's the pace of the LLM space and the AI space, it's so fast moving, right? But yeah, that's a great example of moving fast. Cool. I guess the last question then, Till, is: what's the one thing you'd like the listener to take away from this podcast chat today? I guess this goes mostly to the research audience, to
Starting point is 00:55:59 people being involved in research from different fields; I guess people from many areas of computer science research are listening. I personally think this intersection of machine learning and data management systems is super interesting. There are so many new ideas and possibilities in the interplay of machine learning, generative AI, and data management systems, in combination or at the intersections of those different worlds. So many interesting things can be done, whether that's making data exploration easier or making text analytics easier. And it goes the other way around, too:
Starting point is 00:56:55 how can we maybe use data management systems to make certain ML problems easier, like these data quality checks or data validation? And of course, if you do any research, you should do it based on DuckDB, and make it a community extension, so everyone can benefit from it; don't make the same mistake that I did. That's a great message to end on there. Thank you very much. It's been a lovely chat today. Where can we find you on social media? Where's best for listeners to reach out to you if they are interested in the things we spoke about? Yeah, oh, and now I forget the handle.
Starting point is 00:57:45 What is it? The at is usually not in the domain, right? It's not. No. Well, my Twitter handles are the same. I suppose you can add them to the video, to the audio description. Yeah, I'll drop a link to everything we've mentioned in the show notes today. And yeah, listeners, go and check them out.
Starting point is 00:58:10 But yeah, it's been a great chat. Thank you very much, Till.
