Disseminate: The Computer Science Research Podcast - Till Döhmen | DuckDQ: A Python library for data quality checks in ML pipelines
Episode Date: March 13, 2025

In this episode we kick off our DuckDB in Research series with Till Döhmen, a software engineer at MotherDuck, where he leads AI efforts. Till shares insights into DuckDQ, a Python library designed for efficient data quality validation in machine learning pipelines, leveraging DuckDB's high-performance querying capabilities. We discuss the challenges of ensuring data integrity in ML workflows, the inefficiencies of existing solutions, and how DuckDQ provides a lightweight, drop-in replacement that seamlessly integrates with scikit-learn. Till also reflects on his research journey, the impact of DuckDB's optimizations, and the future potential of data quality tooling. Plus, we explore how AI tools like ChatGPT are reshaping research and productivity. Tune in for a deep dive into the intersection of databases, machine learning, and data validation!

Resources: GitHub, Paper, Slides, Till's Homepage, datasketches extension (released by a DuckDB community member two weeks after we recorded!)

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast, the DuckDB in Research series.
Welcome to the DuckDB in Research series, the series that focuses on interviews with
people who have used DuckDB somewhere as part of their research, for example, maybe
adding a new extension or something like that.
I'm sure most of the listeners are already familiar with what DuckDB is if you're listening to this podcast, but for those who are new to the show, new to the concept of what DuckDB is: it is an open source, in-process SQL database that is designed for fast and efficient analytical query processing.
It's really simple to install and deploy, it has zero external dependencies, it's extensible, it integrates really easily with all your data science tools, having APIs for R and Python, and it can directly read from file formats like Parquet and CSV.
And you may be thinking, why Disseminate and DuckDB?
What's happening here?
So, well, Disseminate as a podcast, we're all about impact and helping bridge that gap
between research and industry.
And DuckDB is a brilliant example of these two communities working together, in the sense that ideas from decades of research are at the core of the system, and it's now influencing the research community itself as a platform for others to build on. And that is a nice segue into welcoming our guest today, Till Döhmen, who is a software engineer at MotherDuck, where he leads the AI efforts. Welcome to the show, Till. Thanks, Jack.
Today, we're going to be talking about DuckDQ, which is a Python library for data quality checks in machine learning pipelines.
But before we do that, Till, tell us a little bit more about yourself and how you became interested in data management, machine learning, and this really cool space that we both work in.
Okay, yeah.
I guess my journey into data analytics started like 12 years ago, maybe.
I was working as a software engineer in a team that was developing business intelligence solutions.
At some point, I went to a Java conference and there was a new track, data analytics,
so I went there and saw the talks and I thought, oh, wow, that sounds
exciting. So I decided to start my master's in Amsterdam and then ended up doing my thesis with Hannes Mühleisen on CSV parsing, which also ended up inspiring the existing CSV sniffer in DuckDB; I contributed to that at some
point. Since then, I spent time in different companies working always on the intersection
of data management and machine learning, never really committing to one of them, just doing
what feels right. Since September last year, I've been at MotherDuck, and I kind of started the AI efforts there. I'm also still doing a PhD on the side. I'm still involved in research.
I wish I were more actively doing research. I think it's really nice, really helpful to still be
involved in research and to go to conferences, to occasionally really take the time to write a paper.
It really helps to sharpen the mind on certain problems. It's super exciting and you get invited to an interesting podcast.
Yeah, but doing a PhD on the side there, it sounds like you'd like to keep
yourself nice and busy, put it that way. Yeah, for sure. I don't like the weekends.
I really want to work Saturday, Sunday night. I really want to work on those papers.
Yeah, that's it. Just tweaking that LaTeX a little bit, just to get that figure in the exact right place. Right. Yeah.
No, but actually there was a point where I thought, well, I'm not sure if I still want
to do this because I really don't like this writing part about it. I really like this
ideation, thinking about stuff, trying stuff out, but actually getting it on paper is super, super hard for me. But then ChatGPT came around. I'm not saying that I'm using it for everything, but sometimes, for the writer's block, it's just so awesome to have these kinds of tools available in daily life, daily work, whether it's writing a paper,
thinking about ideas or for coding itself. I feel like that really took a little bit of this
burden of doing both things off of me because it kind of shaves off sometimes these things that
are really, really annoying. Yeah, yeah, I totally agree.
Often the hardest thing with something like writing can be actually starting, right?
Just sitting down and doing that, but having something like ChatGPT that you can almost
use to sort of seed that process a little bit and get the ball rolling and then be able
to have something to iterate with almost.
Yeah, I mean, it's great.
I use it more and more in my day-to-day life.
So yeah, it's really great to see this sort of productivity enhancement
you can get from it, which is great, right?
Cool.
But yeah, it's also really nice to hear your story there as well.
So like everyone's got a different story.
It's always nice to hear how you kind of ended up where you are today.
So that was really nice.
Cool.
So let's talk about DuckDQ then.
So I guess before we do that, give us a little bit more context, or
sort of the problem space in a way of like, yeah, what is the story at the
moment with data validation and data quality in the sort of machine learning
space and the machine learning world?
Yeah, sure.
I mean, I think the paper is now, I'd have to double check, but I probably wrote it four years ago, so things might
have changed a little bit. Even though I think the last two, three years, probably most AI
teams or ML teams were mostly focused on generative AI and kind of leveraging that new exciting
technology in some ways. So I hope what I'm saying still applies today. But
back then, the motivation was that the development process of machine learning models doesn't
stop after the training. Often, the hard part starts when you actually put the machine learning
model into production and you expose the model to live data where you
actually want to make the predictions.
What maybe worked great on the evaluation dataset might at some point not work that
great anymore in production.
Often that goes back to this garbage in, garbage out. If the inputs that the model consumes are not
good, the model will also not produce great outputs. I come from a software engineering
background where I'm used to things like unit testing and so on. And the equivalent of that, like an engineering approach to deploying machine learning models, was kind of something I was interested in. And data quality is really the most obvious thing to
go to. Yeah, it's funny you say that, because actually when I was reading your paper, I don't think the phrase exists anywhere in it, but I did highlight a section and write "garbage in, garbage out" next to it. So yeah, that's that kind of classic phrase there. Cool. So yeah, I guess given that then and kind of
rolling the clock back to 2021 when you wrote this paper, I know things may have changed slightly in
the intervening years, but back then, and I mean probably still today in some places,
what were the state of the art
approaches then for sort of trying to ensure this data quality in machine learning and
getting a little bit more sort of closer to what you would be used to in a software engineering
world really?
What were the state-of-the-art approaches, and why were they terrible?
So I guess there were some quite interesting solutions in the big data space.
So my PhD advisor used to work for Amazon, for AWS, and
they were developing a framework based on Spark to do data quality evaluation at scale
and to do it efficiently.
So it's something that was used, I believe, internally at Amazon, but it's nowadays also available as an external service, one of the many offerings of AWS.
The solutions that existed for data scientists who don't work with Spark clusters, who work on maybe smaller, let's say moderately sized data: there was Great Expectations, for example.
That's one of the libraries that was quite popular back then. In terms of features, I think Deequ, which was the Spark-based system, and Great Expectations do very similar things. They
help you with determining distributions of your input data, of your training data, and give you
a way to monitor or to profile your production data against these baseline distributions.
So you can see a shift in distribution that might break the assumptions you were making
during training.
And it can also highlight other more basic failures like, okay, one column is suddenly
null in my data because some downstream connector just broke.
There's something, you know, temporarily broken. And if we get this data into the model, then either the inference pipeline will break because it doesn't account for nulls in columns,
or it will maybe impute the nulls with zeros and then you will get really unexpected results.
So these libraries help with monitoring these kind of data properties and with
then raising alerts in a similar way to how you would maybe trigger an alert
with any other service degradation in the company architecture.
So, yeah, and there are other tools, things like whylogs or Deepchecks, or Soda. I Googled a little bit just before to make sure I'm kind of up to date. If we focus on Great Expectations,
which was the thing that was around at the time, it was based on pandas. It was using
pandas as the engine to determine all the summary statistics we
mentioned. To get a sense of the distribution of the data, to get min, mean, max,
whatever, standard deviations, these kinds of things, number of missing values.
It was using pandas as the compute engine.
It was also determining each metric one at a time.
So I want to get the min, the mean, and the max of a column.
I run three individual pandas queries to determine those.
The idea was to look at how Deequ does it, and when we look at how Deequ does things, it is very, very different.
It tries to determine which of those operators can be grouped into a single query. Then it
only needs to scan the data once, which is much, much faster than scanning it multiple times for each metric.
Well, there happened to be an efficient database engine, because I guess a database engine instead of a data frame engine felt more natural for this type of problem, or for this type of optimization. And yeah, that happened to be DuckDB, which was kind of fitting this user group in terms of workloads, right?
We didn't want to build, there was already a solution for Spark that scales indefinitely,
but there was a missing solution to do data quality validation efficiently
for small to medium sized data.
And yeah. Yeah, I guess that is also a significantly large portion of the market as well, these people with small and medium-sized data, right? Not everyone has big data, right? I know it's a bit of a catchphrase, a saying that gets said a lot, but yeah, not everyone has big data, right?
So yeah, we need to sort of build solutions for the other part of the data space as well.
So yeah, I guess with that, you mentioned briefly that DuckDQ, I can kind of see where
the name comes from a little bit as well.
I can see how that DuckDB and DQ are sort of merged into DuckDQ.
So yeah, give us the TLDR then on DuckDQ.
It depends on which angle we want to focus on.
I can talk about the kind of optimizations we were doing, or about the integration with scikit-learn.
I guess on a very high level, the goal was to provide something that is super easy to
use, has very few dependencies, and it just works.
You can just plug it into your existing scikit-learn pipeline locally.
And it just tests all the data that goes into that pipeline and gives you a warning when
it doesn't adhere to the quality metrics you specified.
That was the high-level idea.
Yeah.
Cool.
So yeah, I guess kind of off of that then, let's dig into the whole design, I guess, right? So maybe we should start off with you telling us how it integrates with scikit-learn pipelines, and then we can focus on the specific things you needed from DuckDB and why DuckDB was a good fit for the problem you were going to solve. So yeah, give us the high-level overview of the integration with scikit-learn, and then we can get stuck into how DuckDB is actually used in the library.
The scikit-learn integration was quite straightforward.
So it has this notion of pipeline steps and we basically added two pipeline steps to a scikit-learn pipeline.
And one of the steps is the input validation step.
And the other step is the output validation step.
And the input validation step makes sure that the data that goes in adheres to the metrics
you specified and the output step makes sure that the output of the model adheres to those.
We wrapped it up into a Python package.
Instead of importing a scikit-learn pipeline, you can import a DuckDQ pipeline and use it in the exact same way you would use a scikit-learn pipeline. Then you can specify these data quality metrics on this pipeline object.
And when it runs into an error, there are different levels of errors.
So it can be, I think, info, warning, or error. When you hit a warning, it would just output a log message, same with info, and when it's an error, it will actually raise an exception.
And what is maybe quite neat is that you can also serialize the entire pipeline. So you can really just pick up the entire thing and copy it into your deployment system, and you will have the pipeline, including the checks, everything, within this serialized package. I know Pickle probably isn't nowadays the right standard to do that; I know there are some security issues with Pickle.
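To make the drop-in idea concrete, here is a minimal sketch of the pattern described above. This is not the actual DuckDQ API; the class name and check format are hypothetical, and it only shows the general shape of wrapping a scikit-learn pipeline with validation at different severity levels.

```python
# Hypothetical sketch of the drop-in pattern described above (not the actual
# DuckDQ API): a wrapper around a scikit-learn Pipeline that runs declarative
# checks on the input data and either warns or raises, depending on severity.
import warnings
import pandas as pd
from sklearn.pipeline import Pipeline

class CheckedPipeline:
    def __init__(self, steps, checks=None):
        self.pipeline = Pipeline(steps)   # the ordinary scikit-learn pipeline
        self.checks = checks or []        # list of (description, predicate, level)

    def _validate(self, X: pd.DataFrame):
        for description, predicate, level in self.checks:
            if predicate(X):
                continue
            if level == "error":
                raise ValueError(f"Data quality check failed: {description}")
            warnings.warn(f"[{level}] data quality check failed: {description}")

    def fit(self, X, y=None):
        self._validate(X)                 # input validation before training
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        self._validate(X)                 # validate live data before inference
        return self.pipeline.predict(X)

# Example checks: fail hard on nulls, only warn on suspicious values.
checks = [
    ("age has no nulls",   lambda df: df["age"].notnull().all(), "error"),
    ("income is positive", lambda df: (df["income"] > 0).all(),  "warning"),
]
```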
Okay.
I really like the design philosophy of it being a drop-in replacement as well. Right. So the user basically has no extra dependency or anything like that. You just unplug one and plug in the other, and you get all the benefits of having these data quality checks for free, which I think is really nice; it reduces the friction for use, right, which is always really nice with something like this.
I just thought it fits so nicely with the second question you asked about DuckDB. It is a drop-in replacement, and the input to this pipeline is still a pandas data frame.
Whatever happens under the hood, it doesn't really matter for the user.
But there's a really cool thing that DuckDB enabled at the time, a feature that was basically added right around the time when we wrote the paper, so it was really great timing: zero-copy reading from pandas data frames.
What that enables is you can write a DuckDB query, like a select star from table, and
the table is actually a pandas data frame that lives in memory in your Python process.
The costs of doing that, or the overheads, are very, very minimal. It doesn't add memory overhead because DuckDB essentially reads from the address space of your pandas data frame. The compute overhead compared to querying it in the native DuckDB database format is also not that high, because the memory layout of pandas data frames is not so different from DuckDB's memory layout.
Yeah, so that was really an enabler for making it so nice and easy to use while also being
very efficient.
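As a small illustration of the zero-copy integration he describes, DuckDB's replacement scans let a SQL query refer to a pandas DataFrame that is in scope by its variable name; the DataFrame and column names below are just made-up example data.

```python
# Minimal illustration of zero-copy pandas reading: DuckDB resolves the name
# `df` to the in-memory pandas DataFrame and scans its columns directly,
# without copying them into a database file first.
import duckdb
import pandas as pd

df = pd.DataFrame({
    "city":   ["Amsterdam", "Berlin", None, "Paris"],
    "income": [52_000, 61_000, 48_000, 75_000],
})

stats = duckdb.sql("""
    SELECT
        count(*)                AS n_rows,
        count(*) - count(city)  AS city_nulls,
        min(income), avg(income), max(income)
    FROM df
""").fetchall()
print(stats)
```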
Did that actually just happen around the same time? Was that just good fortune, or was tackling this problem partly because you knew this new feature was coming and would really be a boon for such a solution, that you'd have this zero-copy functionality?
So I guess it was actually the other way around: this scikit-learn integration, or this pandas integration, was an idea that developed in the process of developing this data quality tool, of bringing Deequ to DuckDB.
And so, what I said earlier about the motivation is actually something that evolved while working
on the problem. We thought, oh, well, this would actually be really cool to develop.
Around that time, we were then using that just-merged feature, basically. I think we were a good test ground for this functionality as well.
I think there were some performance issues that we found around strings or something.
I guess it was helpful for DuckDB as well as for us to work on somewhat real-world workloads for this specific use case. And it was great to have a short wire to the developers.
Yeah, of course.
Yeah.
Having that sort of feedback loop there and that communication link must be really helpful.
And yeah, I guess also with this sort of being DuckDB focused, what other features from DuckDB did you specifically rely on as part
of the DuckDQ solution?
Yeah, I think the features that we are using are relatively basic because we do basic aggregations.
So min, max, mean, and so on and so forth.
One thing that we relied on, I guess it's not really a user-facing feature, but something implicit to the database which I think has been really beneficial, is just the execution model of DuckDB, which is this vectorized execution. It kind of operates almost like a streaming engine. Every operator, whether it's mean, min, max, or standard deviation, is implemented in a way that it has intermediate state: it will determine, say, the max of a batch of 2048 values and then remember that max as the state of the operator. Then the next batch comes, and so on and so forth, and it will always update this state.
That of course gets a bit more complicated with multi-threading and so on, but in principle
that's what happens. And that happens to be something
that DuckDQ also uses and exploits. And for one particular function, we had to extend
DuckDB to expose this internal state to us so that we can also use it and persist it, because Deequ's computations are also stateful.
I can explain why.
That comes from this notion of incremental loads into a data warehouse or data lake where
you get every day a new small additional batch of data that gets appended to your existing
table.
Now when you're interested in the mean or max of the table, you wouldn't want to
scan the entire table again.
You would want to have some intermediate result of everything you have seen so far, and then only compute the state of the new rows that were added and, yeah, exactly, add those states up. We persist those states in DuckDQ, so that saves a lot of recomputation.
For Max, that's easy.
For standard deviation, I think we had to expose some internal values from the DuckDB
operators.
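A rough sketch of that mergeable-state idea, using a hand-rolled aggregation state rather than DuckDB's internal operator state: each batch of data produces a small summary that can be persisted and merged with the summary of newly appended rows, instead of rescanning the whole table. This is illustrative only, not DuckDQ's actual implementation.

```python
# Sketch of stateful, mergeable aggregation: count/sum/sum-of-squares/max are
# enough to recover mean, standard deviation, and max without rescanning old data.
from dataclasses import dataclass
import math

@dataclass
class AggState:
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    maximum: float = float("-inf")

    def update(self, values):
        for v in values:                      # process one batch of values
            self.n += 1
            self.total += v
            self.total_sq += v * v
            self.maximum = max(self.maximum, v)

    def merge(self, other: "AggState") -> "AggState":
        return AggState(self.n + other.n,
                        self.total + other.total,
                        self.total_sq + other.total_sq,
                        max(self.maximum, other.maximum))

    @property
    def mean(self):
        return self.total / self.n

    @property
    def stddev(self):                         # population standard deviation
        return math.sqrt(self.total_sq / self.n - self.mean ** 2)

# Yesterday's persisted state merged with the state of today's new batch:
old = AggState(); old.update([1.0, 2.0, 3.0])
new = AggState(); new.update([10.0, 11.0])
combined = old.merge(new)
print(combined.mean, combined.stddev, combined.maximum)
```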
And that's maybe also something that was nice in DuckDB that we could...
I mean, it's open source, right?
You can't do that with every database system.
And it's also kind of easy to follow the code.
It's well structured in a way. Now it has this community extension mechanism, where it's actually really well documented how you add custom functionality to DuckDB. It has the Discord community, and also GitHub; I think back then there was also a lot happening on GitHub.
Still is. I'm always impressed by the amount of care that the DuckDB people put into responding to questions,
to bug reports, to everything.
So when you want to know how something works, you will likely find an answer from someone
at DuckDB Labs who will tell you, okay, you should do this or that.
Yeah.
It makes for a nice developer experience, right?
When you kind of have that, you can ask a question and you know you're going to get
a very thoughtful, insightful answer back, which is really helpful. It's interesting that you said as part of the process that
you ended up needing a feature from DuckDB that it didn't necessarily have, and so you were able to expose this intermediate state. And it's interesting, because I like to ask, how did your research then influence the core product, right? And that's a really nice example, that you needed this. And I guess, do you have any
insight as to how that has been in the long term? Have other people then relied on this ability to have access to this intermediate state or not?
It seems like an interesting thing that you encountered, and it'd be nice to
know how many other people have then gone, oh, given I have this, I can now do the thing
I wanted to do as well.
Yeah.
I mean, we never upstreamed that actually.
So I think that's a bit of a shame with that project, that it has never really gotten into a phase where I would consider it ready or mature enough to actually use in practice. Unfortunately, it's one of those research artifacts that were created for a paper, which didn't even get as much attention as I was hoping it would get. It was actually very hard to get it accepted, even though it was quite a lot of engineering effort to build it. At the same time, the main purpose was to get a research paper, and afterwards I had
to move on to other things. It brought me a lot of experience personally that I could
use later on. I happened to work in a role later on where we were working with Deequ, the Spark-based library. I could help the company I was working for with making it more
efficient. Later on, we also introduced DuckDB in their stack to get rid of these overheads that
Spark introduces. It was helpful for me, but I wish it would have been helpful to more people.
And just when you reached out, also like a week later, I talked with my advisor and said,
I really want to pick this up again.
We should at least get it released on PyPI and figure out some small things, like, as I said, this custom extension, this kind of fork we did of DuckDB. We should sort that out. There's also a dependency on Apache DataSketches, which is an amazing library. I don't know if you want me to talk about data sketches, but it's instrumental for a couple of things that we do in DuckDQ.
And it's quite an interesting way of doing approximate query execution, basically.
And so that's also a dependency that doesn't feel really nice. Now with the extension
mechanism of DuckDB, I really feel it's time to build a community extension
that just incorporates
all of this into one package.
Yeah.
Yeah.
It'd be really cool to see it get picked up again.
So maybe it's kind of had a couple of years just sat there cooking on a low heat, but now it's time to maybe turn up the heat a little bit and turn it into a product people can use.
That'd be really, really nice. I mean, also with the kind of saturation in AI, or at least with the, let's say the growth curve
is like flattening a little bit. I mean, there's still new models every day. There's still new
product announcements, whatever, every day. But maybe we're not going to see such fundamental improvements, at least in language model capabilities.
The general purpose foundation models will probably go more into specific application
domains and so on and so forth, where we see improvements or just building products based on this amazing technology.
I hope it frees up some time also for teams to focus back on classical machine learning
because I read this number somewhere, unfortunately I don't remember where, but 90% of the machine-learning-based workloads that are actually running in companies are still classical machine learning. So they actually make the largest amount of money and impact. So yeah, I still think there's a lot of room to provide better tooling for that.
Yeah, I agree on that.
I mean, a lot of the time, I think you find this often even within data management, within databases. As I've said, there are certain concurrency control protocols, for example, that are more complicated and have been demonstrated in the research literature to actually get you better throughput, lower latency, whatever. But because they are a lot more complicated, intricate solutions, lots of systems still fall back to maybe using a lock-based system, which is very simple to reason about and put out there. And that's probably the reason why it gets put into practice, right, because people can reason about it a lot more; it's a lot more explainable. And that's probably true as well of some of these more advanced machine learning techniques compared to some of the classical approaches, which are more understood, I guess, and you have more confidence putting them out into production, into the wild.
And yeah, and it's interesting.
You mentioned a second ago about implementation. So let's talk on that a little bit more and cast your mind back to when you were working on this; you said it was quite a significant implementation effort. So yeah, tell us, what was that like? How long did it take? What sort of things did you encounter that were a bit like, oh, this is difficult, maybe scoped towards DuckDB in particular?
So, I mean, the first challenge was to actually understand exactly the way in which Deequ is built, the original system that inspired this work, and then to come up with an architecture that resembles what Deequ is doing but makes it possible to do that in SQL, because Deequ was actually built on PySpark, so the declarative kind of wrapper of Spark where you can write kind of pandas-like code.
And we also wanted to build a solution that is kind of flexible, where you can plug
in other SQL databases, not necessarily only DuckDB.
Well, yeah, I guess that required some thinking about how to architect it.
We ended up, and I'm looking at the paper now at this system architecture that I have
here.
So we ended up having a user-facing API that is pretty much the same as Deequ's original API, which means you have a declarative API to specify certain quality constraints. Then we have a second layer of abstraction where we handle all this verification logic, the state management
of the intermediate states that I mentioned earlier, and to translate these different
quality constraints into concrete operators that we want to use during the profiling.
And there are two distinct sets of operators. One of them are scan sharing operators and the other one are group scan sharing operators.
For the user, it is not important which quality metric maps to which of those operators.
The user basically only says, okay,
this column should never be null, the standard deviation of this column should never be larger than this, here the histogram should look like this and that. And we map those to a certain set and type of operators that we need to execute. In the end, we essentially need to generate SQL queries that execute those operators. The nice thing about the
scan sharing operators, about the separation into these two different kinds of operators,
is that in the end, you only need to run two queries, one for the scan sharing
operators, one for the group scan sharing.
When you compare that, imagine you have a dataset with 500 columns and with 50 different
constraints, then in pandas, that would be a lot of individual operations.
In DuckDQ, that would be only one or maybe two queries
if you happen to have one that needs a group scan sharing operator. I mean, I can talk
about the distinction, because it's maybe interesting to understand. So the main assumption is that scanning data is expensive: even though the data might be in memory with pandas, reading every single data point still brings it into the CPU cache and does something with it.
If it does that 20 times at different points in time, that's much worse than just keeping
it in the cache and doing all the different operations on it while it's there.
There's still a big difference between reading from CPU cache and reading from memory.
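Here is a hedged sketch of that scan-sharing idea. The constraint format is hypothetical, not DuckDQ's actual API; the point is simply that a list of declarative checks can be folded into one SELECT, so all metrics are computed in a single pass over the data.

```python
# Sketch of scan sharing: translate declarative constraints into aggregate
# expressions and run them as one query, i.e. a single scan of the data.
import duckdb
import pandas as pd

df = pd.DataFrame({
    "city":   ["Amsterdam", "Berlin", None, "Paris"],
    "income": [52_000, 61_000, 48_000, 75_000],
})

# Hypothetical constraint format: (name, SQL aggregate expression, predicate).
constraints = [
    ("city_not_null", "count(*) - count(city)", lambda v: v == 0),
    ("income_min",    "min(income)",            lambda v: v >= 0),
    ("income_stddev", "stddev_pop(income)",     lambda v: v < 50_000),
]

select_list = ", ".join(f"{expr} AS {name}" for name, expr, _ in constraints)
row = duckdb.sql(f"SELECT {select_list} FROM df").fetchone()  # one scan

for (name, _, check), value in zip(constraints, row):
    print(f"{name}: {value} -> {'ok' if check(value) else 'FAILED'}")
```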
Nice.
I guess sort of leading on from that then, it feels like there's going to be a really big speed-up here. So let's talk about some evaluation results from the experiments you ran when you were evaluating DuckDQ's performance. Yeah, how much faster is it? How much more performant is it over the state of the art, which has to basically do a full table scan over 500 columns every time I want to perform some check?
Yeah. So we compared it, obviously, to Deequ, to the original Deequ, on
workloads that are small or moderately sized, which we defined as, you know, a few million rows. I think the first one is on three million rows and eight quality
constraints.
When we run this quality evaluation on the Spark-based Deequ version, this takes around
seven seconds to do this evaluation.
If you think about the production machine learning pipeline, where you want to do predictions,
that's quite a big amount of time.
Well, you don't make three million predictions at once, maybe, but that's why it's important
to also look at the static overhead of running a Spark-based pipeline.
In the experiments here, it was 47 seconds just for starting the Spark process and then reading the data from CSV into Spark. But usually,
no matter how much data you read, it will take at least like 8, 9, 10 seconds to start up the Spark cluster. DuckDB compared to that only took 0.6 seconds or so to do the evaluation.
So even if you are in a production setting where you want low-latency predictions, for very decent batch sizes you still have quite a minimal overhead to do this quality validation.
Versus firing up Apache Spark to run every time, right? Like eight seconds just to get the damn thing up to actually be able to do it. So yeah, it's a massive improvement over that, for sure.
I guess you mentioned that obviously the project has sort of been on the back burner for the last couple of years and that you would maybe now think about picking it
up again and seeing if you can make it into a tool that people could actually use in,
in a production setting.
So I guess the question then is what do you think would be needed to actually
get to that point from where we are today
with it?
I think what it would really need is a C++ developer in the first place.
So that's always a bit of a rare find among machine learning people or people that are very interested
in machine learning and data science, because obviously Python is the most common language there. DuckDB has SQL-only extensions as well, where you need only a very minimal amount of C++ knowledge to implement new functionality for DuckDB. But that only really makes sense if the functionality can in some way be expressed with the existing SQL functions. I
think in that case we would need some custom functions. I would really wish that Apache DataSketches would be wrapped up into a DuckDB extension; that would be super useful. DuckDB has an approximate count distinct operator, so you can determine the number of distinct values in a column by only scanning once. That's an optimization that Deequ uses, but Deequ also uses an optimization for quantiles, to determine basically column distributions, that thing called KLL sketches. And having this as a DuckDB function, to be able to compute a KLL sketch over a column, would be one of the enablers for DuckDQ. And I think it would be super useful beyond that.
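For reference, DuckDB's SQL dialect already includes approximate aggregates such as approx_count_distinct and approx_quantile, which estimate these values in a single scan; exposing reusable sketch state as in Apache DataSketches, which is what's discussed above, goes beyond this. The table below is synthetic example data.

```python
# Approximate aggregates that already ship with DuckDB: one scan estimates the
# number of distinct values and a quantile over a synthetic example table.
import duckdb

duckdb.sql("""
    CREATE TABLE events AS
    SELECT (random() * 1000)::INT AS user_id FROM range(1000000)
""")

print(duckdb.sql("""
    SELECT
        approx_count_distinct(user_id) AS approx_users,
        approx_quantile(user_id, 0.5)  AS approx_median
    FROM events
""").fetchall())
```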
Cool, there we go. Those are the steps then for the roadmap. So 2025 Q2, we're going to have it done, Till.
Yeah, exactly. Let's have a call again then, let's check in again then. I would just ask ChatGPT to write it.
There you go, there's your C++ developer.
I guess the next question I want to ask you is about reflecting; obviously it's been quite a few years now since you worked on this, but to gauge what your overall experience was like working with DuckDB, obviously it's probably a very different sort of experience back then
than it would be now, maybe because things like the community extension have come around
and the documentation has improved a lot over the intervening years. But yeah, what was
that experience like when you reflect on it? And also, what would be your advice to would-be
extenders of DuckDB or people who would want to maybe use it in their research or in their
side projects? Yeah. So that's an interesting question. I guess you asked multiple questions.
Sorry. Yeah. I have a habit of doing that.
What was your experience like, reflecting on it, working with DuckDB?
That was the first question. Then we'll do the next one.
So there's two for the price of one.
My short-term memory is corrupted already.
I think the experience was great at the time.
So I expect it to be similar today.
Of course, now working at MotherDuck, we have a professional relationship with DuckDB Labs. So we have a DuckDB Labs partnership that gives us
access to actual development resources. Back then it was more informal. So it was just like,
okay, yeah, that sounds great. Let's do it. Or maybe not. Or please make a PR. I think that's
something that still works great today. If something is missing in DuckDB and you feel it should be there, why not open a PR to add it? So that's always the way, I think; that also still works today.
Yeah, I mean, there's so much to learn by looking at the DuckDB source code as well.
Especially the way DuckDB does testing; I think that's a very, very good example of how software should be tested.
It's just so, so extensive.
Every single feature has a test suite.
Every edge case is tested.
When I merged the CSV sniffer, I guess I spent more time developing tests and finding my own bugs in the process and so on than I spent on the initial implementation.
That's good.
That's the way they say it should be, right?
So you should be spending more time writing tests.
The actual writing of the production code is probably the
smaller portion of the time.
Yeah.
I guess the second question I asked you a second ago was what would your advice be
to anybody thinking about using DuckDB or extending it or using it in the research?
Yeah, I guess it really depends on where you're coming from. I think it's interesting for people working on systems like data management systems.
It's kind of a breeding ground for master students, PhD students coming to the CWI who
develop various extensions, optimizations, things for DuckDB or on top of DuckDB.
But then on the other hand, there are also researchers in the geospace or in chemistry
or in bioinformatics that have just the need to process relatively large amounts of data
in an efficient way locally.
DuckDB just offers a great solution for this.
And I mean, there is a great geo extension, but yeah, I've not seen a chemistry extension,
for example, for DuckDB, even though I'm sure there are lots.
I had a bioinformatics course at my university during my studies. It was a super
old Java program that we used to do some sequence alignment or something. I'm sure a tremendous
amount of domain knowledge went into this. For being what it is, it was probably really, really well thought through and optimized.
Maybe there's still huge potentials to do these kinds of analysis much more efficiently
by putting on the database management glasses and thinking, well, could we maybe actually
put this in a DuckDB extension?
Can we fit this to DuckDB's execution model and maybe just use that? And yeah, I hope DuckDB will develop a bit more into this ecosystem,
into this open ecosystem, a bit like R maybe.
People just develop new R plugins. Why not develop a new DuckDB plugin?
Yeah, it's interesting that you mentioned that, because I've experienced a similar thing, where a friend who was doing her PhD in bioinformatics or genomics, somewhere in that space, I can't remember the exact specific domain, said, can you help me with this program? And I was like, okay, this is a janky old Java-based thing. Obviously it was very focused on doing this one specific task, I can't remember what it was, but I was like, this feels like such a bad user experience. It wasn't a nice API or anything, but obviously it was the state of the art in that specific field. But yeah, there are probably loads of opportunities there to apply all the really awesome techniques from DuckDB and from data management to those sorts of spaces as well.
Yeah.
And it's not only about just making it fast, right?
It's like, okay, it's cool if it's faster, but sometimes I think what is much worse is, oh, that's a
type of analysis that I wish I could do, but my program always
crashes when I do it. And I think then really, yeah, it really enhances the possibilities
here.
Cool. I've just got a couple more questions now, before we wrap things up. The next sort
of, I guess, high level section, shall we call it, is about impact. So obviously this
paper's been around for three or four years now.
What do you think the impacts of the paper have been?
And have you had any feedback from people over the years on it?
Yeah, so I've heard from people from time to time saying, yeah, okay, well, this looks interesting.
I would want to use it.
In which state is it?
Is there a PyPI package for it?
That's kind of where my future plans were coming from.
Actually I haven't checked the citation counts on it.
That would actually maybe be an interesting thing.
I guess it's relatively, yeah, there have been two citations, but they're not really
all too related. I guess one problem with that is it was basically like a development,
like an engineering problem in the end. I did not really solve a research problem.
I made some benchmarks around it and so on, but it was not something revolutionary, novel.
And I guess that was something that was not appreciated
that much from the research community.
But then on the other hand, to make
it useful to actually achieve real-world impact
with the work, it really shows that you
need to put in more effort than you
would put into a research paper.
I think DuckDB is also an example of that, where the focus is really on the engineering
excellence, on making it a great tool for users, thinking about the user problems and trying to solve them and not
necessarily about only innovation.
Sometimes it's great to pick up existing things that people might have neglected for a long
time, but in that context, they are suddenly useful.
That's kind of what I think I've learned from it. Sometimes it's worth putting in the extra effort to focus on the users and think about what we can do for them, how we can make it easy for them to adopt, things like that. There are also great companies nowadays. I think
WhyLabs, which is another company that works on data quality, actually also looked at Deequ, I think, to some extent. Their library also got inspired to some extent by how Deequ works. It's a company, it's a business around that.
And as, you know, a PhD student developing a project, it really takes a lot of dedication. And I guess sometimes open source project maintainers are really not appreciated enough
for the amount of work that they're doing to give us these free open source projects
that are well maintained.
Yeah, I completely agree with you on that.
Another question that maybe leads on from this one a little bit about lessons you learned
and stuff whilst having this experience.
What was the most surprising thing that you encountered while working
around data quality and with DuckDB and in this space, what's been the biggest sort of
thing?
Like, I did not expect that.
That's a good question.
I guess what I didn't expect, I mean, it's not something I necessarily learned in this particular project, but a little bit earlier, leading up to it: I thought Spark was this tool that is for big data and is the most efficient thing you could think of. But then when I actually started using it, I just realized, oh wow, there's something that is so much faster. So it kind of opened a new world, thinking, well, okay, sometimes maybe it's really good to focus on a specific slice of the problem and say, okay, we don't solve everything; we focus on small to moderate-sized data and try to make a great experience for these types of users, instead of saying, oh, we want to solve everything, we want to scale indefinitely, and accepting that there are trade-offs when you do that.
Cool.
So yeah, just lastly, I'd like to touch on your other work that you've got going on as well. Let's do that first, and then I want to have a little recommendation feature where you recommend an extension, a plugin, or a feature related to DuckDB that you like from the other things out there. But before we do that, you obviously do a lot of other really cool work as well, and two of those things are related heavily to DuckDB: a feature store and text-to-SQL. So give us the quick rundown on those two topics.
So the work on DuckDQ brought me to the topic of feature stores. So feature stores are basically
data management systems for machine learning, for data scientists and machine learning engineers
that manage machine learning features. And one integral part of a feature store system is also data quality validation.
I think feature stores are a great solution for the social aspects of developing machine learning models: where in software development you have maybe code repositories, the feature store feels like the equivalent for machine learning engineers, data engineers, and data scientists collaborating together on building and productionizing machine learning models, enabling reusing features and so on, and discovering features that others have prepared. Then on text-to-SQL, which is a completely orthogonal topic in a way.
It is something that, yeah, I kind of started working on, you know, when the large language
model thing took off one and a half, two years ago. It's really hard to make a connection to data quality, even
though I've worked so much more with language models. In the end, as I said, a lot of the
teams that were building the machine learning models are now maybe building LLM-based pipelines. The dynamics
there, in principle, there are similar problems. You have this kind of thing that is behaving
in a weird way, which is on one end maybe the language model or on the other end the
machine learning model. You give something in and you get something out and somehow it's a little bit hard to verify
whether that's correct or not. So yeah, the MLOps, or now LLMOps, space is evolving and trying to also provide tools for these GenAI or AI engineers to harness these large language models
in production. So yeah, it's kind of, it's a super interesting problem also, I think, from the data management perspective,
so research perspective there.
And I guess with text-to-SQL, I've come full circle from my interests; databases and machine learning, it just brings everything together.
Nice.
Cool.
In terms of where the listener can maybe go and play with these two tools, are they available? Where can we go and find the feature store and the text-to-SQL work? How can I go and play with those two?
Before joining MotherDuck, I worked for Hopsworks, which is a company that builds an open source feature store. So if you go to hopsworks.ai, you will probably find it. It also partly runs on DuckDB. And for text-to-SQL, I guess you could sign up to MotherDuck and try it out there.
There's tons of research about text-to-SQL.
It's really a hot research topic.
I don't think the best method has been found yet.
And it's still super hard to do things that produce verifiably correct results, in the sense that not only does the query that comes out of a natural-language-to-SQL system parse, but it also semantically makes sense.
It's really a tricky problem.
Nice.
I guess, yeah.
So the penultimate question then is the recommendation, given obviously all the cool work you've done over the years. What's your favorite DuckDB plugin, extension, or just a feature that it has?
So not so long ago we added a function to MotherDuck which allows you to call language models from SQL. So then you can do something like, if you have a text column, you can write a SQL query, select prompt, 'summarize this column', as summary from my table. And then for each row in your table, it will call OpenAI's GPT-4o mini, which is a low-cost language model, and it will produce, for example, a summary. It also works with structured outputs, so you can actually convert text to structured JSON, and DuckDB happens to support JSON. You can unwrap them, and suddenly you can convert unstructured text into a new set of columns in the database.
I think that's quite interesting. There's still lots of ways this can be optimized and so on. I really
like the fact that one day after we released that, someone published an open version of that functionality as a DuckDB community extension. That's called open_prompt. So yeah, I was pretty stunned by how fast that happened. Yeah.
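A hedged sketch of what that looks like in practice: the exact prompt() signature, the 'md:' connection string, and the reviews table below are assumptions for illustration, and running it requires a MotherDuck account.

```python
# Illustrative only: calling a language model per row via MotherDuck's prompt()
# function (assumed usage; an authenticated MotherDuck connection is needed).
import duckdb

con = duckdb.connect("md:")  # connect to MotherDuck with a configured token

con.sql("""
    SELECT
        review_id,
        prompt('Summarize this review in one sentence: ' || review_text) AS summary
    FROM reviews
    LIMIT 5
""").show()
```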
Yeah.
That's the pace of the LLM space and the AI space, it's so fast-moving, right? But yeah, that's an example of moving fast. Cool. I guess the last question then, Till, is: what's the one thing you'd like the listener to take away from this podcast chat today?
I guess this goes mostly to the research audience, people being involved in research from different fields; I guess people from many areas of computer science research are listening. I personally think this intersection of machine learning and
data management systems is super interesting. There are so many new ideas and possibilities in the interplay of machine learning, generative AI, and data management systems in combination, at the intersections of those different worlds.
So many interesting things that can be done, whether that's making data exploration easier or making
text analytics easier.
It goes also the other way around.
How can we maybe use data management systems to make certain ML problems easier, like these data quality checks or data validation?
Yeah.
And of course, if you do any research, you should do it based on DuckDB. Make it a community extension so everyone can benefit from it, and don't make the same mistake that I did.
That's a great message to end on there. Thank you very much. It's been a lovely chat today. Where can we find you on social media? Where's best for listeners to reach out to you if they
are interested in the things we spoke about?
Yeah, oh, and now I forget the handle. What is it, the @ usually isn't in the domain, right? It's not? No. Well, my Twitter handle is the same. I suppose you can add them to the audio description.
Yeah, I'll drop a link to everything we've mentioned in the show notes today. And yeah, listeners, go and check them out.
But yeah, it's been a great chat.
Thank you very much, Till.