Disseminate: The Computer Science Research Podcast - Daniël ten Wolde | DuckPGQ: A graph extension supporting SQL/PGQ
Episode Date: March 20, 2025

In this episode, we sit down with Daniël ten Wolde, a PhD researcher at CWI's Database Architectures Group, to explore DuckPGQ, an extension to DuckDB that brings powerful graph querying capabilities to relational databases. Daniël shares his journey into database research, the motivations behind DuckPGQ, and how it simplifies working with graph data. We also dive into the technical challenges of implementing SQL Property Graph Queries (SQL/PGQ) in DuckDB, discuss performance benchmarks, and explore the future of DuckPGQ in graph analytics and machine learning. Tune in to learn how this cutting-edge extension is bridging the gap between research and industry!

Links:
DuckPGQ homepage
Community extension
Daniel's homepage

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate, the computer science research podcast: the DuckDB in Research series.
Welcome to the DuckDB in Research series.
This is a new show, and maybe some of our listeners aren't actually sure what DuckDB is.
So let's start off there.
So DuckDB is an open source in-process SQL database designed for fast and efficient analytical query processing.
It's really simple to install, deploys with zero external dependencies, it's extensible,
integrates easily with all your data science tools, it has APIs for things like R and Python,
and you can directly read file formats like Parquet and CSV. So it's jam-packed full of
features.
And then you might ask, why are we doing this DuckDB in Research series?
So what's it all about?
So, well, Disseminate as a podcast is all about impact and helping to bridge the gap
between research and industry.
And DuckDB is the perfect example of these two communities working together, in the sense
that ideas from decades of research are at its core, and it's now actually
influencing the research agenda itself as a platform for others to build on. And that's kind
of a nice segue into today's episode. I'd really like to welcome our guest, Daniël ten Wolde,
who'll be telling us about DuckPGQ. So welcome to the show, Daniël.
Thank you very much for the introduction.
Cool. So before we get talking about DuckPGQ, let's find out a little bit more about yourself
and yeah, how you ended up researching databases.
I'm a PhD student at the CWI Database Architectures Group. Before that, I did a bachelor's and
a master's degree in computer science, also in Amsterdam. It was quite a machine-learning-oriented programme,
but there was one course, Large-Scale Data Engineering by Peter Boncz, who is now my daily supervisor.
And he had master's thesis topics to choose from. And one of them was a project with this database
called DuckDB. And I wasn't familiar with DuckDB. I was only familiar with like the
basic SQL and databases were kind of like a black box
to me.
But anyway, I chose this topic because I thought it was a fun, fun project.
And then I learned how complex they actually are, these systems.
And I got kind of sucked into them.
And then I started to think about doing a PhD to continue on this topic.
My master's thesis was on multi-source pathfinding on graphs.
And then I could continue in a PhD.
And yeah, that's kind of how I rolled into it.
Nice.
So yeah, you peeked behind the curtain and that was that, you were hooked.
Yeah, I am.
And yeah, I got kind of hooked by my daily supervisor at the time.
Yeah, so we're here to talk about DuckPGQ today, which came about as part of your PhD.
It's actually an extension to DuckDB that people can go and download and use;
I'm sure we'll get onto that at some point later in the show.
But before we do that, how about we give the listener the TL;DR on DuckPGQ?
Yeah, give us the elevator pitch for it.
What is it?
Yes.
So at its core, DuckPGQ makes it easier to work with graph-like data from your relational
database.
So you already have your data loaded in the database and then DuckPGQ can just make the
graph pattern matching a bit easier and pathfinding a bit easier.
And graph data is in quite a lot of places.
So for example, social networks; ranking systems often have graph structure;
road networks, for example.
But querying these in a relational system with the SQL that many people are familiar with
can be quite challenging. So for instance, trying to find the shortest path
between two nodes in standard SQL,
it's already quite a complex query
that I don't think anybody can write.
So previously, the alternative when you have graph data
would have been to switch to a dedicated graph database
system, and that often means that you have to learn
a new query language.
You have to manage separate setups.
You have to deal with performance trade-offs.
You have to transfer your data, obviously.
And DuckPGQ kind of changes this by bringing graph querying into your DuckDB.
And we extend SQL with SQL property graph queries or PGQ, which is part of the
new SQL standard. And this adds a sort of visual graph syntax that makes it easier to
do pattern matching and pathfinding. And DuckPGQ is one of the first systems to support this
new standard, but it's being picked up by other systems as well, such as Oracle and
Spanner from Google. They recently announced this.
So it's starting to become a bit more popular.
And we also have some built-in graph algorithms such as PageRank, just to
make the graph analytics easier from your relational system.
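For listeners who want to follow along at home, here is a minimal quickstart sketch using DuckDB's Python API. The install-from-community step is the route discussed later in the episode; everything else is plain DuckDB:

```python
import duckdb

con = duckdb.connect()                         # in-process: no server to set up
con.execute("INSTALL duckpgq FROM community")  # fetch the DuckPGQ community extension
con.execute("LOAD duckpgq")                    # activate the SQL/PGQ syntax
print(con.sql("SELECT 42 AS answer"))          # DuckDB itself keeps working as usual
```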
So I guess there's a few words and phrases you mentioned in that which
you could probably go into in a little more detail for the listeners, so we can
really set the background and the context before
going into the meat later on.
So let's start off with the query language, SQL/PGQ.
Give us a brief history of the language story there.
How did this come about, and what's the space looking like?
Yeah, so I mentioned there's all these other graph database systems out there.
And previously each of them had their own query language, their own implementation.
So they might have a different syntax or different semantics, different
definitions of what everything means.
So it was kind of a Tower of Babel, is what I sometimes compare it to: a mix of everything out there.
And then a committee came together to discuss a standard, a common
query language that most big vendors would agree on. I'm sometimes
reminded of that XKCD: oh, there are 14 standards,
let's make a new one to rule them all;
and now we have 15 standards.
In a nutshell, you could say that about SQL/PGQ, but it is a standard that's now part of the actual SQL standard.
So SQL has a standard defined by ISO, and you can use that document. It's all written down:
all the definitions, the grammar rules, for example, are defined.
So you can implement directly from that. And that should make it easier to
have the same implementation across different systems.
So they would use the same query language, similar to SQL.
Almost every relational system uses SQL, maybe with some slight variations.
And now PGQ is also added to the SQL standard.
Nice. Okay.
You also kind of mentioned a second ago, I think when you were
answering the last question, how I actually model this: I have some tables and I want to
model a graph on top of them, right?
How do I get my tables to be graphs, essentially?
So maybe you can give a rundown of a simple example: I have some
data and I want to create some graph representation of that in the system.
How would I actually do that?
Yeah.
So for example, if you have a social network,
I think that's the easiest example.
Your persons, your people are represented in one table.
We call this the vertex table.
And these people can know each other.
So that would be a second table called an edge table.
And in this edge table, you have the source,
the source ID and the destination ID.
So the source and the destination of the edge.
And then in PGQ you define a property graph where you say: this is my vertex table, this is my edge table.
And for every edge table you also have to define the primary key, foreign key kind of relationship to the nodes.
Like this is the source and this is the destination columns that you should use.
And that's basically a layer on top of your existing tables.
So when you update your data, you update your tables,
and this is automatically synchronized in your property graph as well.
You don't need to do any extra synchronization;
under the hood, it's already done.
Nice, cool.
So yeah, also, there's another term there
that maybe the listener might not be familiar with,
and you mentioned it in this:
the idea of pattern matching.
And the shortest path kind of, you can kind of reason about,
so, okay, I want to go from here to there
in the shortest possible way.
Yeah, what is pattern matching?
How does that shake down?
So pattern matching in SQL/PGQ is done with what they call a visual graph syntax, or sometimes
ASCII art.
And it was actually inspired by what is, at the moment, I think the most popular
graph query language, Cypher from Neo4j.
And in that you have nodes, which are represented with round brackets, and
edges, represented with square brackets.
And then you can have an arrow, for example: you have a node, to an edge, to another
node, and that's just a pattern. But maybe there are multiple hops between the two
nodes and you don't know how many; then you can say, for example, a Kleene star or a plus sign
to say there's at least one hop here but I don't know how many, or you can give an
upper limit or a lower limit if you want. So that's one way to make the pathfinding easier in PGQ.
You can also say you want any shortest path.
You can say, I want all shortest paths.
I want the top five shortest paths, for example. This is all possible
in PGQ.
So yeah.
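Continuing the session from the property-graph sketch above, the visual syntax and path quantifiers might look roughly like this. The exact keyword spellings are illustrative, not verified against the current DuckPGQ grammar:

```python
# One-hop pattern: round brackets for nodes, square brackets for edges.
print(con.sql("""
    FROM GRAPH_TABLE (social
        MATCH (a:Person)-[k:Knows]->(b:Person)
        COLUMNS (a.name AS src_name, b.name AS dst_name)
    )
"""))

# Unknown number of hops: a bounded quantifier plus ANY SHORTEST.
# (Kleene star / plus forms also exist, per the discussion above.)
print(con.sql("""
    FROM GRAPH_TABLE (social
        MATCH p = ANY SHORTEST (a:Person)-[k:Knows]->{1,3}(b:Person)
        WHERE a.name = 'Ada' AND b.name = 'Edsger'
        COLUMNS (a.name AS src_name, b.name AS dst_name)
    )
"""))
```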
Cool.
So we have this nice visual syntax for almost drawing these queries out.
So I guess, at a really high level, DuckPGQ is a combination
of two things: SQL/PGQ and DuckDB.
We've boxed off SQL/PGQ there.
So let's talk about DuckDB some more.
So, to discuss the changes you've had to make to accommodate SQL/PGQ in DuckDB,
we should probably first talk about what a query
looks like in normal DuckDB, without any extension.
Can you maybe give us a run-through of
the query journey, essentially, so we have some
groundwork to discuss what you then
go and build on in your extension?
Yeah, so the standard pipeline for a DuckDB query is that first you parse it.
You just check: is the syntax correct?
Is everything in the right place, the select, the from, the where, whatever?
You don't yet check whether the tables exist.
That's the first stage, the parser.
And then this generates an abstract syntax tree.
So, a tree of nodes.
And that gets bound in the bind phase.
And there you check: do the tables exist?
Do the columns exist?
Are the correct function names used with the correct arguments, etc.?
And this creates a logical query plan.
And this logical query plan consists of nodes: for example, a table
scan is one node, and then it goes perhaps into a filter, maybe you have a join somewhere.
And then you can optimize this plan just to make it faster to execute.
And then you have a physical plan that goes into the execution engine.
And there you actually scan the table, there you actually execute the join.
And then at the end you return the result to the user.
So that's, at a high level, the pipeline of a standard DuckDB query.
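You can peek at the tail end of this pipeline yourself: DuckDB's EXPLAIN prints the optimized physical plan it will hand to the execution engine. A small standalone sketch:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

# Each result row pairs a plan type with its rendered operator tree:
# the physical operators (scan, filter, projection, ...) chosen after
# parsing, binding, and optimization.
for row in con.sql("EXPLAIN SELECT x FROM t WHERE x > 1").fetchall():
    print(row[1])
```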
Cool.
So: parser, abstract syntax into history... abstract syntax tree.
There we go.
I can't speak.
Then we do some binding to get a logical plan,
we try and optimize that, then we have our physical plan, which is the operators we're
going to use, and then we actually send that over to the execution engine,
which executes that query and returns the results.
Job's a good'un.
And so we've got the query journey there.
How would you go about taking SQL/PGQ and mapping that to this pipeline?
Essentially, which parts of DuckDB did you have to change,
and what did that actually involve?
I guess that would be a good place to start.
Yeah.
So I wouldn't say change, I would say extend DuckDB. Well, we extended almost every step of the way.
So the main thing is the parser.
We need to support this new query language.
So for that, we have to extend DuckDB's parser.
And actually, we do a little bit of a dirty trick here.
We actually forked DuckDB to get its parser.
And then we implement our own PGQ parser in that.
So our parser is a superset of everything that DuckDB can parse,
plus whatever the new syntax is of PGQ.
So that is one step where we extended DuckDB.
And then in the Bind phase, whenever we see this new syntax,
this new PGQ syntax, there we use something called a bind replace, where
we replace the match statements, the visual graph syntax.
We replace it actually with a standard relational query plan.
So we formulate the joins, we formulate the filters.
So at the end of this bind replace, you have the same plan as you would have if you would
write the SQL query without this PGQ syntax.
Because you need to remember that PGQ is essentially syntactic sugar over your
standard SQL.
You can already write graph queries in the SQL that many people are familiar with.
It's just a little bit more cumbersome to write.
So this bind replace replaces the match statement
and transforms it into a standard relational query plan.
That's another place where we extend DuckDB.
And then for pathfinding, we do something special.
For pathfinding, you can do it with what's called a recursive join, where you take the base results and then you recursively join over these results until you match some condition.
But that's slow, and it runs the risk of exploding results, which is not good. So there we actually use scalar user-defined functions to, first of all, create a data
structure, compressed sparse row data structure, which is an efficient way to represent your graph.
And then we do multi-source breadth-first search over it. So for, say, 256 sources,
we start a breadth-first search and explore the graph for all of them simultaneously.
So those are ways in which we extend DuckDB.
And actually I'm working on some research that also adds optimizer rules and a custom
operator, but we'll go into that a bit later maybe. But yeah, we extended almost
every step of the
pipeline of a SQL query in DuckDB.
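For contrast, here is roughly what the recursive-join alternative looks like as a hand-written recursive CTE over the Knows table from the earlier sketch, computing hop counts from one source. This is a sketch of the slow, potentially exploding approach the UDFs replace, not DuckPGQ's code:

```python
print(con.sql("""
    WITH RECURSIVE paths(dst, hops) AS (
        SELECT dst, 1 FROM Knows WHERE src = 1        -- start from person 1
        UNION ALL
        SELECT k.dst, p.hops + 1
        FROM paths p JOIN Knows k ON p.dst = k.src
        WHERE p.hops < 10                             -- crude guard against blow-up
    )
    SELECT dst, MIN(hops) AS shortest_hops
    FROM paths
    GROUP BY dst
"""))
```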
Nice, cool.
So yeah, just to go back to the first thing you were talking about: introducing
this new parser, and the decision to actually fork DuckDB rather than, I guess,
go the other way.
What was the primary reason for that decision to fork it?
Because I guess it becomes a maintainability question then as
well; as the DuckDB parser is still evolving, you
need to incorporate those changes.
So yeah, what was the main reason behind that, and how are you finding the
maintainability aspects of it as well?
Yeah.
So one of the big reasons is that SQL/PGQ is part of SQL.
So it can be that you have a match statement somewhere
within your SQL query, but outside of that,
you still have the normal SQL syntax.
So you still need to be able to parse all of that.
When we started the project, the extension framework of DuckDB
was not as well defined as it is now.
Like it was a lot more bare bones.
So we figured it was easiest also for maintainability to fork it.
And then whenever an update comes in, we just merge whatever DuckDB
changes they made into our fork.
So it's actually really, really simple to stay up to date that way.
And the alternative is to, I don't know, completely write your own SQL parser, but then you run
the risk of not being able to parse everything that DuckDB can. And then the
maintainability aspect becomes a lot harder.
Yeah. You said that when you first started this, obviously the extension framework was
a little bit more bare bones than it is today.
Do you think that if it was as it is today that things might have been, different decisions
might have been made or do you think it would have just made your life a lot easier?
What's your take on that?
So, I think I rewrote the extension repository three times over, because the framework changed every time.
Right, yeah.
So it would have been easier, yes. I don't know, we still would have forked the
parser, I think, because the parser is really difficult to work with in general. That's a bit
of an old piece of code in DuckDB relative to the other parts of DuckDB. And there are plans to change this in the future, but that's future work.
Would we have done something different?
I'm not sure.
Maybe we wouldn't have used scalar user-defined functions, but it's a bit
hard to tell; with hindsight, maybe yes, but I don't think we would have known.
No.
Okay.
Cool.
Yeah.
So then the next thing you opted
not to do is use the WITH RECURSIVE
syntax; instead you mentioned this compressed sparse
row data structure.
So for the listener, can you maybe explain what a compressed
sparse row actually is? You said it's an efficient representation of a graph.
So why is it efficient, and how is it different
to other representations of a graph?
Yeah, so a compressed sparse row is essentially two arrays. One array is the size of the number
of vertices in your graph; the other one is the size of the number of edges in your graph.
And at every index in the vertex array, this first array, you keep track of the index into the edge array where the corresponding
edges start for that vertex. And that makes it really easy to scan sequentially over this
edge array. Because we need to keep in mind that we're trying to make it performant, and you
kind of want to avoid any random accesses into memory; you want to have as
many sequential accesses as possible.
And that's difficult with a graph, because a graph's nodes and edges can be all over
the place. There's no sorting, there's no sorted graph per se, or at least
that would probably be quite expensive.
So this compressed sparse row is actually quite a common way to represent the graph and then
to work over it.
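A tiny worked example of the two arrays, in plain Python for illustration. Assume four vertices 0 to 3 and edges 0 to 1, 0 to 2, 1 to 2, and 3 to 0:

```python
# vertex_offsets has one entry per vertex plus a sentinel at the end;
# vertex_offsets[v] says where v's outgoing edges begin in edge_targets.
vertex_offsets = [0, 2, 3, 3, 4]   # length = number of vertices + 1
edge_targets   = [1, 2, 2, 0]      # length = number of edges

def neighbours(v: int) -> list[int]:
    # One contiguous slice of the edge array: a sequential scan,
    # no pointer chasing, which is exactly why CSR is cache-friendly.
    return edge_targets[vertex_offsets[v] : vertex_offsets[v + 1]]

assert neighbours(0) == [1, 2]
assert neighbours(2) == []          # vertex 2 has no outgoing edges
assert neighbours(3) == [0]
```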
Cool, yeah.
So you mentioned that you did this with scalar UDFs.
Can you tell us a little bit more about what they are in DuckDB and how you actually
mapped this onto them?
Yeah, so scalar user-defined functions take some rows of input and return the same
number of rows of output.
And the CSR is also created using user-defined functions; we have a
special query to get the right inputs.
And then we create this CSR, this compressed sparse row, in memory. There's some space in DuckDB called the client config where we can store the CSR in memory.
And then when we see the user defined function for the multi-source breadth-first search,
we simply point to this CSR that was just created.
We keep track of it with an ID.
Very, very simple actually.
Then we know which graph to iterate over basically.
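As a toy analogue of plugging logic in through scalar UDFs, DuckDB also lets you register one from Python. DuckPGQ's own UDFs are C++ functions operating on the shared CSR, so this is only a shape-of-the-API sketch:

```python
import duckdb
from duckdb.typing import BIGINT

con = duckdb.connect()

def plus_one(x: int) -> int:
    # Called per row; DuckDB feeds it values from the query.
    return x + 1

con.create_function("plus_one", plus_one, [BIGINT], BIGINT)
print(con.sql("SELECT plus_one(41) AS answer").fetchall())   # [(42,)]
```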
Nice.
Yeah.
I kind of wanted to touch on, because you mentioned it a second ago, that
using these CSRs and mapping them onto
scalar UDFs is really good because of the way
DuckDB internally processes things.
The fact that it's a vectorized engine, I guess this plays really nicely with that.
So could you maybe elaborate on what a vectorized query engine is, why this works
well, and how it's different to a non-vectorized engine?
Yeah.
Yeah.
So the vectorized engine was actually created here at CWI.
It was invented together with my current supervisor, Peter Boncz.
I think he was the first to research this, or to implement it.
And basically you handle a bunch of tuples, a chunk of, say, 2048 tuples, at a time, instead of one tuple at a time.
That saves on the overhead of function calls, and you can do, for example, SIMD operations over this.
SIMD is single instruction, multiple data: if you do the same operation repeatedly on some data, then it's easier.
Or sometimes you can use this optimization to just batch process this.
And yeah, this plays well with our implementation, because we also handle
batches of source-destination pairs at a time.
Actually, DuckDB has vectors of 2048
tuples, so that's one vector. But we go even smaller: we go to 256 source-destination pairs
at a time, because pathfinding is a really, really expensive operation compared
to a filter or a join, for example.
But we use the same technique of just handling a batch of source-destination
pairs at a time, and then you avoid this cost of function calls.
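A simplified sketch of the multi-source idea, in plain Python over the CSR arrays from earlier: one bitmask per vertex, where bit i means "already reached by source i", so a whole batch of sources shares each pass over the edge array. This is the textbook technique, not DuckPGQ's actual code:

```python
def multi_source_bfs(sources, vertex_offsets, edge_targets):
    n = len(vertex_offsets) - 1
    seen = [0] * n          # bitmask per vertex: which sources reached it
    frontier = [0] * n      # which sources reached it in the current level
    dist = {}               # (source_index, vertex) -> hop count
    for i, s in enumerate(sources):
        seen[s] |= 1 << i
        frontier[s] |= 1 << i
        dist[(i, s)] = 0
    hops = 0
    while any(frontier):
        hops += 1
        nxt = [0] * n
        for v in range(n):
            if frontier[v]:  # push v's frontier bits to its neighbours
                for w in edge_targets[vertex_offsets[v]:vertex_offsets[v + 1]]:
                    nxt[w] |= frontier[v] & ~seen[w]
        for w in range(n):
            if nxt[w]:       # record first arrival per newly delivered source bit
                seen[w] |= nxt[w]
                for i in range(len(sources)):
                    if nxt[w] >> i & 1:
                        dist[(i, w)] = hops
        frontier = nxt
    return dist

# Two sources explored in one shared traversal:
vertex_offsets = [0, 2, 3, 3, 4]
edge_targets   = [1, 2, 2, 0]
print(multi_source_bfs([0, 3], vertex_offsets, edge_targets))
```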
Nice.
Having the CSR structure in DuckDB also
opens up some other possible directions of work,
which we can probably touch on now while we're talking about CSRs.
We have it enabling the multi-source BFS,
but there are other things as well that you mention
in some of your work: I think it's called worst-case optimal joins, and factorization.
How do CSRs enable that?
I don't know what the state of the implementation is on that sort of stuff, how
close it is to being real yet.
And are there any other features of DuckDB that would naturally dovetail with
those things as well, essentially?
Yeah, so we've been working on this worst-case optimal join, which is a type of join.
We actually don't use the CSR in this case. But this year we had a master's student, Paul Gross,
and he worked on factorized query processing.
And this is an optimization you can do with many-to-many relationships.
So if you do joins on many-to-many relationships, you often get many duplicated
values, and this is very inefficient because you're basically repeating the same value
many times. What you can do is take out this duplicate value so it's only represented
once, together with all the combinations in which it was present. And we have a
CIDR paper, a conference in Amsterdam in January 2025, where we are going to present
this. But we did this by changing the way that
join is actually done in DuckDB. So it was quite a hefty operation, but
yeah.
Nice.
Cool.
Yeah.
I think there's also one more future thing that these CSR data structures allow.
Obviously DuckDB is really popular in a lot of data science use cases and
is used really heavily there,
and obviously in the machine learning space there are things like graph neural network libraries.
How do CSRs enable that, and open it up as a possible use case for DuckPGQ as well?
Yes, we had another project on this actually.
Because these graph neural network libraries, they often also use a CSR
under the hood to represent their neural network.
So DuckDB is an in-process system, meaning that it operates within the same memory
as, for example, your Python script.
And this allows for a nice optimization, so-called zero-copying.
Zero-copying means that you
don't actually need to copy any data; you just point to the memory
location without having to do any copy. And so we create this CSR on our own side.
And then from a neural network library such as PyTorch, you can then point to
the CSR directly and
skip the whole creation process.
So that is one way.
And then you can store your neural network in DuckDB if you want to.
But this was all very experimental.
I don't know, we're not sure if we're going to pursue this avenue of research, but
it was actually a fun exercise to try it out.
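The zero-copy idea itself is easy to demonstrate: torch.from_numpy shares the underlying buffer with a NumPy array, so CSR arrays can be handed to a tensor library without copying a byte. A sketch of the concept, not DuckPGQ's actual integration:

```python
import numpy as np
import torch

# CSR arrays, e.g. as exported from the database side.
crow = np.array([0, 2, 3, 3, 4], dtype=np.int64)  # vertex offsets
col  = np.array([1, 2, 2, 0], dtype=np.int64)     # edge targets
val  = np.ones(4, dtype=np.float32)               # edge values/weights

# from_numpy shares memory with the arrays above: zero bytes copied.
adj = torch.sparse_csr_tensor(
    torch.from_numpy(crow), torch.from_numpy(col), torch.from_numpy(val)
)
print(adj)
```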
Let's talk about some performance numbers then, some evaluation of DuckPGQ
in action. How did it perform? What have you compared it against?
How good is it at graph analytics?
Yeah, well, DuckPGQ really builds on the performance of DuckDB.
And DuckDB already performs pretty well,
actually very well, I would say.
So DuckPGQ thankfully also performs pretty well
on analytical queries.
We compared it with Neo4j,
which is the most popular graph database system out there.
And although Neo4j is oriented more towards transactional queries,
so really simple, short queries,
DuckPGQ was able to outperform it,
in some cases by 10 or 100x.
So that was very nice to see.
And we also compared it to Kuzu.
Kuzu is the system from the University of Waterloo.
I think they've now spun off into a separate startup company, but they are a
graph system that is very similar to DuckDB, also for analytics, also embedded.
And they take a lot of inspiration from DuckDB.
So no dependencies, et cetera.
And also with Kuzu, which is a bit more of a fair comparison, we outperform them on
most queries and there are some cases where they are faster, but I think on
most queries we outperform them.
And this is mostly for the pattern matching and then for pathfinding, which
is not easy to always compare, but there we also outperform Neo4j.
And I don't think we did comparisons with Kuzu on pathfinding, but that's something
I want to do in the future as well.
What makes it difficult to compare between systems on pathfinding, Daniel?
I think comparing in general between systems is really difficult.
That's very true, yeah. Because all these systems have different setups,
all these knobs you can tweak.
I got comments that the environments weren't directly comparable:
one was a Docker container and the other one was a Python environment,
or Python scripts, and even that arguably could have made a difference.
So in general, benchmarking different systems is really hard.
So also take these results with a grain of salt.
You should do your own benchmarking for your own use case and make your own decision.
We can do general benchmarking, but you know your own use case best, I think.
Exactly. I completely agree with you on that one.
Cool.
So I kind of wanted to ask you a more philosophical question about the
design process of adding all these features to DuckDB,
and get your thoughts on it.
How much did DuckDB's existing architecture actually influence your
design decisions?
Were you always trying to be sympathetic to the way
DuckDB is currently structured?
Yeah, well, DuckDB actually originates from our research group.
It started five years ago now, I think.
And it was quite a natural choice for us to work in DuckDB, because,
well, we know the developers; it was really easy to get help with
your implementation. And we're still in quite close contact. But I think their general
architecture also really suits graph analytics. So they have this columnar storage format: they
store columns together instead of rows of data. They have a vectorized execution engine, which is, I think, implemented by most state-of-the-art
analytical systems.
They have morsel-driven parallelism, which is a parallelism model that suits us nicely.
So yeah, it was quite a natural fit for us to work with DuckDB for our SQL/PGQ
implementation.
Yeah. Are there any alternatives out there that are even close to
allowing you to do what you've done in DuckDB? Obviously, like I said, it's designed
in such a way that it's meant to be extensible and has all these really cool,
modern features in there. Are there any other systems that are even comparable in any way?
I think Postgres, which is also one of the most popular open source database
systems, they also have this extensive extension framework.
I think even more extensive than DuckDB, but Postgres is also a lot more
mature, I think in that sense.
They would also allow you to do this, but I don't know if you would get the same performance out of
Postgres compared to DuckDB. That would actually be an interesting comparison, but I think Postgres
is one other system that would allow you to do this. For closed-source systems, it's already really
difficult to do. Yeah, it's difficult.
You actually don't have too many choices at the moment.
Right, yeah.
And, of course, also thinking about this and why you're working on it:
if you were going to have a wish list of things that you wish DuckDB had,
certain things in its framework or certain additional features,
what would that be? What would the wish list be? It is coming up to Christmas, right? So yeah.
That's good timing. I think I'm already very happy with all the things you can do with the
extension framework. I think the parser extension does need a bit of a facelift; that's quite a tricky part to extend.
And in general, I wish there was more internal documentation
on how to work with extensions.
We're thankful that we have the know-how in this group
to know how a database system works, how we can extend it.
We have the DuckDB people quite close by, but figuring things out isn't always easy, because
you really need to look at how other extensions have implemented certain features to actually
figure out how you should do it.
So that's the main thing where I see quite an improvement possible.
Overall, with your DuckDB experience, if we're going to review it and reflect on
the whole process, what would be your take on the overall experience of working
with the extension framework?
Well, yes, like I said, it's great to see where it's going.
The new addition of community extensions is really nice, because before, you had to either build the extension
from source, which let's be honest, not many people are going to do that, or you had to
opt into some unsigned flag and allow the extension to be downloaded from an S3 bucket,
which is also not very nice.
So this community extension is really a great addition.
Overall the experience has been, I think, yeah, just very good in general.
It's not always easy to work with, but once you get it working, it's actually very nice.
And people actually start to use the extension as well, which is nice to see and also motivating for me.
Yeah, before we get onto that and talk a little bit more about the actual impact:
while we're on this topic of your experience of using DuckDB and building on top of it,
is there any advice you'd give to would-be implementers and developers who are thinking of doing their own extensions to DuckDB? What would your advice be to them?
I'd say just try it out and see if your extension is possible, because there's a lot of things actually possible. And sometimes I see a new extension that surprises me.
For example, I saw the Google Sheets extension recently where you can take a Google Sheet and
directly read it into your DuckDB.
I thought that was a great example of a new extension that is coming to life.
I'd say, well, take a good look at how other extensions have implemented certain features:
scalar functions or table functions or optimizer rules, et cetera,
if you want to get that deep into the database internals.
But I think there's a lot more possible than people realize at the
moment with extensions.
So I'm really curious to see what people come up with.
I guess going back to the impact then.
So you said that it was really nice because you get to see
people use your product.
Obviously, because DuckDB is used by so many people,
it means that the extensions are, I guess, going to be used quite a lot as well.
A lot of people are going to be like, oh yeah, I'll plug this in and see how it goes.
So yeah, what has the impact been like for DuckPGQ, and have you had any feedback
from users?
Yeah, tell us about that experience.
Yeah, so I didn't expect many people to use DuckPGQ yet, but DuckDB
now releases download stats.
And I think it's been downloaded 3,000 times just in the last week,
which just blew me away.
Actually, I thought there were maybe a couple of researchers trying
this out and then, I don't know, leaving it behind again.
But to see it get adopted, or starting to get picked up,
is really nice to see, and motivating.
I get issues every now and then, which is also good
because people use this extension in ways
I haven't thought about.
So sometimes people ask me,
is this possible in the extension?
I'm like, I don't know, try it out.
And if it breaks, it breaks.
And if it works, it works.
It's just, it's in that state at the moment.
So I wouldn't trust it in every situation, but it works for most cases.
So I'm a PhD researcher at the moment, which means I should focus on
publishing papers and doing research.
But this DuckPGQ extension really allows me to also implement something that people can use.
So that's a bit of a balance I have to maintain: I have to do bug fixes
to make sure that the extension actually works, but I also have to work on my next paper.
So I try to balance this, but it's sometimes a bit difficult.
Hey, you've basically got two jobs, Daniel.
That's it.
I try to balance them, but...
But what you said there was really nice, because this is obviously one of
the main reasons why we're doing this podcast: DuckDB provides
such a nice platform for you to build on top of, and then has that
added benefit of such wide adoption that people can actually use it and put what you've made into practice.
Because a lot of the time, if you do research, you build something, you put it on GitHub,
no one's ever going to look at it, right? Whereas this way, you've got a way to actually
put your ideas out into the wild and people give you feedback. I mean, how many other
people's PhD projects are there that have been downloaded 3,000 times in the last week?
I imagine not many, right? And this is one of the really nice things about the way
DuckDB can be extended.
It enables this sort of thing to happen, which I think is a really nice
thing we have in this space.
Yeah.
Yeah.
And I think DuckDB, with this extension framework, also allows you to
quite easily implement your research idea and see if it works in a
real system.
And that's also something for researchers to realize: they can use the system
to do their research.
And if it's actually something that's feasible, that actually works, it
can become a community extension, or maybe even something that is integrated into DuckDB
itself.
Yeah.
And I guess to go at it the other way for a second:
has there been anything you've done in this project that's actually had
any influence on the core product itself? Obviously you're building on
top of it, but because you work closely with, and speak a
lot to, the core developers,
have any of your ideas ended up influencing the core product,
I guess, is the question I'm trying to get to.
Yeah, I think so. DuckDB Labs, the team behind DuckDB, now has a project
ongoing, or starting, to extend or rework the parser.
And I'm not going to say that's due to me, but they realized that it's actually not that
easy to extend the parser and that they should adopt a more modern style of parsing.
So that's one way.
I think also with the community extensions: they kind of realized that this unsigned flag that
you had to set before, downloading from an S3 bucket, that was not really nice.
So the community extensions are also, well, I don't want to
claim anything, but yeah.
Yeah.
There you go.
You've got two presents for Christmas already, right?
Perfect.
Nice.
So how long has this project been, from start to finish?
I guess the actual implementation of working on it and
trying to get to something where I can actually run a shortest path query:
how long was that, from day one to "yeah, this works"?
I think it's been quite a while, but there were master's projects before
we took it really seriously.
So I think about six months, but that was without the syntactic sugar of PGQ.
And then it took another six months or so to actually implement most of the PGQ syntax.
And there's a lot of bug fixing and doing research in between,
so I'm not spending 100% of my time on DuckPGQ development. But yeah, I would say
around one year.
Yeah.
Nice.
Yeah.
I guess, across that span of time, what's been the most interesting
thing you've encountered while working in the graph space and also
with DuckDB?
Have you been like, oh, that's really cool,
I didn't expect that?
What was the biggest surprise?
I was actually surprised at the ease of use of DuckDB, and that it hadn't been done in
that way before. People are starting to adopt it now, but even
two or three years ago,
it was really nice to use.
If you wanted to read a CSV file or a Parquet file, it was already just so simple
to set up, with no dependencies.
You could run it on any architecture, any machine, whatever; even very old architectures
were supported, I think.
So that was something that surprised me, that it hadn't been done in that way
before.
And then, well, now I'm surprised by the adoption that DuckPGQ is starting to get,
and that apparently there is a need to do graph work in your relational
system, because there is quite a lot of graph
data in relational systems.
It's just that SQL was not really nice to work with before PGQ.
So those are things that surprised me a bit.
Nice. And building on that, the long-term vision:
where do you see DuckPGQ in the next couple of years?
I hope to support more difficult features.
I want to improve performance on the pathfinding queries in particular.
That's something I'm working on right now with a research project.
I think most of the basics of the query language have been implemented now,
but the really difficult queries
are still something that I need to implement.
And I think just supporting more graph algorithms
would be nice, just so you can do all your graph analytics
within the DuckDB ecosystem,
so you don't have to switch to another library or another database system.
Just ease of use is something that I would like to improve in
DuckDB, or in DuckPGQ.
Yeah, for sure.
I mean, you mentioned it a second ago: the usability factor of
DuckDB is such a massive win for it.
And I guess if you can emulate that and make it, like you said, a one-stop shop, in the sense that I can do all my pattern matching, all my pathfinding, all my graph algorithms, all from the same terminal or whatever interface you want, that's great. Whereas if each algorithm requires a different
setup to something else, that's just friction; it makes development
a bit harder. So if you can reduce those barriers, then I can only see positive things,
and it's a good vision to aim for.
Yeah, I think usability is really underestimated,
especially in research. In research, we really like to optimize the last tenth of performance of a join algorithm,
or some new hash join implementation, which adds value.
But I think we sometimes forget that there's users that use the system.
And, well, they have to spend time setting stuff up, tweaking knobs,
whatever, just to get the
performance out of it.
And that's where DuckDB really shines: you set it up, it's easy to use.
And that's something that I also aim for with DuckPGQ: that it's easy to use and
doesn't require a lot of setup.
Yeah.
I mean, I think in the research world we're sort of optimizing
that last one-tenth, or the last, I don't know, microsecond of a join algorithm, or
something like that, right?
Because we're really passionate about the internals and that specific thing.
But if you take a step back, the vast majority of users actually using
these systems
just want things to work, right?
To work reliably, and to be easy to use.
Whether it goes that much faster, they're probably not that bothered in most cases, which I think is often an oversight
in research.
We're very focused.
Well, then again, the goalposts are different, right?
You get a paper accepted by saying it was faster, rather than that it was easier to use,
right?
Which is wrong, I think.
I think there should be some value placed on usability, but it's kind of a hard thing
to measure, I guess.
Yeah. Usability is really hard to measure, whereas seeing that your join is faster
is really easy to encapsulate in a figure.
Yeah.
Exactly, you can have a plot, right?
For researchers, publications matter quite a bit.
So yeah, this is a bit of a wider discussion on how we as a research field
should approach this sort of issue, but I think it's really important.
Yeah, I completely agree.
I guess bringing it back to the question of where you go next with the vision:
there are places already where you've gone past the standard, in the
sense of the SQL/PGQ standard, and added new keywords, specifically around
cheapest paths. So maybe we can briefly touch on that and how that was implemented in DuckDB,
and then we can wrap things up after that.
Yeah, so the PGQ standard didn't specify how to do weighted shortest paths. So we extended this with a keyword
where you can define what the edge weight is.
And then under the hood, it actually
works the same as unweighted shortest path.
But instead of breadth-first search,
we execute Bellman-Ford,
which is an algorithm that works on weighted graphs.
So in that sense, it's very similar in terms of execution.
But I think it's also a nice addition because, well, not every
graph is unweighted.
There are also weighted graphs out there,
and it's good to support those as well.
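For the curious, here is a minimal Bellman-Ford sketch over a weighted CSR, with a weights array parallel to edge_targets. It illustrates why the execution shape stays close to BFS while handling edge weights; this is illustrative, not DuckPGQ's implementation:

```python
import math

def bellman_ford(src, vertex_offsets, edge_targets, weights):
    n = len(vertex_offsets) - 1
    dist = [math.inf] * n
    dist[src] = 0.0
    # Relax every edge for up to n-1 rounds; unlike BFS's hop counting,
    # this handles arbitrary edge weights.
    for _ in range(n - 1):
        changed = False
        for v in range(n):
            if dist[v] == math.inf:
                continue
            for i in range(vertex_offsets[v], vertex_offsets[v + 1]):
                w, cost = edge_targets[i], weights[i]
                if dist[v] + cost < dist[w]:
                    dist[w] = dist[v] + cost
                    changed = True
        if not changed:
            break
    return dist

vertex_offsets = [0, 2, 3, 3, 4]
edge_targets   = [1, 2, 2, 0]
weights        = [5.0, 1.0, 2.0, 3.0]
print(bellman_ford(0, vertex_offsets, edge_targets, weights))  # [0.0, 5.0, 1.0, inf]
```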
Nice. Yeah. I've just had a random thought,
so I want to ask you this question.
It's about tuning databases, since configuring them can be a bit of a pain,
as we mentioned earlier on. Are there any knobs that need tuning to get the best
performance out of DuckPGQ? Because you mentioned there's this
space in memory that DuckDB has where you actually
materialize these data structures.
I think it was the data structures that go there.
Yeah.
Is that something the user has to be aware of, or is it all
abstracted away and it all happens magically for us?
There are options that you can tune, for example,
the number of pairs that's done in one batch. By default, this is 256,
but you could tune it to be less or more,
whatever you feel like. But you don't need to
do any tuning or
set any parameters or anything.
So in that sense, it's also just load it and go, basically.
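In other words, something like the following; note that the setting name below is hypothetical, invented purely for illustration (check the DuckPGQ documentation for the real option names). The point is that tuning is optional and the defaults work out of the box:

```python
# Hypothetical knob: batch size for source-destination pairs (default 256).
# The name 'pgq_batch_size' is made up for this sketch, not a real setting.
con.execute("SET pgq_batch_size = 512")
```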
Very nice.
Cool.
So time for the last word.
What's the one takeaway you want the listener to get from this podcast today?
So if you have graph data in your relational system, you don't need to
switch to a graph database system.
You can do it all from your relational system,
DuckDB.
It's really easy: install DuckPGQ from community, then
load DuckPGQ, and you can get going.
And I think that's something people should try out: give feedback if you
want, find some bugs, and yeah.
Very nice. Yeah, I can vouch for it; I've used it myself and it's very simple to get
going. So yeah, thank you very much, Daniel. We'll leave things there. Where
can people find you on social media?
Yeah. So I'm on X; I'm Daniël ten Wolde on LinkedIn; and I'm also on Bluesky nowadays, which is the new
alternative to X, I think.
People can reach out.
The extension, called DuckPGQ, is a community extension; there's a page on DuckDB's community extensions site.
I have my own Notion page with more documentation, and there's the GitHub for the DuckPGQ extension
as well, if people want to dig into the source code and see how I implemented the breadth-first search.
Nice, cool. We'll drop links to all that in the show notes. And if the listener wants to know more about the actual details,
there's the demo paper from VLDB and there's a CIDR paper as well from a year or two ago,
so you can go and read up some more about it.
But yeah, thank you very much, Daniel.
It's been a fantastic chat today.
I'm sure the listener will have really enjoyed it as well.
And you've all been told, go out there, play with DuckPGQ and
yeah, get some graphs in your life.
Thank you for having me.