Disseminate: The Computer Science Research Podcast - Daniël ten Wolde | DuckPGQ: A graph extension supporting SQL/PGQ

Episode Date: March 20, 2025

In this episode, we sit down with Daniël ten Wolde, a PhD researcher at CWI's Database Architectures Group, to explore DuckPGQ, an extension to DuckDB that brings powerful graph querying capabilities to relational databases. Daniël shares his journey into database research, the motivations behind DuckPGQ, and how it simplifies working with graph data. We also dive into the technical challenges of implementing SQL Property Graph Queries (SQL/PGQ) in DuckDB, discuss performance benchmarks, and explore the future of DuckPGQ in graph analytics and machine learning. Tune in to learn how this cutting-edge extension is bridging the gap between research and industry!

Links:
- DuckPGQ homepage
- Community extension
- Daniël's homepage

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Disseminate: the computer science research podcast, the DuckDB in Research series. Welcome to the DuckDB in Research series. This is a new show, and maybe some of our listeners aren't actually sure what DuckDB is, so let's start off there. DuckDB is an open-source, in-process SQL database designed for fast and efficient analytical query processing. It's really simple to install, deploys with zero external dependencies, it's extensible, integrates easily with all your data science tools, has APIs for things like R and Python, and can directly read file formats like Parquet and CSV, so it's jam-packed full of
Starting point is 00:00:40 features. And then you might ask: why are we doing this DuckDB in Research series? What's it all about? Well, Disseminate as a podcast is all about impact and helping to bridge the gap between research and industry, and DuckDB is the perfect example of these two communities working together, in the sense that ideas from decades of research are at its core, and it's now actually influencing the research agenda itself as a platform for others to build on. And that's kind
Starting point is 00:01:10 of a nice segue into today's episode. And I'd really like to welcome our guest, Daniel Ntem-Wold, who'll be telling us about Duck PGQ. So welcome to the show, Daniel. Thank you very much for an introduction. Cool. So before we get talking about Duck PGQ, let's find out a little bit more about yourself and yeah, how you ended up researching databases. PhD student at the CWI Database Architectures Group. And before that, I did a bachelor and master degree at the computer science also in Amsterdam. And it was quite a machine learning study, but was one course, large scale data engineering by Peter Bonks, who is now my daily supervisor.
Starting point is 00:01:54 And he had master topics to choose. And one of them was some projects with this database called DuckDB. And I wasn't familiar with DuckDB. I was only familiar with like the basic SQL and databases were kind of like a black box to me. But anyway, I chose this topic because I thought it was a fun, fun project. And then I learned how complex they actually are, these systems. And I got kind of sucked into them. And then I started to think about doing a PhD to continue on this topic.
Starting point is 00:02:25 The master thesis was on graph on multi-source pathfinding. And then I could continue in a PhD. And yeah, that's kind of how I rolled into it. Nice. So yeah, you peeked behind the curtain and that was that you were hooked. Yeah, I am. And yeah, I got kind of hooked by my daily supervisor at the time. Yeah, so we're here to talk about Duck PGQ today, which I've seen in way from as part of your PhD
Starting point is 00:02:49 And it is actually it's an extension to duck DB that people can go and download and use them I'm sure we'll get on to that at some point and let in the show But before we do that, how about we give the the listener the TLDR on duck PGQ? And yeah, tell us give us the elevator pitch for it. What is it? Yes. So edit score, DuckPGQ makes it easier to work with graph-like data from your relational database.
Starting point is 00:03:14 So you already have your data loaded in the database and then DuckPGQ can just make the graph pattern matching a bit easier and pathfinding a bit easier. And graph data, it's in quite a lot of places. So for example, social networks, ranking systems often have graph systems, road networks for example. But querying these in a relational system with the SQL that many people are familiar with can be quite challenging. So for instance, trying to find the shortest path between two nodes in standard SQL,
Starting point is 00:03:47 it's already quite a complex query that I don't think anybody can write. So previously the alternative when you have graph data would have been to switch to a dedicated graph database system and that often means that you have to learn a new query language, you have to learn a new query language. You have to manage separate setups. You have to deal with performance trade-offs.
Starting point is 00:04:11 You have to transfer your data, obviously. And DuckPGQ kind of changes this by bringing graph querying into your DuckDB. And we extend SQL with SQL property graph queries or PGQ, which is part of the new SQL standard. And this adds a sort of visual graph syntax that makes it easier to do pattern matching and pathfinding. And DuckPGQ is one of the first systems to support this new standard, but it's being picked up by other systems as well, such as Oracle and Spanner from Google. They recently announced this. So it's starting to become a bit more popular.
Starting point is 00:04:49 And we also have some built-in graph algorithms such as PageRank, just to make the graph analytics easier from your relational system. So I guess there's a few words and phrases you mentioned in that, which you probably go into a little bit more detail for the listeners so we can really set the background and the context for us going into the meat later on. So tell us a little bit more about, let's start off with the query language and SQL PGQ.
Starting point is 00:05:14 Give us a brief history on the language story there. So how did this come about and what's the space there looking like? Yeah, so I mentioned there's all these other graph database systems out there. And previously each of them had their own query language, they own their own implementation. So they might have a different syntax or different semantics, different definitions of what everything means. So it was kind of a tower of Babylon. It's some I sometimes compared to, but it's a mix of everything out there.
Starting point is 00:05:48 And then it was some committee came together to discuss a standard, a common query language that's most big vendors would agree on, and sometimes a bit reminded by the XKCD of, oh, there's 14 standards. Let's make a new one that ruled them all. And now we're 15. We have 15 standards. In a nutshell, you can say that's for SQL PGQ, but it's just a standard that's now part of the actual SQL standard. So SQL has a standard defined by the ISO. And yeah, you can use that
Starting point is 00:06:28 document. It's all written down in a document, all the definitions, the grammar rules, for example, they're defined. So you can implement directly from that. And that should make it easier to have the same implementation across different systems. So they would use the same query language, similar to like SQL. Well, almost every relational system uses SQL, maybe with some slight variations. But now you can also add PGQ to the SQL standard. Nice guy. You kind of mentioned as well a second ago, I think when you were
Starting point is 00:07:02 answering the last question, how I actually model, I have some tables and I want to sort of then base off top of that sort of kind of model a graph, right? How do I get my tables to be graphs essentially? So how would you, maybe you can give a rundown on like a simple example of like I have some data and I want to create some graph representation of that in the system. How would I actually do that? Yeah. So for example, if you have a social network,
Starting point is 00:07:27 I think that's the easiest example. Your persons, your people are represented in one table. We call this the vertex table. And these people can know each other. So that would be a second table called an edge table. And in this edge table, you have the source, the source ID and the destination ID. So the source and the destination of the edge.
Starting point is 00:07:49 And then in PDQ you define a property graph where you say this is my vertex table, this is my edge table. And for every edge table you also have to define the primary key, foreign key kind of relationship to the nodes. Like this is the source and this is the destination columns that you should use. And that's basically a layer on top of your existing tables. So you also, when you update your data, you update your tables. This is automatically also in your property graph synchronized. So you don't need to do any synchronization extra, but under the hoods, yeah, it's already done.
Starting point is 00:08:30 Nice, cool. So yeah, also as well, just to sort of, there's another thought of 10 that maybe the listener might not be familiar with, and you mentioned in this, and it was pattern, the idea of pattern matching. And the shortest path kind of, you can kind of reason about, so, okay, I want to go from here to there
Starting point is 00:08:41 in the shortest possible way. Yeah, what is pattern matching? How does that shake down? So pattern matching in SQLPGQ is done with, they call it a visual graph syntax or sometimes ASCII art. And it was actually inspired by, at the moment, I think the most popular query language, graph query language called Cypher from Neo4j. And in that you have notes which are represented with round brackets and
Starting point is 00:09:10 edges represented with the square brackets. And then you can have an arrow, for example, or you have a node to an edge, to another node, and that's just a pattern, but you can also have, for example, maybe there's multiple hops between the two nodes, but you don also have for example maybe there's multiple hops between the two nodes but you don't know how many then you can say for example a a clean star or a plus sign to say there are at least one hop between this but i don't know how many or you can give an upper limit or a lower limit if you want this. So that's one way to make the path finding easier in, in PGQ. You can also say you want any shortest path.
Starting point is 00:09:51 You can say, I want all shortest paths. I want, I want the top five shortest paths, for example, this is all possible in, in PGQ. So yeah. Cool. So we have this nice sort of like visual syntax to just kind of drawing these queries out almost. So yeah, so that sort of, so I guess kind of a really high level DuckPGQ is a combination
Starting point is 00:10:12 of two really high level things like SQLPGQ and then DuckDB. So we've kind of, we've boxed off SQLPGQ there. So let's talk about DuckDB some more. So yeah, so to kind of discuss the, I guess, the changes you've had to make to accommodate SQL PG in DuckDB, we should probably go on and kind of, I guess, discuss, discuss or probably talk about what the, what a query sort of looks like in like normal DuckDB without any sort of extension. So can you maybe give us a run through what are the kind of,
Starting point is 00:10:39 I guess, the query journey essentially, so we can have some groundwork so we can discuss what you, what you kind of then go and build on in your extension. Yeah, so the standard pipeline for a DuckDB query is that first you parse it, so you just check is the syntax correct? You don't check does this table exist? Is everything in the right place? So first you select from where, whatever.
Starting point is 00:11:03 That's the first stage, the parser. And then this generates an abstract syntax tree. So a tree out of nodes. And that gets a bound at the bind phase. And there you check, do the tables exist? Do the column exist? Are the correct function names used with the correct arguments, etc. And this sort of creates a logical query plan.
Starting point is 00:11:26 And this logical query plan exists out of like nodes, for example, a table scan is one node and then goes perhaps into a filter, maybe have a join somewhere. And then you can optimize this plan just to make it faster to execute. And then you have a physical plan that goes into the execution engine. And there you actually scan the table, there you actually execute the join. And then at the end you return the result to the user. So that's a, in a high level, just a pipeline of a standard duct to B query. Cool.
Starting point is 00:12:01 So parser abstract syntax into history abstract syntax tree. There we go. I can't speak. But then we do some binding logical plan. We try and optimize that when we have our physical plan, which is the operators we're going to use, then we actually send that over to the execution. Enjoy actually executes that query altogether results. I was all together.
Starting point is 00:12:18 Jobs are good. And so we've got the query journey there. How would you go about taking SQL PGGQ and mapping that to this pipeline? Essentially, what parts of it did you have to change of, of duck DB? And yeah, and what things did you actually need to change? I guess would be a good place to start. Yeah. So I wouldn't say change.
Starting point is 00:12:39 I would say extend duck DB while we extended almost every step of the way. So the main thing is the parser. We need to support this new query language. So for that, we have to extend DuckDB parser. And actually, we do a little bit of a dirty trick here. We actually forked DuckDB to get its parser. And then we implement our own PGQ parser in that. So our parser is a superset of everything that DuckDB can parse,
Starting point is 00:13:09 plus whatever the new syntax is of PGQ. So that is one step where we extended DuckDB. And then in the Bind phase, whenever we see this new syntax, this new PGQ syntax, there we use something called a bind replace, where we replace the match statements, the visual graph syntax. We replace it actually with a standard relational query plan. So we formulate the joins, we formulate the filters. So at the end of this bind replace, you have the same plan as you would have if you would
Starting point is 00:13:46 write the SQL query without this PGQ syntax. Because you need to remember that this PGQ is essentially a syntactic sugar over your standard SQL. You can already write graph queries in the SQL that many people are familiar with. It's just a little bit more cumbersome to write. So this bind replace replaces the match statement and transforms it into a standard relational query plan. That's another where we extend that to be.
Starting point is 00:14:16 And then for pathfinding, we do something special. For pathfinding, you can do it with called a recursive join, where you take the base results and then you recursively join over this results until you match some condition. But that's slow and it runs the risk of exploding results, which is not good. So there we actually use scalar user defined functions to first of all create a data structure, compressed sparse row data structure, which is an efficient way to represent your graph. And then we do multi-source breadth-first search over it. So for say 256 sources, we start to be a breadth-first search and they explore the graph simultaneously. So those are ways in which we extend DuckDB. And actually I'm working on some research that also adds optimized rules and a custom
Starting point is 00:15:18 operator, but that's, we'll go into that a bit later maybe, but yeah, we extended almost every step of the pipeline of a DuckDB query, of a SQL query in DuckDB. Nice, cool. So yeah, if I should just go back to the first thing that we're talking about, kind of introducing this new parser and the decision to actually fork DuckDB rather than kind of, I guess, go the other way. What was the primary reason for that decision to fork it?
Starting point is 00:15:44 Because I guess it becomes a maintainability question then as well, because as the duck DB parts are still evolving as well, you need to incorporate those changes. So yeah, what was the main reason behind that and how are you finding the maintainability aspects of it as well? Yeah. So one of the big reasons is that we want to. PGQ, SQL, PGQ is part of SQL.
Starting point is 00:16:03 So it can be that you have a match statement somewhere within your SQL query, but outside of that, you still have the normal SQL syntax. So you still need to be able to parse all of that. When we started the project, the extension framework of DuckDB was not as well defined as it is now. Like it was a lot more bare bones. So we figured it was easiest also for maintainability to fork it.
Starting point is 00:16:33 And then whenever an update comes in, we just merge whatever DuckDB changes they made into our fork. So it's actually really, really simple to stay up to date that way. And the alternative is to, I don't know, completely write your own SQL parser, but then you run the risk of not being able to parse everything that DuckDB can. And then, I don't know, the maintainability aspect becomes a lot more harder. Yeah. Yeah. You said that when you first started this, obviously the extension framework was a bit a little bit more bare bones than it is today.
Starting point is 00:17:06 Do you think that if it was as it is today that things might have been, different decisions might have been made or do you think it would have just made your life a lot easier? What's your take on that? So, I think I've rewrote the extension repository three times over because the framework changed every time. Right, yeah. Now that was, it would have been easier, yes. I don't know, we still would have forked the parser I think, because the parser is really difficult to work with in general. That's a bit of an old piece of code in DuckDB relative to the parts of duct to be. And there's plans to change this in the future, but that's that's future work.
Starting point is 00:17:49 Would we have done something different? I'm not sure. Maybe, maybe we wouldn't have used scalar use defined functions, but it's a bit hard to tell with hindsight, maybe yes, but I don't think we would have known. No. Okay. Cool. Yeah.
Starting point is 00:18:04 So then the next sort of thing you kind of, you, you, you opted to, I guess, not do is sort of map the, or to not use the with, with, with, with recursive sort of syntax and sort of, instead you mentioned this, this compressed sparse row data structure. So can you, can you actually, for the listen, maybe explain what a complex sparse row actually is, you know, it's an efficient representation of a graph. So what, why is it efficient or what, how is it different it's an efficient representation of a graph. So why is it efficient? How is it different to other representations of a graph?
Starting point is 00:18:29 Yeah, so a compressed sparse row is essentially two arrays. One array is the size of the number of vertices in your graph. The other one is the size of the amount of edges in your graph. And at every index in the vertex array, this first array, you keep track of the amount of edges in your graph. And at every index in the vertex array, this first array, you keep track of the index into the edge array where the corresponding edges start for that vertex. And that makes it really easy to scan sequentially over this edge array, because we need to keep in mind that we try to make it performance, and you kind of want to avoid any random accesses into the memory, you want to have as many sequential accesses as possible.
Starting point is 00:19:11 And it's difficult with a graph because a graph notes and edges can be all over the place, there's no sorting and there's no sorted graph per se, or at least that's probably quite expensive. So this compressed part row is actually quite a common way to represent the graph and then to work over this. Cool, yeah. So you mentioned that you did this with the scalar UDF. So can you tell us a little bit more about what they are in DuckDB and how you actually
Starting point is 00:19:38 mapped this to those? Yeah, so scalars use defined functions. So Scalar Use-Defined Functions, they have some rows of input and they return the same rows of output. And this CSR is kind of, this is also created using Use-Defined Functions and we have a special query to get the right inputs. And then we create this CSR, this compressive parser rule, in memory. So there's some space in DuckDB called the client config that we can store the CSR in memory. And then when we see the user defined function for the multi-source breadth-first search, we simply point to this CSR that was just created.
Starting point is 00:20:19 We keep track of it with an ID. Very, very simple actually. Then we know which graph to iterate over basically. Nice. Yeah. I kind of wanted to touch on, because you mentioned it, I think a second ago that the using or mapping the using these CSRs and mapping them to, you know, using scale UGS to represent them is, is really good because of the way sort of internally
Starting point is 00:20:46 duckDB process things. The fact that it's a vectorized engine, I guess this plays really nice with that. So could you maybe elaborate on what is a vectorized query engine and why this works well and why is it different to another type of engine, a non-vectorized engine? Yeah. Yeah. So vectorized engine gets, it's actually created here at CWI. It was invented together with my supervisor now, the Peter Boggs.
Starting point is 00:21:12 I think he was the first to, to, to research this or to implement this. And basically you handle a bunch of tuples, a chunk of say 2048 tuples at a time, instead of one tuple at a time. Because that saves on the overhead of function calls and you can do, for example, SIMD operations over this. So if you take SIMD operations as single instruction, multiple data. So if you do the same operation repeatedly on some data, then it's easier. Or sometimes you can use this optimization to just batch process this. And yeah, this plays well with, with our implementation because we also handle batches of source destination pairs at a time. Actually we go duct toDB has vectors of 2048
Starting point is 00:22:07 tuples, so that's one vector. But we go even smaller, we go to 256 source destination pairs at a time. Because pathfinding is a really, really, really expensive operation compared to a filter or a join, for example. But we use the same technique of just handling a batch of source destination pairs at a time, and then you avoid this cost of function calls and yeah. Nice. The CSR also sort of structure having that in there, having that induct DB will also opens up some other possible directions of work as well,
Starting point is 00:22:47 which we probably can touch on now while we're talking about CSRs. And we have that enables the multi-source BFS. But there are other things as well, like in these things I mentioned in some of your work, I think it's called worst case optimal joins and factorization. And how do how does CSRs enable that? I mean, I don't know how what the state of the implementation is on that sort of stuff, how close that is to being real yet, but how, how does that, how did that enable that? And then are there any other features of DuckDB that kind of would naturally dovetail with
Starting point is 00:23:16 those things as well, essentially? Yeah, so we've been working on this worstimum join, which is a type of join. We actually don't use the CSR in this case, but this year we had a master student, Paul Gross, and he worked on factorized with an FA, factorized query processing. And this is an optimization you can do with many to many relationships. So if you do joins on many to many relationships you get often many duplicate values or duplicated values and this is very inefficient because you're basically repeating the same value many times and what you can do is you take out this, this duplicate value only represented
Starting point is 00:24:06 once together with all the combinations in which it was present. And while we have a cider paper, this is a conference in Amsterdam in January, 2025, where we are going to present this. But we did this by, by changing the, the way that join is actually done in, in ducti B. So it was quite a, quite a hefty operation, but yeah. Nice. Cool. Yeah.
Starting point is 00:24:34 I think this is also one more future thing that these CSR data structures allow. And that's obviously ducti B is really popular with in a lot of data science use cases and it's used really heavily there. And obviously in the machine learning space and things like graph neural network libraries. How do they enable that, open it up as a possible use case for DuckPGQ as well? Yes, we had another project on this actually. Because these graph neural network libraries, they often also use a CSR under the hood to represent their neural network.
Starting point is 00:25:08 So DuckDB is an in-process system, meaning that it operates within the same memory as, as for example, your Python script. And this allows for a nice optimization that you can do so-called zero copying. Zero copying means that you don't actually need to copy any data, you just need to point to the memory location without having to do any copy. And so we created this CSR on our own. And then from a neural network library such as PyTorch, you can then point to the CSR directly and
Starting point is 00:25:46 skip the whole creation process. So that is one way. And then you can store your neural network in inductive if you want to. But this was all very experimental. I don't know. We're, we're not sure if we're going to pursue this, this avenue of research, but there was a, it was actually a fun exercise to try this out. Let's talk about some performance numbers then, some evaluation of DuckPGQ,
Starting point is 00:26:10 kind of its inaction. How did it perform? What have you compared it against? How good is it at Graph Analytics? Yeah, well, DuckPGQ really, really builds on the performance already of Duck2B. And Duck2B already performs pretty well, actually very well, I would say. So DuckBGQ thankfully also performs pretty well on analytical queries. We compared it with Neo4j,
Starting point is 00:26:38 which is the most popular graph database system out there. And although Neo4j is more towards transactional queries, so really simple, short queries, that PGQ was able to outperform it, in some cases by 10 or 100X. So that was very nice to see. And we also compared it to Kuzu. Kuzu is the system from the University of Waterloo.
Starting point is 00:27:05 I think they've now spun off into a separate startup company, but they are a graph system that is very similar to DuckDB, also for analytics, also embedded. And they take a lot of inspiration from DuckDB. So no dependencies, et cetera. And also with Kuzu, which is a bit more fair comparison, we outperform them on most queries and there are some cases where they are faster, but I think on most queries we outperform them. And this is mostly for the pattern matching and then for pathfinding, which
Starting point is 00:27:40 is not easy to always compare, but there we also outperform Neo4j. And I don't think we did comparisons with Kuzu on pathfinding, but that's something I want to do in the future as well. What makes you difficult to compare between systems with pathfinding, Daniel? I think comparing in general between systems is really difficult. That's very true. Yeah. Because all these systems have different setups, all these knobs you can tweak. I got comments that the environment wasn't directly,
Starting point is 00:28:13 like one was a Docker container and the other one was a Python environment or Python scripts, and even that arguably would have made a difference. So in general, benchmarking different systems is really hard. So also take this with a grain of salt, these results. You should do your own benchmarking right for your own use case and see and make your own decision. Yeah, I can do general benchmarking, but you know your own use case the best, I think. Exactly. I completely agree with you on that one. Cool.
Starting point is 00:28:46 So I kind of wanted to ask you more of a philosophical question about sort of the design process of adding all these features to DuckDB. And I kind of want to get your ideas. How much did DuckDB's actual existing architecture actually influence your design decisions? Were you always trying to be sympathetic to the current, like, way that DuckDB is structured? Yeah, I think, well, duct to be actually originates from our research group.
Starting point is 00:29:12 So it started five years ago now, I think. And it was also quite a natural selection for us to work in, in duct to be, because we, well, we know the developers, it was really easy to get, get help with your, with your implementation. And still we are in quite close contact. But I think their general architecture also really suits graph analytics. So they have this columnar storage format. So they store columns together instead of rows of data together. They have a vectorized execution engine, which is, I think, implemented by most state-of-the-art analytical systems. They have morsel-driven parallelism, which is a parallelism model that suits us nicely.
Starting point is 00:29:59 So yeah, it was just a quite natural fit for us to, to work with DuckDB for our SQL PGQ implementation. Yeah. Yeah. Are there any alternatives out there that are even close to being able to allow you to do what you've done in DuckDB? So obviously it's like I said, it's designed in such a way that it is meant to be extensible and has all these really cool, awesome modern features in there. Are there any other systems that are kind of even comparable in any way? I think Postgres, which is also one of the most popular open source database systems, they also have this extensive extension framework.
Starting point is 00:30:38 I think even more extensive than DuckDB, but Postgres is also a lot more mature, I think in that sense. They would also allow you to do this, but I don't know if you would get the same performance out of Postgres compared to DuckDB. That would actually be an interesting comparison, but I think Postgres is one other system that would allow you to do this. For closed source systems, it's really difficult already to do. Yeah, it's difficult. You don't have actually too many choices at the moment. Right, yeah.
Starting point is 00:31:08 And, of course, yeah, also as well, sort of thinking about this and why you're working on it. If you're going to have a wish list of things that you wish DuckDB had, or certain, I don't know, things in its framework or certain additional features you wish it had, what would that be? What would the wish list be? I was coming up to Christmas, right? So yeah. That's a good timing. I think I'm already very happy with all the things you can do with the extension framework. I think parser extension does need a bit of a facelift. That's quite the tricky part to extend. And in general, I wish there was more internal documentation on how to work with extensions.
Starting point is 00:31:53 We're thankful that we have the know-how in this group to know how a database system works, how we can extend it. We have the Duck2B people quite close by, but figuring things out isn't always easy because you really need to look at how other extensions have implemented certain features to actually figure out how you should do it. So that's the main thing where I see quite an improvement possible. Over all with your sort of your DuckDB experience, if we're going to review it and reflect on the whole process, what would you kind of, you take on the overall experience of working
Starting point is 00:32:31 with the extension module? Well, yes, like I said, it's great to see where it's going. The new edition of community extensions is really nice because before you had to either build the extension from source, which let's be honest, not many people are going to do that, or you had to opt into some unsigned flag and allow the extension to be downloaded from an S3 bucket, which is also not very nice. So this community extension is really a great addition. Overall the experience has been, I think, yeah, just very good in general.
Starting point is 00:33:15 It's not always easy to work with, but once you get it working, it's actually very nice. And people actually start to use the extension as well, which is nice to see and also motivating for me. Yeah, before we get onto that and talk a little bit more about the actual impact, do you have any, while we're on this sort of topic of your experience of using DuckDB and building on top of it, is there any advice you'd give to would-be implementers and would-be developers who are thinking of sort of doing their own extensions to DuckDB. What would your advice be to them? I'd just try it out and see if your extension is possible because there's a lot of things actually possible and sometimes I see an extension. For example, I saw the Google Sheets extension recently where you can take a Google Sheet and directly read it into your DuckDB.
Starting point is 00:34:05 I thought that was a great example of a new extension that is coming to life. I just, well, take a good look at how other extensions have implemented certain features, scalar functions or table functions or optimizer rules, et cetera, if you want to get that deep into your database internals. But I think there's a lot more possible than, than people realize at the moment with, with extensions. So I'm really curious to see what people come up with. I guess going back to the impact then.
Starting point is 00:34:36 So you said that it was really nice because you get to get to see people use your products. Obviously because DuckDB is used by so many people. It then obviously means that the extensions by kind of, I guess, are going to be used quite a lot as well. A lot of people are going to be like, oh yeah, I'll plug this in and see how it goes. So yeah, what has the impact been like for DuckPGQ and what has, have you had any feedback from users? Yeah, tell us about that experience.
Starting point is 00:35:00 Yeah, so I didn't expect many people to use PGQ yet, but Doug De Beard now releases stats. And I think it's been downloaded 3000 times just in the last week, which just blew me away. Actually, I thought there were a couple of researchers maybe trying this out and then, I don't know, leaving it behind again. But to see it get adopted is, or starting to get picked up, it is really nice to see and motivating.
Starting point is 00:35:31 I get issues every now and then, which is also good because people use this extension in ways I haven't thought about. So sometimes people ask me, is this possible in the extension? I'm like, I don't know, try it out. And if it breaks, it breaks. And if it works, it works.
Starting point is 00:35:48 It's just, it's in that state at the moment. So I wouldn't, I wouldn't trust it in every situation, but it works for most of the cases. So I'm a PhD researcher at the moment, which means I should focus on publishing papers and doing research. which means I should focus on publishing papers and doing research. But this DuckPGQ extension really allows me to also implement something that people can use. So that's a bit of a balance I have to maintain. I have to do bug fixes, so to make sure that the extension actually works. But I also have to work on my next paper. So I tried to balance this, but it's sometimes a bit difficult.
Starting point is 00:36:27 Hey, you've got two jobs, Daniel, basically. Now that's it. I tried to balance them, but. But what you said there was really nice because I mean, this is obviously one of the main reasons why we're doing this podcast is that because DuckDB provides such a nice platform for you to sort of build on top of it and then has that added benefit of, has such a wide adoption that people can actually use it and put what you've made into practice. Because a lot of the time, if you do research, you build something, you put it on GitHub,
Starting point is 00:36:51 no one's ever going to look at it, right? Whereas this way, you've got a way to actually put your ideas out into the wild and people that give you feedback. I mean, how many other people's PhD projects are there that have been downloaded 3000 times in the last week? I imagine not many, right? And this is about one of the really nice things of the way that like DuckDB can be extended. It enables this sort of thing to happen, which I think is a really nice thing we have in that, in, in, in this space. Yeah.
Starting point is 00:37:12 Yeah. And I think DuckDB also with this extension framework allows you to really quite simple, quite easily implement your research ID and see if it works in a real system. And, um, yeah, and that's also something for researchers to realize that they can use the system also to do their research. And if it's, if it's actually a, something that's feasible, that actually works, it can be a community extension or maybe even something that is integrated into Duck2B
Starting point is 00:37:42 itself. So it's a really nice platform and it's really easy to use actually. Yeah. And I guess kind of go about the other way on that for a second. Has there been anything kind of you've done in this project that's actually had any influence on the core product itself is obviously cause you're building on top of it, but have any of yours, cause you obviously work closely with, speak a lot to the kind of core developers.
Starting point is 00:38:03 Is there any of your ideas kind of ended up influencing the core products? I guess is the question I'm trying to get to. Yeah. Yeah. I think so, duct to be labs or the team behind that to be now as a project ongoing or starting to extend the parser or to rework the parser. And I'm not going to say it's, it's due to me, but they realized that it's actually not that easy to extend the parser and that they should adopt a more modern style of parsing.
Starting point is 00:38:32 So that's one way. I think also the community extensions, they've kind of realized that this unsigned flag that you had to set before and download from an S3 bucket. That was not really nice. So this community extension is also, I don't want to, yeah, I don't want to claim anything, but, uh, it is not nice to do this. So. Yeah. There you go.
Starting point is 00:38:58 You've got two presents for Christmas, right? Already is perfect. Solid, right? Guys. Nice. So how right guys. Nice. So how long has this project been from sort of start to finish? And so the actual, I guess the actual implementation of working with it and trying to get something that I can actually run a shortest path query on.
Starting point is 00:39:16 Like how long was that from day one to, yeah, this works. I think it's been quite a while, but it was, it's where master projects before. Like we took it actually serious. So I think about six months, but that was without the syntactic sugar of PGQ. And then I took another six months or so to, to actually implement most of the PGQ syntax. So and then there's a lot of bug fixing and, and doing research in between. So I'm not 100% spending the time on, on DuckPGQ development, but yeah, I would say around, around one year.
Starting point is 00:39:53 Yeah. Nice. Yeah. I guess, I guess across that sort of cumulation of time, what's been the most interesting thing you've kind of encountered while working, I guess in, in, in the graph space and also with DuckDB, have you been like, oh,, in the graph space and also with DuckDB. Oh, that's really cool. I didn't expect that.
Starting point is 00:40:08 Oh yeah. What was the biggest surprise? I was actually surprised at the ease of use with DuckDB and that it hadn't been done in that way before because using DuckDB is, people are starting to adopt it now, but it's already two or three years ago. It was really nice to use just with this. If you wanted to read a CSV file or read a parquet file, it was already just so simple to set up to no dependencies.
Starting point is 00:40:36 You could run it on any architecture, any machine, whatever, even very old architecture were supported, I think. So that was something that surprised me that it hadn't been done in that way before. And then I'm also surprised. Well, now I'm surprised by the adoption that DuckPGQ is starting to get. And that's apparently there is a, a need to do graph work in, in your relational system, because there is quite a lot of graph
Starting point is 00:41:06 data in relational systems. It's just that SQL is not really nice to work with before PGQ. So those are things that surprised me a bit. Nice, like a building on long-term vision. Where do you see DuckPGQ in the next couple of years? I hope supporting more difficult features. I want to improve performance on the pathfinding queries in particular. That's something I'm working on right now with a research project.
Starting point is 00:41:42 I don't know where I would like to see. I think most of the query language, the basic of the query language has been implemented now, but these really difficult queries are still something that I need to implement. But I think just supporting more graph algorithms would be nice, just so you can do all your graph analytics within the DuckDB ecosystem. So you don't have to switch to another library or another database system.
Starting point is 00:42:09 Just ease of use is something that's that I would like to improve in DuckDB or in DuckQG. Yeah, for sure. I mean, I think you mentioned it a second ago, the usability factor of DuckDB being such a massive kind of win factor for it. And I guess if you can kind of emulate that and make it so that, like you said, it'd be a one-stop shop in a sense that I can do all my pattern matching, all my pathfinding, all my graph algos, all from the same terminal or whatever interface you want to sort of do algorithms, it requires a different setup to something else, which obviously again, it's just friction. It makes sort of development a bit harder. So if you can sort of reduce those barriers there, then I can only see positive things
Starting point is 00:42:55 and it's a good vision to aim for. Yeah. I think usability is really underestimated, especially in research. In research, we really like to optimize the last 10th of a, of a join algorithm or some new hash join implementation, which, which adds value. But I think we sometimes forget that there's users that use the system. And well, they have to spend time setting stuff up, tweaking, tweaking knobs, whatever, just to get the performance out of that. And that's something where DuckDB really shines is that you set it up, it's easy to use.
Starting point is 00:43:32 And that's something that I also aim for with DuckBGQ is that it's easy to use and yeah, doesn't require a lot of setup. Yeah. I mean, it's kind of, I think, from the research world, we're sort of optimizing that last one-tenth of it, or a last, I don't know, microsecond of a join algorithm, or something like that, right? Because that's, we're really passionate about sort of the internals and that specific thing, but the vast, if you take a step back, the vast majority of users are actually using
Starting point is 00:43:57 these systems. They just want things to work, right? And it's work reliably, and it's easy to use. Whether it goes like that much faster, they're probably not that much bothered for most cases, which I think is often an oversight in the research. We're very focused. Well, then again, because I mean, the goalposts are different, right? Where you get a paper accepted by saying it was faster rather than it was easier to use,
Starting point is 00:44:17 right? Which is wrong, I think. I think there should be some value placed on usability, but it's kind of a hard thing to measure, I guess, probably. Yeah. Usability is really hard to measure and seeing that your join is, is faster. It's really easy to encapsulate in a, in a, in a figure. Yeah. Actually, I was going to have a plot, right?
Starting point is 00:44:36 For researchers, the publications matter quite a bit. So yeah, this is a bit of a wider topic or a wider discussion on how as a research field we should approach this sort of issue, but I think it's really important. Yeah, I completely agree. I guess bringing it back to this question, then it's where you go next in the vision. And there's this setting, places already where you've sort of gone past the standard in the sense of the SQL PGQ standard and added new keywords and specifically around cheapest paths. So maybe we can briefly touch on that and how that was implemented in DuckDB
Starting point is 00:45:12 and then we can wrap things up after that. Yeah, so the PGQ standards didn't specify how to do weighted shortest path. So we extended this with a keyword where you can define what is the edge weight. And then under the hood, it actually works the same as unweighted shortest path. But instead of breadth-first search, we implement or we execute Bellman-Ford, which is one that works on weighted graphs.
Starting point is 00:45:43 So in that sense, it's very similar in terms of execution. But I think it's also a nice addition because, well, not every every graph is unweighted. There's also weighted graphs out there. And yeah, it's good to support those as well. Nice. Yeah. I've just had a random thought. So I want to ask you this question. And it was about the tuning databases and configuring them can be a bit of a pain.
Starting point is 00:46:04 We mentioned earlier on, are there any knobs that need tuning to get the best performance out of DuckPGQ in the sense of, because you mentioned there's this sort of this space in memory that DuckDB has where we actually put, we sort of materialize these data structures. I think it was the data structures that go there. Yeah. Is that something the user has to be aware of or how much is, are we all abstracted away from that and it all happens magically for us?
Starting point is 00:46:27 I think I have options that you can tune, for example, the number of pairs that is done in one batch. So by default, this is 256, but you could tune it to, to be less or be more, whatever you, whatever you feel like, but you don't need to, you don't need to do any tuning or set any parameters or whatever. So that's in that sense also just load it and go basically. Very nice.
Starting point is 00:46:55 Cool. So time for the last word. What's the one takeaway you want the listener to get from this podcast today? So if you have graph data in your relational system, you don't need to switch to a graph database system. You can do it all from your relational system. DuckDB. It's really easy to install, install DuckPGQ from community and then
Starting point is 00:47:17 load DuckPGQ and you can get going. And I think that's something that people should try out, give feedback if they want, find some bugs and yeah. Very nice. Yeah. I could vouch for it. I've used it myself and it's very simple to get it going. Yeah. And so, yeah, thank you very much Daniel. We'll leave things there. Where can we find you on social media? Yeah. So I'm on X, Daniel Tenwalder on LinkedIn, also on Blue Sky nowadays, which is the new alternative for X, I think.
Starting point is 00:47:50 People can reach out. The extension called DuckPGQ is on our community. There's a page on the community extensions of DuckDB. I have my own Notion page with more documentation and there's the GitHub for DuckPGQ extension as well. If people want to dig into the source code, see how I implemented the breath first search. Nice cool. We'll drop links to all that in the show notes and obviously as well as some, if you want to, if the listener wants to know more about the actual details, then it's obviously there's the demo paper from VLDB and there's a side of paper as well from a year or two ago. We can go and read up some more about it.
Starting point is 00:48:26 But yeah, thank you very much, Daniel. It's been a fantastic chat today. I'm sure the listener will have really enjoyed it as well. And you've all been told, go out there, play with DuckPGQ and yeah, get some graphs in your life. Thank you for having me.
