Disseminate: The Computer Science Research Podcast - Semih Salihoğlu | Kùzu Graph Database Management System | #29

Episode Date: April 3, 2023

Summary: In this episode Semih Salihoğlu tells us about Kùzu, an in-process property graph database management system built for query speed and scalability. Listen to hear the vision for Kùzu and to learn more about Kùzu's factorized query processor!

Links:
Kùzu GitHub repo
CIDR paper
contact@kuzudb.com
Kùzu Slack
Kùzu Twitter
Kùzu Website (the blog posts Semih mentioned can be found here)
Semih's Homepage
Semih's Twitter

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Waudby. Today, I'm delighted to say I'm joined by Semih Salihoğlu, who will be talking about a new graph database management system, Kuzu. Semih is an associate professor at the University of Waterloo and is also the current holder of the David R. Cheriton Faculty Fellowship. Welcome to the show, Semih. Thank you very much. Thanks for having me.
Starting point is 00:00:47 Great stuff. Well, let's jump straight in then. So maybe you can tell us a little bit more about yourself and how you became interested in research and databases. Yeah, your story so far. I see. Good. Well, okay. So you should interrupt me because I could go on here. So I got introduced to data management back in undergrad when I took the introduction course. And it was given by, I went to Yale as an undergrad, and it was given by Avi Silberschatz. So Avi is very well known in the field, and he is actually the author of one of the main books. Well, the course was overall, I thought, very boring. Actually, we have a problem as a field to teach that course in an interesting way. I find it very challenging to myself to engage students because a lot of people know SQL
Starting point is 00:01:33 and sort of core database technology. But Avi had a way of explaining a lot of stories because he was very well integrated with the industry. So he had led the AT&T labs for a long time and he had very close connections at Google. So he would inspire us a lot with these major challenges in data management. So that's where it started.
Starting point is 00:01:54 But I did my senior thesis in networks. I was always a bit systems oriented, maybe a bit more theoretical bent, but picking topics from systems. Anyway, so I did have a three-year industry experience before starting my PhD. And when I applied to PhD, I applied as broadly as systems, not necessarily just data management. It was networks or data management. And when I went there, I had a few sort of rotations with people.
Starting point is 00:02:24 And believe it or not, I ended up in a very, very different field. I ended up in DNA computing, nothing to do with systems, nothing to do with data management whatsoever, but that lasted for about a year and a half. And towards the end of my second year, I came back to where I felt more comfortable aligned with Jennifer Widom, who was my PhD advisor.
Starting point is 00:02:50 And yeah, so that's how I sort of ended up doing a PhD in graph analytics, not really data management. It was the Hadoop MapReduce era. And systems like Sparks were coming out, and everybody was trying to do some distributed sort of data management, data processing. And I picked the graph vertical because I just read this Pregel paper back in the day. We could get into that if you're interested. And then, so that was basically my PhD. And then when I took my position at the University of Waterloo, I wanted to do
Starting point is 00:03:17 something a bit more in data management. And there was a wave around property graph data management with Neo4j and yeah i want to i want to get into the space and you know that's that's that's what i've been doing for the last seven eight years yeah the rest is history i guess yeah yeah the rest is history fantastic so yeah graph databases and so that's what we're going to be talking about today and so maybe before we start talking about kuzu maybe we could for the for the uninitiated maybe you could tell us a little bit more about graph databases, the background, what they are, what they're used for, and how they differ from relational databases.
Starting point is 00:03:51 Yeah. So I'll actually unashamedly put a plug for a blog post that I wrote that really articulates my opinions in this field in more detail than I'll probably be able to do right now because of time limits. But there's a blog post that I put, that I wrote on Kuzu's website called what every competent graph database management system should do. But I also go into the history of graph database management systems. So, so graph database management systems, you know,
Starting point is 00:04:21 the term might mean several different things. So nowadays a lot of the time people just think of systems like Neo4j, because that's essentially the current most popular model is what's called property graph database model. And Neo4j really kind of pioneered that and popularized this. But systems based on graph structured data as a data modeling choice. Graphs have existed even before relational systems. So the very first database management system in history was based on a data model called the network model. And network is just another term for a graph. So it was essentially a way you would represent your records and you would link them to each other,
Starting point is 00:05:04 just like you link nodes through each other in the current property graph database manager. So that's even in the early 60s, you know, before the field was emerging, there were systems we could call graph database management systems. So and every wave has essentially seen to some different, you know different levels of success some database management systems that have been adopting a graph-based model instead of the, at this point, the de facto tabular
Starting point is 00:05:34 model, or you could refer to them as relational model. So that came in the 60s. The first systems in relational systems were built in the 80s. And again, 80s, 90s, 2000s, there was always systems, database management systems built on a graph model like XML, RDF, now property graph database management, object-oriented systems.
Starting point is 00:05:55 So what should I say about them in the current era? So graph database management systems, they adopt a graph model where instead of representing your records as a set of tables, you represent them as a set of nodes and connections or relationships between your set of nodes. And in the property graph data model, you can also add some properties, key value properties on your nodes and edges. Ultimately, the query language comes, it's a high-level query language, just like SQL, and it's very much like SQL. So Cypher is the one that Kuzu implements,
Starting point is 00:06:29 Neo4j, it's Neo4j's language. You know, you try to follow the open Cypher, but, you know, other systems around that adopt the property graph model, they have high-level languages that are very much SQL-like,
Starting point is 00:06:41 and it's not very surprising because that's, in the end, the core principles of how data is managed at a database management system is always relational in the sense that there's only one way that humans know how to develop declarative languages, and that's really to compile them down to relational operators that process sets of tuples, output sets of tuples. So at their core, just like any other database management system ever built in history, they're all, in a sense, relational at that sense. But the data model in the query language has these graph-specific syntax or graph-specific features,
Starting point is 00:07:18 which makes it suitable for several classes of applications in particular. So like fraud and recommendations, whenever you want to do essentially some recursive computations, I think they're much better than relational systems to program because they have these special syntax where you can say, you know, I want to find connections between this record and that record, and you may have represent them as nodes. And that's very intuitive in a graph model. You're essentially looking for paths. So paths is, for example, a very first-class citizen notion, which doesn't exist in tables.
Starting point is 00:07:50 And you need to think of joint paths and whatnot, which is not very intuitive. So there's a mismatch there. So there's other classes, sets of features that they provide. And I go into all of these in my blog post. But overall, essentially, the two things to really remember is that it's an alternative data model with a SQL-like language with alternative syntax. And I actually think people should just know alternatives to data management beyond SQL,
Starting point is 00:08:18 just because there's never going to be a perfect fit database management system and every system you know every application has its more natural way of modeling it and asking queries and different systems are going to be more more suitable on that then so um kuzu give us the elevated pitch for kuzu and can and maybe tell us where the name comes from as well so kuzu the elevator pitch for kuzu is the kuzu is the aims to be the duck db is the DuckDB for graphs. And right now, for a lot of people, that means something very specific because DuckDB is very, very popular and successful system. It is a modern graph database management system built with state-of-the-art principles of how essentially general analytics-oriented or query-oriented, I should say, database management systems should be built.
Starting point is 00:09:08 And it follows the usability features of DuckDB. It's embeddable just like DuckDB and aims to connect very well with sort of a set of higher-level analytics libraries, specifically graph analytics libraries that users might use, aside from the database engine. So, you know, one way to think about this is that we'd like to be default storage of database management storage for people who would like to model their data as a set of node and relationship records. But they'll do maybe some high level, more advanced analytics, like maybe they'll want to build a graph machine learning model over those records. And they'll need to do lots of things before they could essentially so if you just need the graph database management system for your application, we want to offer a lot of value by being a state-of-the-art system in the sense that, you know,
Starting point is 00:10:11 we've done a lot of work on graph database management systems and query processing and scalability in particular. So all of those features are getting integrated into Kuzu. And that's also what ductv does like to be also for the tabular analytics integrates uh state-of-the-art techniques plus integrates very well with these higher level uh the tabular analytics libraries like numpy etc and we wanted we're kind of trying to trying to replicate that awesome so let's um go into the architecture a little bit more of kujo so he said it's it tries to bring in all of these state-of-the-art components and the latest research into the system. So maybe you can tell us about the high-level picture of it.
Starting point is 00:10:52 And maybe we'll walk through each component at a time. Sounds good. So it's the high-level sort of principles of how database management systems are implemented is very well known. Like every system pretty much goes through similar sets of phases when they take a text query and in the end execute the query with some executable code. So basically we have a parser binder, et cetera, to compile the text query into some in-memory representation.
Starting point is 00:11:25 We have an initial join optimizer that actually, we're a little different from some systems that our initial join optimizer takes the initial representation of the query and optimizes and gives the logical plan that optimizes the joins that will happen in the query. And then we have an optimizer that takes that logical plan and does a lot of standard optimization techniques like filter pushdowns, removing of projections, removing of unnecessary properties slash columns in relational terms, several normalizations, so on and so forth. And then that's mapped into a physical plan. And that physical plan is given to a task scheduler, which executes. then that's mapped into a physical plan. And that physical plan is given to a task scheduler, which executes. So that's the high level sort of journey as it's the journey
Starting point is 00:12:11 in Postgres, as it's the journey in I'm sure Neo4j or DuckDB. So that high level thing is pretty standard. So I should maybe just start talking about the query processor because that's where we have done most work. So Kuzu is essentially the successor of a previous system called GraphFlow DB. So GraphFlow started in 2017. And the motivation for GraphFlow was when I entered essentially this field of graph database management system, I had felt that back in my PhD, I'd made a big mistake by not understanding how the software that I was using was, that I was spending my years on doing a PhD, was used.
Starting point is 00:12:55 So I worked on these distributed MapReduce-like systems called, Pregle was the original system from Google. The version that I have was called GPS, and the open source clone of Pregle was something called Apache Giraffe. It's still out there, I think, but I doubt it has any users. So I did my PhD on that, but I never really sort of understood the user base and understood how big, how impactful work in that space could be. So in general, if you want your research to be impactful,
Starting point is 00:13:30 you need to bet that the software that you're working on, at least in data management, at least, we all work on some data management or some data processing software. You need to bet that that's going to be successful, that that's going to live. There'll be lots of applications using it. It'll be popular to some extent. And I not done any of that work. I just blindly went into it. And by the time I was graduating, it had become clear to me that really there were no users of giraffe or preg or GPS, those kinds of systems. I didn't want to
Starting point is 00:13:58 make that mistake. At the same time in 2015, 16, Mike Stonebraker won the Turing Award and had this Turing lecture where he talked about the importance of sort of talking to users throughout his career. And, you know, obviously he represents one style of research where he advocates that the true validator of good and bad ideas is the market, not reviews or academics. And so I did this user survey I was able to convince my PhD student Siddhartha to really lead this it was actually a lot of boring work to try to do if you're a systems oriented student and you come into my group and I get you to work nine months on on a survey is you know it's super boring but I mean he went through
Starting point is 00:14:44 it thankfully instead of coding you know, it's, it's super boring. But I mean, he went through it, thankfully, instead of coding, you know, we did this user survey where we asked 90 people, it was an online survey. There were 90 sort of participants, 89 I think participants online survey. Plus I think we talked to maybe 15 users to really understand what their challenges were. And I think a lot of the people work actually coming from Neo4j or other graph database management systems, although we wanted to do the survey a bit
Starting point is 00:15:11 more broader by talking to users of Draft and NetworkX and other sort of popular graph libraries as well. But I think many of the people were essentially coming from Neo4j because it had the biggest user base. And we asked them, what are your problems? And they really told us scalability is a big problem in the field. So that's how we positioned our research back in the day around 2017. Let's work on scalability, how to make essentially graph querying more scalable, graph storage more scalable. Amin really led the project.
Starting point is 00:15:41 Amin Medby is one of my PhD students who's graduating this year. He was the main essentially researcher on the project. I mean, Matt B is one of my PhD students who's graduating this year. He was the main, essentially, researcher on the system. And so a lot of those ideas actually are now getting into Kuzu. And a lot of that work was either in storage or query processing. So maybe in terms of those two, we have the most sort of state of, we have integrated most of the state of the art techniques that we know. So what are those? So in the query processor, Kuzu is a factorized query processor, which means that the intermediate results are represented not as flat tuples, but as unions of Cartesian products.
Starting point is 00:16:19 And I can get into this. It implements some of these novel techniques, joint techniques for many-to-many joints, for growing joints, which is a core problem in graph querying. And it then integrates these very well-known and adopted, very well-understood techniques, like it's vectorized in the sense that columnar DBMs, relational systems, use the term. So it means that instead of passing single tuple between operators in your query plans, vpass2048 tuples are being passed. And that's the term vectorization
Starting point is 00:16:55 in the context of query processing. So it's a vectorized query processor and it does Morsel-driven parallelism, which is a technique that came, I think, maybe introduced in Hyper, I'm not sure, but it was a Thomas Neumann paper. And.db adopted to readopted. It's a way of parallelizing
Starting point is 00:17:12 a single query plan instead of parallelizing multiple queries running in parallel. Basically, it's a very simple idea. It just means that if there is a pipeline, suppose you need to scan and filter something. You need to find all filter something, right?
Starting point is 00:17:26 So let's take something very simple. You need to find all your nodes that have an age greater than 50, and there's maybe, you know, a million nodes. Morpholization just means that you partition the amount of data from disk that needs to be scanned, and different threads are competing to get different morsels, essentially partitions of things, and they run different pipelines, and they coordinate that where the pipelines end. So that's the way we parallelize queries.
Starting point is 00:17:52 So those are essentially maybe the core sort of ingredients to the query processor. Storage is a columnar design, but it's graph optimized. So why is it culinary? The motivation for this, and this is essentially our position as a research group, from having talked to a lot of people who use graph database management systems, a lot of the applications that graph database management systems are very good at supporting,
Starting point is 00:18:19 where they add value, is in essentially query heavy workloads. So you're looking for something very complex that requires essentially maybe finding a path between one node and another node, or even like just going like two, three degrees of neighborhoods around nodes to see what they're connected to. Again, like classic examples are fraud. You know, is this a bank account? Suppose you get a financial transaction network
Starting point is 00:18:45 you want to represent as a set of nodes and edges. There are account nodes. There are transfers between accounts. And you just want to know, is this node connected to the suspicious node from take your country? The challenge is that the graphs by their nature, I mean, a lot of one core property of the data
Starting point is 00:19:03 that applications tend to use graph database for is that nodes are going to connect to many other nodes. So there's going to be many, many relationships. So if one of your accounts is transferring on average to 100 different accounts and they transfer to 100 different accounts, the amount of essentially part of the database that you need to explore and search just grows for each hop that you do, for each join that you do. And so that ends up requiring scanning and sort of processing large chunks of the database. So then that says it's read optimizer, these systems need to be read optimizer, query optimized, versus systems that might just want to be very good at sort of updates and some, you know, you're taking lots of transactions,
Starting point is 00:19:46 you want to update them. They're not really transactional. I mean, as any database management system, they need to support transactions because people will want atomicity and durability. They don't want to lose their data, but they're not supposed to offer a lot of value in that space. I think that's a mismatch.
Starting point is 00:19:59 So as a design choice, many of them try to optimize for this query speed. And so our storage is columnar because that's the core. Well, I mean, that's a very well understood sort of principle that columnar systems are much more scan friendly. They give you opportunities to scan less parts of the data if the queries ask for less parts of the data than row-oriented systems. So, which means that every property that we have on nodes, for example, an H property gets its own sort of set of disk blocks or file, if you will. So, it's columnar as a core principle. And in terms of storing the edges and edge properties, relationship properties, it's still a columnar system. We adopt a disk-based version of the
Starting point is 00:20:51 well-known CSR format. That's a very common format to represent sort of graphs both in memory and we have a disk-based version of it. So that's also columnar. I mean, it's actually called columnar sparse row format. So it's a columnar system. You can really think of CSR as a compression of two sorted columns. I won't get into it, but it's another columnar format. And we have one choice that we made,
Starting point is 00:21:21 which sometimes I question actually, but we made it because that's what we've done in GraphFlow too, is that we double indexed the edge properties. So which means that if you have a connection, suppose account A has sent money to account B, and there's a property on that edge, maybe the amount of money that was sent is 500, that we stored that 500 twice, one for one of the nodes,
Starting point is 00:21:44 the source, if you will, one for the other node. So that if, and that's a choice that we made. And I think it's especially good in a disk-based system. I mean, it increases your storage, but by a constant factor too. And you can't, I mean, you can't do less than that. And if you need to essentially read, scan the incoming properties of a specific node, and in general, you should always optimize for these very simple queries, because they're going to be always most common. Just like in any phenomenon in the world, there's going to be a
Starting point is 00:22:17 skew in terms of the complexity of the query and how frequently you're going to get it in an actual system. So for the very simple queries, pinpoint queries where you say, give me this node and all edges around it. And if you need to read incoming edges, then you want to locate them. Essentially, this gives us a way that they're all stored in the same page. So that's why we double index it.
Starting point is 00:22:40 We don't double index node properties. So node properties, you can, but you'd have to replicate hundreds of times if a node is being touched by hundreds of other nodes. So that's another thing. So we've got CSR format. We've got columnar format for node properties. CSR format for relationships and relationship properties.
Starting point is 00:22:58 We've got double indexing there. Just on that for a second. Why do you sometimes regret that decision? Well, because it's – why do I regret this? So there is alternative ways. I mean, I could get a bit technical, but there is alternative ways that we could have reduced the storage, I think. I don't want to say I regret it. I sometimes question it because we didn't very, very properly study how fast we could have been if we chose to have just like, and that's what's the alternative method. The alternative method is what a relational system would do. You'd have a relationship table, no particular order there, and everything is kind of stored. All of the edges are stored there and all the properties are stored once. But you'd have an index that essentially you need to have this index. You need to have for which node, what are its edges. Okay.
Starting point is 00:23:52 So that's not storing the property though. That's just storing the IDs of the neighbors. And maybe using that, you could just store the IDs of the edges as well. And then do a full scan with some smart way of filtering so you don't scan the whole tables. So that's an alternative way that you could store only once, edge properties only once. And you could still maybe do fine. Actually in GraphFlow, which was an in-memory system, we didn't double index. We actually did a single index, but it was in memory. So, you know, we'd have some random reads,
Starting point is 00:24:25 but because it's in memory, we didn't care too much. We never really tested that. So I was a bit more conservative when we said we're going to do a move to, not just me back in the day, my student who implemented GraphFlow storage also implemented the initial version of Kuzuz storage. We had this intuition that, you know what, let's just double index, let's forget about
Starting point is 00:24:48 this problem. But we never properly studied this. That's why I say, I sometimes still question it, that could a system, could we just store it once and still be okay? We never did that study. One for an interesting research paper in the future, I guess.
Starting point is 00:25:04 Yeah, maybe. Yeah. So that's storage. We never did that study, you know. One for an interesting research paper in the future, I guess. Yeah, maybe. Yeah. Yeah. So that's storage. So let me just tell you just very quickly about other things. So the Join Optimizer is a core sort of optimizer in every system where you spend a lot of time. It's a dynamic programming-based optimizer. And Thomas Neumann really has done the most work in that space.
Starting point is 00:25:25 So we try to get that. We have a buffer manager that's based on another, right now, University of Munich-based paper. VM Cache is the paper. So that area is led by Victor Leis. He was at a different university and moved to University of Munich where Thomas Neumann is. By the way, to the University of Munich where Thomas Neumann is. By the way,
Starting point is 00:25:46 for any of the listeners, Thomas Neumann is like the god in systems, in databases. That's why his name is coming up a lot. I mean, not only one god, one of the gods. And we do look at his work
Starting point is 00:25:59 and Victor is also at the University of Munich and he's done most work and the best of the work, I think, in buffer management. So we read that paper and try to integrate it into Kuzu. So we have a buffer manager based on VM cache. Let me not get into it. But, you know, people can easily find that.
Starting point is 00:26:18 And we have a transaction manager. So that's very, I think if one day a historian went and said, what was the simplest and most, simplest and correct transaction manager? I'm very confident it's correct, but it's also very simple. It's realizable kind of by design. I'm not super proud of it,
Starting point is 00:26:43 but we had to add these features in the simplest way possible. I'm not proud of the way we implemented it, but I'm proud of the way of how simple we were able to implement it. It's based on write-ahead logging. So, you know, writes go into different sort of pages first and they're logged. And during checkpointing, they're sort of copied over. And by design, you can only have a single writer in the system. So it's kind of optimized for bulk writes instead of sort of many concurrent writes and reads and writes can, reads and the write transaction,
Starting point is 00:27:12 single write transaction can work concurrently. So, so it is serializable in that sense. So maybe one day you're going to sort of make that more optimal if there's any demand, but it's, it's again these are choices that we made our engineering efforts are based on optimizing the reads and scalability
Starting point is 00:27:29 instead of transactions uh so yeah so transactions i mean in one sense it's it's it's a bit naive and simple it's correct i think uh it's you know relatively well tested but it's also we did a lot of simplifying sort of assumptions like that you only allow a single writer in the system and things like that yeah yeah cool so let's let's dig into a little bit more about the factorized query process because this was the focus of the side of paper right so we've been through what factorization is but maybe you can kind of touch on the design principles that you use when designing these new join operators and yeah i guess just tell us more about the ones that you've got in kuzu yeah so factorization is um again it's a compression technique but it's not a compression technique you use to essentially make your disk files smaller which is how compression
Starting point is 00:28:20 is usually used on systems you know you've got a file that stores maybe a property or a column of a table, and you want to make it short because there's lots of, you want to make it small because there's lots of repetitions and whatnot. Factorization is interesting because it's actually a technique that's designed specifically for query processing for when the queries contain many-to-many joins, meaning that they can grow.
Starting point is 00:28:43 So the core idea is very simple. So think of like the simplest star query. You've got a middle node B, you've got one outgoing edge to a node A, you've got another outgoing edge to C, and that's your query pattern. So this is essentially what you would write in the match clause. And you want to find essentially all incoming
Starting point is 00:29:02 and outgoing neighbors of a particular node B. And suppose you get a filter on B. And suppose that filter says, you know, Jack's essentially outgoing, essentially transactions and incoming transactions. Give me Jack's neighborhood, basically. So that's the simplest essentially star query you could get. Now, if you've got essentially 10,000 outgoing neighbors and 10,000 incoming neighbors, now if you represent this output in flat format, it's 10,000 times 10,000 outgoing neighbors and 10,000 incoming neighbors. Now, if you represent this output in flat format, it's 10,000 times 10,000.
Starting point is 00:29:29 So, you know, I think it's 100 million essentially tuples would be represented there. But there is actually a very, the neighborhood is actually much smaller. There's only 20,000 edges there. So factorization is a way of representing this output as a Cartesian product instead. So it's jacks, nodes, and data, Cartesian product with 1,000 outgoing neighbors, 10,000 outgoing neighbors, Cartesian product with 10,000 incoming neighbors.
Starting point is 00:29:53 So that Cartesian product is another way to represent exactly the same 100 million tuples. And it's essentially, you know, that's the way factorization works, that whenever possible, we try to essentially represent tuples, sets of tuples as Cartesian products in a much more compressed way and pass that between operators. So you asked about the design goals. So one thing was just be able to implement factorization. And we had a particular way of implementing this based on my PhDs in the means work in GraphFlow is really directly implemented in Kuzu.
Starting point is 00:30:34 So we knew we wanted to essentially represent intermediate relations, not a set of flat tuples, like maybe DuckDB does, or pretty much any of the other systems do, but in a factorized way. We also wanted to make sure we don't use index nested loop joins. So this is the type of operators that we in GraphFlow had or Neo4j has, for example, an expand node, which is a join operator that takes a node record and expands it to its neighbors, for example. So if you use that in an in-memory system, maybe that's fine.
Starting point is 00:31:07 But in a disk system, if that's the type of joins that you do, that is that there's an expand operator that takes a set of tuples, and for each one, it expands it. So that's what in database literature is the way index-nested loop join operators would work. So from one side, you read the tuples. From the other side, you go and scan and perform the join one tuple at a time, so for each join. So that's actually not going to be robust. And I don't say that's always bad. Actually, in the paper, there's many cases where that actually works the best in very, very, very basic cases, like, you know,
Starting point is 00:31:42 you want one node and you want its neighbors, expand type of join operators do well, but it's not robust when you are getting, you know, thousands, tens of thousands, hundreds of thousands, tuples and you need to do that. It's just degrades very quickly because it leads to, you know, again, you can't really localize neighborhoods. So it degrades to random IOs. So we didn't want to do that.
Starting point is 00:32:04 So we want everything to be based on hash joins, which is the common kind of wisdom nowadays for doing joins, at least equality joins in systems. And we also wanted to make sure we don't, when possible, behave like index-ness loop join. Index-ness loop join has this one advantage that if all I want is Jack's altcoin neighbors, you know, an expand operator that just goes and takes the, you know, maybe it was preceded
Starting point is 00:32:32 by a scan operator that gave Jack's node ID. And all it does is that, you know, just looks up in an index the IDs of the neighbors and then does the minimum amount of IO and maybe read only the properties of those nodes or edges. So that's actually, you know, good because if you got, you know, a 10 gigabyte file, you might just end up reading, I don't know, a few kilobytes of things, even though it's random IO,
Starting point is 00:32:55 you're just reading very few parts of the database. So in those cases, index.net.to.join has an advantage. Again, I don't think it's robust, but we also wanted to behave like that when the queries were very selective, when we needed to join a single node record or a few node records. So those are the three goals. Factorization, make it hash join based, but also try to avoid scanning essentially files. And the reason this is a problem is that hash join always reads entire files. So by default, there's a build phase and there's a probe phase. There's no mechanism to say only read me the neighbors of this one particular. So those were the three design goals.
Starting point is 00:33:37 And we have a way of implementing this in the system. Awesome. With these specialized join operators, yeah. Cool, yeah. So maybe you can tell us a little bit more about these specialized join operators. So these three are in the system awesome with these specialized joint operators yeah cool yeah so you can maybe you can tell us a little bit more about these specialized joint operators so there's three in the paper maybe you can just give us the intuition behind um how they work essentially yeah so the so let me just tell you the maybe the core uh sort of idea of how in a case where you are uh maybe just reading jack's neighbors how this would work and we would still be hash join based
Starting point is 00:34:05 but essentially achieve all the three goals. So it will be difficult to see how we get factorization because the query is too simple. So maybe let's just, we could expand it to see how factorization is achieved. But suppose the only query all we have to do is read Jack's neighbors. So what will happen is that just like in any hash join, there's a build phase, which creates a hash table. And then there's going to be a probe phase, which essentially scans the left side of the hash join operators, and then does probes into this hash table that was built. So we'll do something along the lines of you're going to read Jack's record and we will figure out Jack's node ID. So maybe it's like seven. Okay. And then we're going to have a hash table that puts seven and Jack there. Okay. Suppose you read only the name of
Starting point is 00:35:00 Jack. Maybe that's the primary key, but the internal node ID is seven. Now Now all we need to know in the hash join, the hash join is going to join that record with all of its neighbors. All we need at that point, once we create the hash table, we know that all that we need to scan are edges whose sources are seven. So we pass that, we do sideways, it's called sideways information passing. So we pass that information as a mask of node IDs to the left side, probe site. And that probe site is going to essentially use that scan, use that mask to know which parts of those CSRs to read. So it will only read seven outgoing neighbors and avoid reading the entire, essentially, this file. And so we'll only get a way of,
Starting point is 00:35:54 we have this mechanism so that only seven outgoing neighbors will be read and then they will be probed and joined with seven. So that's essentially the core idea of how we avoid scans is that we pass these node masks to know which parts of these either relationship tables or relationship sort of indices should be read or some node properties should be read. So that's the core principle in these different operators.
Starting point is 00:36:22 And, you know, so way, how do we get, so that doesn't explain how we get factorization. The way we get factorization is that, so in CSR, the edges are localized. So we don't essentially pass that as essentially a sequence of, suppose you had hundreds outgoing neighbors. We don't pass that one by one. We pass that as a sequence of hundreds, not with seven.
Starting point is 00:36:44 So we just represent this as seven times those 100 so that's the tuple that's being sent and we're going to probe on seven so so that seven and hundred outgoing neighbors will be joined as a single probe not hundred probes and with a single probe we're going to read the jack property which was in the hash table so that's the way essentially what's being passed is always these Cartesian products, unless some operator needs to essentially flatten and essentially break those Cartesian products. Okay, nice.
Starting point is 00:37:16 So I guess the kind of question that falls out of it is kind of how much better are these joint operators than on a graph workload? So maybe you could tell us a few kind of numbers, some of the experiments you ran. So we did several experiments, and I'm not a big fan of system-to-system comparisons because it's very difficult to get these, quote-unquote, competitor systems in their ideal performing sort of configurations.
Starting point is 00:37:39 Yeah. So we do try to compare, essentially, what an index-next-to nested loop join configuration of Kuzu would look like with this, you know, node mask based Kuzu. So, you know, one thing that we want to do is, so we're isolating, we're just trying to study the pros and cons of this passing of masks. Yeah. which is an overhead, basically. You know, that's not something that index-nested loop join or hash join does, versus the benefits we get from essentially getting rid of index-nested loop joins in the system. So that's just one meaningful comparison is that Kuzu's current version versus Kuzu index-nested loop join, which is essentially where we added an index-nested loop join operator to the system, where we do the joins using these expand-like operators.
Starting point is 00:38:26 And the general message of those experiments is that when nodes are very selective, this overhead obviously is not worth it. Its expand type of operators are going to be better performing at those very high extreme selectivity cases. A couple nodes are being passed, but it degrades very quickly compared to essentially this node mask based operators. So this will be tested both on some large benchmarks like LDBC plus some micro benchmarks, which are actually my favorite experiments of all systems, which are those micro-bank marks because you can really focus on a particular behavior of the system.
Starting point is 00:39:06 And then, you know, when the selectivity is very high, there's also an overhead, right? So when the selectivity is very high, you're not really getting much benefit from those passing those masks because those node masks are not selected. They're not helping you avoid scans.
Starting point is 00:39:20 So at those two extremes, you know, there is an overhead and you could do better, but overall it doesn't degrade the way index and sudo join operators degrade, or it's not way off like DuckDB is, for example, which doesn't have a node mask, just does hash joins by forcing the system to scan all these large files. So, you know, so in comparison, it's kind of like a trade-off that you do, you know. So at those extreme cases, there are systems that would outperform this type of processing
Starting point is 00:39:53 or there are techniques. But overall, it's like, I think, a reasonable trade-off to make that your system and your co-operating processing is more robust. So that's the type of, yeah, that's the type of sort of evaluation that we provided in the paper the paper yeah i reckon the listener goes and checks it out as well um but i guess on that so i guess it's possible to kind of have to support both types of operators in the system right and have some sort of like some sort of i don't know some sort of option that lets you choose specific
Starting point is 00:40:19 one based on the selectivity or whatever right so i mean you can get the best of both wells i agree i agree it's it's doable actually one you can get the best of both worlds. I agree. I agree. It's doable. Actually, one of the, there's two first authors of the paper, Xiang and Godong, they contributed the most to the paper. So, they're listed as first co-authors. They have been sort of arguing of when we should avoid semi-masks and this not, we call it semi-masks, but these node masks that I mentioned, you know, when should we just avoid this overhead completely?
Starting point is 00:40:51 Because it's like, if we kind of during runtime field, see that these node masks are not being selective, we might maybe avoid it and do something adaptive. Or, or if you're smart enough, your optimizer is smart enough to predict that node mask will not be selective, you can just remove it and replace some of those join operators with indexness loop join. I agree.
Starting point is 00:41:15 It's a question of like, it's always an engineering question of like, do you want to invest that time? Or are you happy with how your sort of query process is performing even with those overheads? So it's a matter of that. But I do agree that, you know, and our experiments make it visible that there is performance
Starting point is 00:41:32 that we put on the table and that our trade-off is there explicit. Awesome stuff. Yeah, I'm going to kind of roll the next two questions into one because I always ask kind of what are the limitations of whoever I'm interviewing, whatever they're working on. But obviously, Kuzu is a very young system. So I mean, a lot of the it's not feature complete, right? So I mean, there's a lot of road to run still. So I kind of want to ask you more like, where do you go next with Kuzu? Yeah, so I have several sort of theses that are being written on Kuzu. One thing that we don't do or we haven't sort of
Starting point is 00:42:06 invested much time in graph flow too there was no work on this i didn't have a student to take this topic is recursive queries so which is essentially actually an area that graph database is need to shine in certainly their query languages are much better for recursive queries because recursive is kind of a first class citizen historically too like even in those systems in 60s you know 80s 90s 2000s in all those graph-based systems like rdf object-oriented system even like the 60s system like ids recursive application query like applications that wanted to do recursive querying were always because of some of the killer apps so so there is a thesis being written on that. So we are essentially trying to implement some new query processing techniques
Starting point is 00:42:50 for like shortest paths or for variable length joins and things like that. A second thing that I'm very interested in, and I have a student who's going to start working on this and we have some ideas, is how to make property graph database management systems more competent in RDF style graph data. So that's another sort of popular data model that people broadly label as graph-based. And I do think, you know, the property graph model,
Starting point is 00:43:23 so this is like a position that I try to articulate. So I think out of all the graph-based systems, graph-based models, property graph model is the most generic one in which you could build the most sort of applications. It's more generally applicable, just like relational model as a model is very generically applicable so rdf i
Starting point is 00:43:46 don't think is is you know rd is a very specific application domain i'm not saying it's a small application domain but it's a gotta it's really suitable for a specific sort of application domain as a model but property graphs are not property graphs are just as relational model is very generic or is quite generic you could represent lots of application data. And actually, this is one of the other things that we mentioned in our survey paper back in 2017, that when we asked people what kind of applications they and some techniques to maybe do some owl-based reasoning and those things. So I don't think vice versa is possible. So I don't think you are going to manage your transactions with RDF systems or like financial transactions and frauds. I don't think that's a very good fit as a model, but I think property graph as a model is a good fit.
Starting point is 00:44:51 So there will be a thesis written on how to essentially make property graph-based systems, and specifically this will be done obviously on Kuzu, more essentially techniques to make the system more efficient in RDF style, knowledge management, and maybe some level of reasoning. Very related to that is a topic that's going to sound very boring to maybe to people, but I still don't know what's the right way to manage strings. We don't know this. There's actually no authoritative paper in the field on this. There's some authoritativeitative paper in the field on this.
Starting point is 00:45:30 There's some state-of-the-art compression techniques, but there's no authoritative paper that says, that makes an advice. If you assume this thing about your data set, here's maybe the properties of the strings, their length, et cetera. Here's the right way to sort of store it on disk. Here's the right way to store, like manage the overflows maybe of those strings or because that's a variable length data type. And here's the right way to represent it in memory. So we do these things where we kind of find and blindly copy from Umbra or DuckDB, these things, but, you know,
Starting point is 00:45:58 I don't think it's very well understood. I had this, have this dream of finding a student to actually just do this investigation for about two years to sort of do this experimental study of like, here is all the techniques that's in the field of how to store things in memory, how strings in memory, how to manage them in memory, how to process them in memory, how to store them on disk. I don't know if they'll get a chance to do that, but that's something I hope I can find a student to do on Kuzu as well. Yeah, it's one of those things on the face of it,
Starting point is 00:46:30 it sounds quite a hard sell, but I imagine that when you get into it, it's a fascinating, really interesting project once you've kind of got going with it. So hopefully a listener listening to this now thinks, okay, I'm going to go work on strings for three years. Yeah, maybe. Yeah, but I mean, so Hannes,
Starting point is 00:46:44 who is the CEO of Duck tv labs and one of the main engineers there also mentioned this actually in his keynote at cider this year where kuzu paper got published he said like he's very surprised that few people work on strings despite the fact that majority of the data bases are strings and like there needs to be one authoritative paper i think that makes this advice like this is how you should implement strings if this is your assumptions. And I want to see that chart of like whatever those assumptions are, you know, maybe some workload assumptions, maybe some, you know, data set assumptions. And then says, here's a good thing that we found that's robust and easy to implement and whatnot. Yeah, that would be a really good contribution. I guess, yeah, so the next sort of set of questions to have,
Starting point is 00:47:26 I tend to sort of ask them to everyone with various different flavors. But the first one is kind of how as a software engineer, data engineer, can I kind of leverage your work? I guess the answer to that is just kind of go and play with Kuzu, right, and try and integrate it into your workflow. I agree, yeah. See if you can kind of fit it in. So I guess the follow-up question to that would be like kind of go and play with Kuzu, right? And try and integrate it into your workflow so you can kind of fit it in. So I guess the follow-up question to that
Starting point is 00:47:47 would be like kind of longer term, what sort of impact do you think Kuzu can have? So let me also just maybe on the previous question, I wanted to say like, I just kind of focused on the research and things that are happening. There's also like, as you said, Kuzu is a young system.
Starting point is 00:48:05 So there's a lot of engineering that's happening to add a lot of features and we are trying, we're getting excited about sort of making Kuzu more usable in the graph learning space and trying to connect more with users. So we are adding some, some features specific to these graph learning libraries. So there's going to be all those integrations that are happening. So for that question, yeah, so I think playing around with Kuzu
Starting point is 00:48:30 is probably the right message for developers or general application developers. I mean, if you're a developer of graph database management systems, they'll look like you're in Neo4j. So, you know, for people, engineers of Neo4j, I think there's some core messages that we think some of these core
Starting point is 00:48:50 joint techniques should get into these systems. And there's a lot to benefit. For other application developers, I think, you know, there's a set of application domains where really relational model and SQL is not going to be a great fit, nor are the systems that implement SQL
Starting point is 00:49:08 are implementing the right sort of techniques to be very performant. Even like the most state-of-the-art techniques like Umbra is not going to be really, really well in many-to-many joins, for example, or very, very well in recursive joins. Like recursion is something that's always a second thought
Starting point is 00:49:22 in relational systems. It's a first-class in graph systems. It's a first class in graph systems. So, you know, if you've got applications that benefit from these set of features, and there's more in my blog post, I think I've listed about five, six of these application domains. I think just being sort of knowledgeable about this alternative database management systems is going to make them better engineers so because like trying to every retrofit every application to relational model is i think not a not a not a very good idea so people people should be educated more on on graph database management systems and these alternative database management uh systems their models their query languages uh so it's going to make them i think better engineers yeah for sure yeah definitely agree yeah next question, I think you touched on this a little bit earlier on
Starting point is 00:50:07 when you were talking about the journey and Graf Law and whatnot. But I guess on this journey you've been on, what was probably the most interesting thing that you've kind of learned and that was maybe quite unexpected as well? Yeah, good question. Actually, this was the question that I thought most about. I don't have a great answer to this. I asked Xiang and Godong,
Starting point is 00:50:25 the two authors, what were their sort of, what was the thing that they found most sort of unexpected or interesting? I guess like, I think we learned a little bit more about ourselves. So we haven't done this before. So like I am trying to do this in Canada. So,
Starting point is 00:50:41 you know, by following DuckTB or systems like Umbra or Hyper, you know, these are, this is a specific sort of research style that's quite popular in Europe. So both DuckTB and this Umbra hyper type of systems are being developed in academic settings in Europe. And I'm trying to, you know, my group is trying to replicate that in Canada with Kuzu. So, you know, we are doing a lot of engineering and less sort of paper writing than maybe normally researchers in academic settings do.
Starting point is 00:51:17 And we are doing this for the first time. So one thing that we are noticing about ourselves is that how this, this it's more a psychological thing that's we're that i'm finding interesting how often we get wrong uh how often our intuitions are wrong about what is important and what is not it just becomes very difficult to judge these things in absence of a large user base and like for example let me give you an example we did start the project and spent a lot of effort trying to implement this secondary, second storage sort of structures in the system to store arbitrary key value properties. And, you know, that just slowed us a lot. And that was a lot of code. And right
Starting point is 00:51:58 before we released back in November, we said, you know what, let's just see if anybody is actually interested in this or are people really structuring their graphs and we're going to add this back on. But I mean, so that was like when the project started, you know, maybe a year and a half, two years before that, it was just, that was like nobody ever questioned that we should not have started this. And the mistake that we do often, we're not saying that we're over-optimizing. We get too sort of optimistic about, you know, this is going to be really important, let's certainly do it, versus, and we kind of
Starting point is 00:52:32 trade that off with simplicity, and we kind of implement more than we should, and we throw away. So that's a mistake we consistently do. I think we're getting it better, but that's something that we started seeing more concretely by the amount of code that we're also throwing away from the system. So that's something to keep in mind. That's a very important lesson, I think, that if we were to do this again,
Starting point is 00:52:58 many of the features would just start much smaller and try to get it complete and optimized and evolved, just like text and writing papers, you know, you write something the best way to write, you know, as they say, 12 pages is write 24 pages and throw half out start simple, start simple, add more and optimize it, it's just going to evolve
Starting point is 00:53:17 over time don't overdo it is one thing we've realized ourselves, it's a mistake we made. But it's in the quote, right? Was it premature optimizations of all evil or something like that? Yeah, maybe sometimes. It's very real. Yeah, we've done several of those for sure.
Starting point is 00:53:35 Cool. I guess kind of while we're talking about the journey as well, the things were kind of along the way. Maybe this is probably an example of that. But you tried that and it failed, that you think might be interesting people to know about. Maybe this is a good example of that anyway. So I could tell you ideas we didn't take from GraphFlow
Starting point is 00:53:56 because we were able to write good papers on them, but I don't, in the end, advise people to implement them. So for example, we have this paper on A-plus indices. It's a way of bringing materialized views into graph database management systems. I mean, I think they're a fine idea, and, you know, we published this in ICDE, which is a top conference and whatnot. It's a good interpretation of how materialized views could be brought into a system as these indices that support essentially subsets of your edges, which are effectively results of
Starting point is 00:54:33 some queries you may have run on the edges. And we found some fine applications of it, but it's also not very practical, in the sense that it's a lot of effort. And that's when we sit down and say, okay, we've got a budget of this much engineering we can afford: should we spend it on A-plus indices or should we spend it elsewhere? It was clearly one of the things where we said, no, we don't have to do this; it's not really worth it. So in some sense, if you want to interpret it,
Starting point is 00:54:59 it's a failure in that sense, because it doesn't offer something very practical, I think, or the amount of effort is something we can't afford right now. So there are ideas like that. And there was a lot more work on factorization that we did that's in Amin's thesis that we are not adopting. Again, I find it's good to do these research projects, because without those courageous PhD students doing that research, you can't really
Starting point is 00:55:32 distill the practical ideas, I think, but I don't think we really got it right yet. If you look at the theory of factorization, there are excellent ideas there. The form of factorization that we do, and this is for interested students who might look at this, is called F-representations. There's another, even more compressed version that you could keep, called D-representations; it basically stands for factorized representations with denotations, the D comes from the denotation. And we don't have a way to adopt that yet. Again, Amin has something in his thesis, but I am not courageous enough to put it in, because Kuzu is trying to be a very practical system, so I'm very selective about what gets in and what doesn't. But I think it's good work that Amin has done.
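(To make the factorization idea concrete: below is a minimal, made-up sketch of the difference between a flat intermediate result and a factorized, F-representation-style one. It is only an illustration of the general idea, not Kuzu's actual implementation; the pattern, names, and sizes are invented.)

```python
# Toy illustration of the idea behind factorized (F-style) representations.
# Suppose we matched the pattern (follower)<-[:Follows]-(a:Person)-[:LivesIn]->(city)
# for a single person 'a'. The flat representation enumerates the Cartesian
# product of a's followers and a's cities; the factorized one keeps the two
# lists separate and only multiplies their sizes when needed.

from itertools import product

followers = ["b1", "b2", "b3"]          # matches of the first branch
cities = ["Waterloo", "Toronto"]        # matches of the second branch

# Flat representation: one tuple per combination -> 3 * 2 = 6 tuples.
flat = [("a", f, c) for f, c in product(followers, cities)]
print(len(flat))        # 6

# Factorized representation: a product of unions -> only 3 + 2 values stored.
factorized = ("a", followers, cities)
stored_values = len(factorized[1]) + len(factorized[2])
print(stored_values)    # 5

# Aggregates such as COUNT(*) can be computed directly on the factorized form,
# without ever enumerating the flat tuples.
count = len(factorized[1]) * len(factorized[2])
assert count == len(flat)
```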
Starting point is 00:56:22 But there needs to be more work until we actually feel confident that, okay, these representations can now be integrated into the system. Yeah. So I don't know if I answered your question, but there are tons of failures. That's in the very nature of research, I think: a lot of the ideas we just get wrong. We publish them, and then somebody else does something better, and then hopefully some core principles emerge so that we know how to do it in a more practical way. Yeah. Out of interest, how big is the team at the minute that's working on Kuzu? How many engineers? So there's, I think, five of us, including myself, and three master's students. So of the five, many of them are
Starting point is 00:57:11 doing mainly engineering. They're at the university, but because they already got their degrees, they're staying on as research associates or postdocs, so they're less worried about publications. They're just very inspired by trying to be like DuckDB and get a user base, so they're doing more engineering, although obviously we are going to continue writing papers. And then there's a couple of people who are new, whom I've started training recently, and they'll write some of the theses that I mentioned over the next two or three years. But however you count it, including me, maybe seven or eight. Some
Starting point is 00:57:55 are very junior and not doing any engineering, just research, and some are doing more research. Yeah. Is Kuzu kind of your whole vehicle for research, or do you have other non-Kuzu-related projects going as well? Yeah, so we recently published a paper. Actually, Chang, who is also an author on Kuzu, it was Chang's master's thesis, was doing a project on open government datasets. That was a research project that I'm actually very excited about as well. So this is an area I would say led by Renée Miller. I mean,
Starting point is 00:58:31 one of the main people in the space is Renée Miller, who has a long history of research in open government datasets. These are datasets like, you know, the UK has one, the US has one, data.gov, Canada has one, opencanada.ca.
Starting point is 00:58:47 These are ways governments are publishing lots of datasets about how the government runs, with the long-term goal of transparency and more democratic access to how governmental institutions are running, for journalists to keep an eye on inefficiencies and major problems in society. But these datasets are not very usable; it's not very easy to find them, and it's not very easy to integrate them. So I have a research project, which has unfortunately gotten a bit smaller
Starting point is 00:59:33 since Chang decided he'd rather work on Kuzu, to essentially make these datasets more usable. I'm actually writing a paper on that this week, so I have that going on. And I have a PhD student who's also graduating this year who has been doing work on differential dataflow and applications of differential dataflow to graph processing. For those of you who may not know what differential dataflow is, it is one of the technologies behind Materialize Inc.
Starting point is 01:00:09 It's a popular startup in the data space. They do essentially streaming SQL to materialize your queries, although I might be wrong, so check what they do. But they're basically using technologies called timely dataflow and differential dataflow, which Frank McSherry was very influential in popularizing. Differential dataflow is the most generic incremental view maintenance technique that I'm aware of. It can maintain arbitrary dataflows, and one of its best applications is on recursive dataflows and how to maintain the outputs of recursive flows. And it relates directly to graphs because a lot of graph analytics is recursive computation.
Starting point is 01:00:51 Like if you want to find shortest paths, or if you want to find connected components, those types of graph analytics are often represented as recursive queries. And if you represent that recursive computation as a dataflow and give it to differential dataflow, then as your input, your graph, is changing, it's going to maintain your connected components or your shortest paths, which is amazing technology. It's actually unique in the sense that I don't think there is another technology that's able to do this. So my PhD student Siddhartha has been working on building graph layers over differential dataflow.
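(A rough sketch of the kind of recursive computation being described here: connected components computed as a fixpoint loop. This is deliberately naive and non-incremental; differential dataflow's contribution is maintaining the output of a loop like this as edges are added or removed, rather than recomputing it from scratch. The edge data is made up.)

```python
# Connected components as a fixpoint (recursive) computation.
# Each node starts labeled by itself; labels are repeatedly lowered along
# edges until nothing changes, i.e. until a fixpoint is reached.

edges = {(1, 2), (2, 3), (4, 5)}

def connected_components(edges):
    nodes = {n for e in edges for n in e}
    label = {n: n for n in nodes}       # initial labels
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            for n in (u, v):
                if label[n] > low:      # propagate the smaller label
                    label[n] = low
                    changed = True
    return label

print(connected_components(edges))      # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```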
Starting point is 01:01:25 So I have that going on on the side, although again these have gotten smaller, because the students are leaving to either move on to other things or to move to Kuzu, and things like that. Yeah, great. So I've got three more questions, and the first one is one I really like to ask people; it's really interesting to see the divergent answers I get for it, and I'm sure you'll have a totally new perspective on it as well. It's: how do you go about approaching idea generation and then determining what to work on? I'd just love to know more about your creative process. I see. So it's a very difficult question to answer. Some of these are student-driven, obviously, at this point. And back in the day, back in my PhD, it was a bit advisor-driven. A lot of the very productive work that I did back in my PhD actually came from Jeff Ullman. Jeff is a Turing Award winner; I'm sure people have read Jeff's books.
Starting point is 01:02:31 I did a lot of work with Jeff during my PhD, some theoretical work, and those projects were really driven by Jeff and Foto Afrati back then; she was a collaborator of Jeff's. So back then it was driven by that: they would give me specific problems, which would be very well defined, and I'd work on them. That was the style. Now some of it is also reversed, where students bring ideas.
Starting point is 01:02:53 But the way I work nowadays is that I start with very, very naive, simple things. I make sure that the most naive versions of things get implemented. So I'll pose a problem like, how should shortest paths be implemented, which is an ongoing project right now? What can we do in that space? And the first thing that we're going to do is
Starting point is 01:03:19 implement this in the most basic ways. And frankly, most of my best ideas come when I implement those things, or I see my students implement those things, and I look at the queries. I find it difficult to come up with very good systems research ideas otherwise. I can come up with high-level directions,
Starting point is 01:03:39 but in the end, I often can't predict what the papers are going to be about beforehand. That comes only when we implement something and then see, okay, these are the challenges in the implementation. So this is what I highly advise all of my students right now, and it's the style that I've come to adopt in the last three or four years: it's extremely code-driven.
Starting point is 01:04:04 Everybody needs to write a lot of code. I think that's good for training as well. And everybody needs to work in the context of a system. One thing that I'm very against is publishing ideas that are not implemented in serious systems, because they're kind of destined, I think, to be out of touch with the reality of how things get integrated into a system, which means they're going to be dismissed. So everything needs to be implemented in a serious system. Kuzu at this point is becoming serious enough that if you do research on Kuzu, you
Starting point is 01:04:37 will get to interact with a lot of components of the system. Or you can work inside another system: for example, I'm talking to, not really advising, a PhD student at the University of Waterloo who works on Neo4j, and she does her work inside Neo4j, which is good. That's the style in which I think research should be done, and when you do that, you get good practical ideas. Now, that has one risk, which is that your ideas are also more humble, because you're highly limited by what can be done, versus if you were to feel freer and do something on the side in your own scratch code. So those papers are more easily rejected. I do get a lot of rejections; my students and I, a lot of the work that we do gets quite a bit of rejection,
Starting point is 01:05:26 unfortunately, because some of the ideas end up more like, here are some core engineering principles, and it's a bit hard to make them look fancy from a research perspective. But that's the style I work in, I think. Write the code and see what doesn't work, and then after enough failures, good ideas come. And they always come, by the way, so no need to worry about that; some idea that you can publish always comes. Awesome.
Starting point is 01:05:57 Yeah, that's a great answer to add to my collection of answers for that question. Great stuff. Cool. So just two more questions now. The first one is, what do you think is the biggest challenge facing us now in database research? Maybe we can focus this towards graph data management; there was a recent debate on graph databases published in The Register, and I'd like to get your take on that as well if possible. Actually, I didn't read those, because I find those things a bit distracting.
Starting point is 01:06:27 Some of those are more around personalities, I think, or around particular companies, and companies have certain energies around them. So I genuinely did not read them, although I know the general topic that was discussed, and I obviously have my position. Anybody, I think, who claims graph databases will take over relational systems
Starting point is 01:06:51 is way off in their predictions. I don't think that will happen. I mean, the relational market is humongous and very well established, and I don't think people are going to switch to graph technology for tons of applications. So that's not going to happen. But anybody who says graph databases are going away is also way off. I don't think they're
Starting point is 01:07:15 going anywhere, actually. As I have tried to say, this style, the one that Neo4j kind of brought, the data model and the Cypher-like query languages, is going to stay. And I think they're likely to be the most successful of every wave that has come so far. There's more buzz around them too, and I don't see them going away like XML databases did. I think they're here to stay; they've got a relatively healthy application base, and more can be done with graphs.
Starting point is 01:07:38 One reason that convinces me of this is that there's really no data model other than graphs and tables that is generic enough to represent application data, and there will always be appeal in choosing one over the other. Some of this is also just a sociological question: people simply prefer thinking of certain kinds of records as nodes versus tables. So they've done a good job; I think the property graph model is a good model, and I do see it staying around. I mean, that's also
Starting point is 01:08:21 way off. But obviously, I don't think graph databases are going to take over relational systems or something; that's also way off, and it doesn't demonstrate, I think, a good understanding of the core power of relational systems. But graph databases are here. They're going to get better. The market is going to grow healthily, I think, and there will always be many applications that use those systems. What is a challenge in databases?
Starting point is 01:08:54 So, I mean, I don't see myself as a visionary, frankly, so I can't foresee what kind of big problems people are going to work on. But one thing that I am kind of inspired about, and I'm just putting it out there so that maybe some PhD students who run into this podcast might get curious about it, is that there has been an ideal, going way back, that actually stemmed from AI, artificial intelligence, from its very early days historically: building information systems that can do deductions. That has its own niche communities, with things like automatic theorem proving, and it connected very well with the database community back in the day.
Starting point is 01:09:48 I don't think the communities talk as much now, but these logic-based systems like Datalog are essentially a part of first-order logic; Datalog really captures the relational calculus, but it's a very limited part of first-order logic. And the logic-based AI community, the symbolic AI community, has done a lot of work on finding algorithms that can do deductions. So let me give you an example. I wrote a blog post on this that will appear in SIGMOD Record, though for some reason
Starting point is 01:10:25 they said it would appear six months from now. But here's an example you'll find if you read a book on those core AI topics of knowledge representation and the application of first-order logic to knowledge representation and deduction. One example that they will give is: think of a database, a table of items,
Starting point is 01:10:51 some elements, maybe some balls, and they have colors. Maybe the colors are red or black. And then there are some nulls: there are some balls whose colors are not known. But there's a constraint, which you can express in first-order logic, and in some logic-based systems, and I think in database systems as well, that says every ball has to be red or black. Right? So far this just fits relational algebra.
Starting point is 01:11:19 And my question is: what is the maximum number of reds or blacks? So, out of all the buckets that you could have in terms of colors, what's the maximum count? Suppose, again, there are three records: one is red, one is black, one is null. But there is the constraint that colors are either red or black. Now, the answer is two, but no relational system can compute this, right? Because
Starting point is 01:11:46 no relational system is going to do this; you need some other algorithm to do this deduction, use the constraint, and be smart about it. You need to do a little bit of a proof there, right? You need to say, okay, there are two cases: either this null is red or this null is black. Let's do the query in each case. When it's red, you get two, because now there are two reds. If it's black, you get two as well. So the maximum has to be two, and that's the correct answer.
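(A small sketch of the case analysis described above. A plain GROUP BY cannot use the constraint, so it cannot conclude that the maximum is certainly two; enumerating the possible completions of the null can. The data mirrors the example in the conversation.)

```python
# The ball example as a case analysis. A plain relational GROUP BY cannot use
# the constraint "every ball is red or black", so it cannot conclude that the
# maximum group size is certainly 2. Enumerating the possible completions of
# the null (the "possible worlds" allowed by the constraint) does.

from collections import Counter

balls = ["red", "black", None]          # one ball's color is unknown
allowed_colors = ["red", "black"]       # the constraint

def max_group_size(colors):
    counts = Counter(c for c in colors if c is not None)
    return max(counts.values())

# What a relational engine can say, ignoring the constraint: max group size 1.
print(max_group_size(balls))            # 1

# Case analysis: in every world permitted by the constraint the answer is 2,
# so 2 is a certain answer, which is what a human deduces immediately.
answers = {
    max_group_size([c if c is not None else guess for c in balls])
    for guess in allowed_colors
}
print(answers)                          # {2}
```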
Starting point is 01:12:14 And obviously, as humans, we can answer this very quickly. But no information system, no database system right now will do this, because we're really restricted to a very small part of first-order logic. I think this is a good thing to think about nowadays. First of all, I don't even know how well motivated it is by current AI applications, but it's an ideal.
Starting point is 01:12:36 I think that's important to remember as people think about reasoning, about what we can do to extend AI applications with logic-based deductions, because people are talking about neuro-symbolic AI. And again, I don't understand the applications fully enough to know whether my gut feeling that this might be a good topic to look at is right. But at least as an ideal, it has existed for over 50 years, maybe: how can we build more intelligent database management systems
Starting point is 01:13:10 by extending the part of first-order logic that we capture? So if somebody does the work of seeing whether there are applications of this, whether more deduction could be good for certain applications, I think it's a good time to start. And maybe 10 years from now, those systems might be really useful,
Starting point is 01:13:28 because this logic-based AI community has done a lot of work and found a lot of algorithms, and it's the right time to think about how many of those can actually be put back into a database management system and made practical. So that's an area that I think is important. Then there are some classic areas, like, nowadays, heterogeneous computing, GPUs, DPUs: how do we bring those kinds of hardware into database techniques?
Starting point is 01:13:54 Those are clearly very timely things to do, I think, and to write PhD theses on too. So, yeah, those two immediately come to my mind. If I were to be a PhD student now, I'd explore those types of things, and, as usual, as in every decade, it's very exciting; there are tons of things to work on. Yeah, well, there you go, listener, you've been told: those are the things you need to go work on. I see, yeah, maybe. It's at least good to explore, I think. Yeah, definitely really interesting directions. Cool, so it's time for the last word now.
Starting point is 01:14:29 What's the one thing you want the listener to take away from your work on Kuzu and from this podcast today? So I'd like them, first of all, to understand that Kuzu is being developed seriously.
Starting point is 01:14:46 I'd like to see more people use Kuzu. We've got our first set of users who are trying and playing around with the system, which is exciting to see. It is a system that aims to be highly scalable and fast on graph queries, these join-heavy queries where you need to find connections between nodes. And it is adopting the usability style of DuckDB, so you can think of it in one sentence as DuckDB for graphs. So that's one thing to take away.
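(For a flavor of the "DuckDB for graphs", in-process usage style, here is a rough sketch using Kuzu's Python API. The exact function names and Cypher/DDL syntax are assumptions based on the API as documented around the time of this episode and may differ between versions, so check the Kuzu docs before relying on them.)

```python
# A rough sketch of the in-process, "DuckDB for graphs" usage style.
# The API calls (kuzu.Database, kuzu.Connection, conn.execute) and the exact
# Cypher/DDL syntax are assumptions -- verify against the current Kuzu docs.

import kuzu

db = kuzu.Database("./demo_db")          # an embedded, on-disk database
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE REL TABLE Follows(FROM Person TO Person)")

conn.execute("CREATE (:Person {name: 'Alice'})")
conn.execute("CREATE (:Person {name: 'Bob'})")
conn.execute(
    "MATCH (a:Person), (b:Person) WHERE a.name = 'Alice' AND b.name = 'Bob' "
    "CREATE (a)-[:Follows]->(b)"
)

# A join-heavy graph query: who follows whom.
result = conn.execute("MATCH (a:Person)-[:Follows]->(b:Person) RETURN a.name, b.name")
while result.has_next():
    print(result.get_next())             # e.g. ['Alice', 'Bob']
```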
Starting point is 01:15:23 And, you know, that's about it. Did you want me to comment on something else? That's if you wanted to. And the other thing is, I do think educating yourself a bit more on other data models is going to help you. Knowing more about RDF, knowing more about property graph database management in general, is healthy. And for anybody who's teaching courses on database management systems, I do this
Starting point is 01:15:51 all the time: spend a couple of lectures on these things. I think we need to educate people more on these alternative data models and the query languages beyond SQL, like Cypher and SPARQL. Those languages are easy to pick up, and it's good to know them because you want to be able to use your skills on those technologies easily. I think it's going to make people better engineers. So that's another thing that I want to advocate very, very strongly for. Fantastic. Well, that's a great thing to end it on.
Starting point is 01:16:25 Yeah, we'll end it there. Thanks so much, Semih; it's been an absolute pleasure talking to you. Thank you so much for coming on the show. If the listener is interested in knowing more about all of Semih's work and Kuzu, we'll put all the links to everything and all the blog posts in the show notes. And if you enjoy the show, please consider supporting the podcast through Buy Me a Coffee, and we will see you all next time for some more awesome computer science research.
