Disseminate: The Computer Science Research Podcast - Semih Salihoğlu | Kùzu Graph Database Management System | #29
Episode Date: April 3, 2023

Summary: In this episode Semih Salihoğlu tells us about Kùzu, an in-process property graph database management system built for query speed and scalability. Listen to hear the vision for Kùzu and to learn more about Kùzu's factorized query processor!

Links:
- Kùzu GitHub repo
- CIDR paper
- contact@kuzudb.com
- Kùzu Slack
- Kùzu Twitter
- Kùzu Website (the blog posts Semih mentioned can be found here)
- Semih's Homepage
- Semih's Twitter

Hosted on Acast. See acast.com/privacy for more information.
 Transcript
                                         Hello and welcome to Disseminate, the computer science research podcast.
                                         
                                         I'm your host, Jack Wardby.
                                         
                                         Today, I'm delighted to say I'm joined by Semih Salihoğlu,
                                         
                                         who will be talking about a new graph database management system, Kuzu.
                                         
                                         Semih is an associate professor at the University of Waterloo

                                         and is also the current holder of the David R. Cheriton Faculty Fellowship.
                                         
                                         Welcome to the show, Semih.
                                         
                                         Thank you very much. Thanks for having me.
                                         
    
                                         Great stuff. Well, let's jump straight in then. So maybe you can tell us a little bit more about
                                         
                                         yourself and how you became interested in research and databases. Yeah, your story so far.
                                         
                                         I see. Good. Well, okay. So you should interrupt me because I could go on here. So I got introduced to data management back in undergrad when I took the introduction course.
                                         
                                         And it was given by, I went to Yale as an undergrad, and it was given by Avi Silberschatz.
                                         
                                         So Avi is very well known in the field, and he is actually the author of one of the main books.
                                         
                                         Well, the course was overall, I thought, very boring.
                                         
                                         Actually, we have a problem as a field teaching that course in an interesting way.

                                         I find it very challenging myself to engage students, because a lot of people already know SQL
                                         
    
                                         and sort of core database technology.
                                         
                                         But Avi had a way of telling a lot of stories, because he was very well integrated with the
                                         
                                         industry.
                                         
                                         So he had led AT&T Labs for a long time
                                         
                                         and he had very close connections at Google.
                                         
                                         So he would inspire us a lot
                                         
                                         with these major challenges in data management.
                                         
                                         So that's where it started.
                                         
    
                                         But I did my senior thesis in networks.
                                         
                                         I was always a bit systems oriented,
                                         
                                         maybe a bit more theoretical bent,
                                         
                                         but picking topics from systems.
                                         
                                         Anyway, so I did have a three-year industry experience before starting my PhD.
                                         
                                         And when I applied to PhD, I applied as broadly as systems, not necessarily just data management.
                                         
                                         It was networks or data management.
                                         
                                         And when I went there, I had a few sort of rotations with people.
                                         
    
                                         And believe it or not,
                                         
                                         I ended up in a very, very different field.
                                         
                                         I ended up in DNA computing, nothing to do with systems,
                                         
                                         nothing to do with data management whatsoever,
                                         
                                         but that lasted for about a year and a half.
                                         
                                         And towards the end of my second year,
                                         
                                         I came back to where I felt more comfortable aligned with Jennifer Widom,
                                         
                                         who was my PhD advisor.
                                         
    
                                         And yeah, so that's how I sort of ended up doing a PhD in graph analytics,
                                         
                                         not really data management.
                                         
                                         It was the Hadoop MapReduce era.
                                         
                                         And systems like Spark were coming out,
                                         
                                         and everybody was trying to do some distributed sort of data management, data processing.
                                         
                                         And I picked the graph vertical because I just read this
                                         
                                         Pregel paper back in the day. We could get into that if you're interested. And then, so that was
                                         
                                         basically my PhD. And then when I took my position at the University of Waterloo, I wanted to do
                                         
    
                                         something a bit more in data management. And there was a wave around property graph data
                                         
                                         management with Neo4j, and I wanted to get into that space.

                                         And, you know, that's what I've been doing for the last seven or eight years. The rest is history, I guess.

                                         Yeah, the rest is history. Fantastic. So, graph databases, that's what we're going to be talking about today.

                                         And maybe before we start talking about Kuzu, for the uninitiated, maybe you could tell us a

                                         little bit more about graph databases, the background,

                                         what they are, what they're used for, and how they differ from relational databases.
                                         
    
                                         Yeah.
                                         
                                         So I'll actually unashamedly put a plug for a blog post that I wrote
                                         
                                         that really articulates my opinions in this field in more detail
                                         
                                         than I'll probably be able to do right now because of time limits. But there's a blog post

                                         that I wrote on Kuzu's website called "What Every Competent Graph Database

                                         Management System Should Do",

                                         where I also go into the history of graph database management systems.
                                         
                                         So, so graph database management systems, you know,
                                         
    
                                         the term might mean several different things.
                                         
                                         So nowadays a lot of the time people just think of systems like Neo4j, because the current

                                         most popular model is what's called the property graph database model, and Neo4j really kind of

                                         pioneered and popularized it. But systems based on graph-structured data as a data modeling choice

                                         have existed even before relational systems.
                                         
                                         So the very first database management system in history was based on a data model called the network model.
                                         
                                         And network is just another term for a graph.
                                         
                                         So it was essentially a way you would represent your records and you would link them to each other,
                                         
    
                                         just like you link nodes to each other in current property graph database management systems.
                                         
                                         So that's even in the early 60s, you know, when the field was just emerging;
                                         
                                         there were systems we could call graph database management systems.
                                         
                                         And every wave has essentially seen, with different levels of success, some database management

                                         systems that have been adopting

                                         a graph-based model instead

                                         of what is, at this

                                         point, the de facto tabular

                                         model, or you could refer

                                         to it as the relational model.
                                         
                                         So that came in the 60s. The first

                                         relational systems were built in the 80s.
                                         
                                         And again, 80s, 90s,
                                         
                                         2000s, there were always systems,

                                         database management systems, built on a graph model: XML, RDF,

                                         object-oriented systems, and now property graph database management systems.
                                         
    
                                         So what should I say about them in the current era?
                                         
                                         So graph database management systems, they adopt a graph model
                                         
                                         where instead of representing your records as a set of tables, you represent them as a set of nodes and connections or relationships between your set of nodes.
                                         
                                         And in the property graph data model, you can also add some properties, key value properties on your nodes and edges.
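As a minimal illustration of the property graph model just described — nodes and relationships, each carrying key-value properties — here is a sketch in plain Python data structures. This is my own illustration with invented names; it is not how Kuzu actually stores data:

```python
# Illustrative sketch of the property graph model: nodes and relationships
# are records, and each can carry arbitrary key-value properties.
nodes = {
    0: {"label": "User", "name": "Ana", "age": 34},
    1: {"label": "User", "name": "Bo", "age": 28},
    2: {"label": "City", "name": "Waterloo"},
}

# Each relationship links two node ids and can hold its own properties.
rels = [
    {"type": "Follows", "src": 0, "dst": 1, "since": 2020},
    {"type": "LivesIn", "src": 0, "dst": 2},
]

# A query like Cypher's `MATCH (a:User)-[:Follows]->(b:User) RETURN b.name`
# conceptually walks relationships instead of spelling out table joins.
followed = [
    nodes[r["dst"]]["name"]
    for r in rels
    if r["type"] == "Follows" and nodes[r["src"]]["label"] == "User"
]
print(followed)  # ['Bo']
```

The point of the model is that the links are explicit in the data, so traversals read naturally in the query language.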
                                         
                                         Ultimately, the query language is a high-level query language, just like SQL,

                                         and it's very much like SQL.

                                         So Cypher is the one

                                         that Kuzu implements;

                                         it's Neo4j's language.
                                         
                                         You know, we try to follow

                                         openCypher.

                                         But, you know, other systems

                                         that adopt

                                         the property graph model,

                                         they have high-level languages

                                         that are very much SQL-like,
                                         
    
                                         and it's not very surprising
                                         
                                         because, in the end,

                                         the core principles of how data is managed in a database management system are always relational,

                                         in the sense that there's only one way that humans know how to develop declarative languages,

                                         and that's really to compile them down to relational operators that process sets of

                                         tuples and output sets of tuples. So at their core, just like any other database management system ever built in history,

                                         they're all, in a sense, relational.
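The point that declarative graph queries compile down to relational operators over sets of tuples can be made concrete with a toy example. This sketch is my own illustration, not Kuzu's planner: it evaluates a two-hop pattern like `(a)-[]->(b)-[]->(c)` as two scans and a self-join on the shared middle variable:

```python
# Toy illustration: a graph pattern (a)-[e1]->(b)-[e2]->(c) compiles down
# to relational operators: scan the edge table twice, then join on b.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]  # tuples (src, dst)

def scan(table):
    """Relational scan operator: emit each tuple."""
    yield from table

def join(left, right):
    """Join left tuples (x, y) with right tuples (y, z) on the middle variable."""
    right_tuples = list(right)
    for (x, y) in left:
        for (y2, z) in right_tuples:
            if y == y2:
                yield (x, y, z)

# Two-hop paths = edges joined with edges, exactly a relational self-join.
two_hops = sorted(join(scan(edges), scan(edges)))
print(two_hops)
# [('a', 'b', 'c'), ('a', 'b', 'd'), ('b', 'c', 'd')]
```

The graph syntax is more convenient to write, but the engine underneath still produces and consumes sets of tuples.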
                                         
                                         But the data model and the query language have graph-specific syntax and graph-specific features,
                                         
    
                                         which makes it suitable for several classes of applications in particular.
                                         
                                         So like fraud and recommendations, whenever you want to
                                         
                                         do essentially some recursive computations, I think they're much better than relational systems
                                         
                                         to program, because they have this special syntax where you can say, you know, I want to find

                                         connections between this record and that record, which you may have represented as nodes. And
                                         
                                         that's very intuitive in a graph model. You're essentially looking for paths.
                                         
                                         So paths are, for example, a first-class-citizen notion,
                                         
                                         which doesn't exist in tables.
                                         
    
                                         Otherwise you need to think of join paths and whatnot,
                                         
                                         which is not very intuitive.
                                         
                                         So there's a mismatch there.
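The recursive path queries mentioned here — e.g. a Cypher pattern like `MATCH (a)-[:Follows*1..3]->(b)` — boil down to a bounded reachability computation. A hedged sketch of that semantics follows (my own simplification: the node names are invented, and it assumes an acyclic graph for brevity, whereas a real engine also handles cycles):

```python
from collections import deque

# Sketch of what a variable-length path pattern like
#   MATCH (a)-[:Follows*1..3]->(b)
# computes: all nodes reachable from `a` within 1 to 3 hops.
# In SQL this typically needs a recursive CTE; in a graph query
# language, paths are first-class syntax.
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}

def reachable(adj, start, min_hops, max_hops):
    found = set()
    queue = deque([(start, 0)])  # (node, hops taken so far)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue
        for nxt in adj[node]:
            if min_hops <= depth + 1 <= max_hops:
                found.add(nxt)
            queue.append((nxt, depth + 1))
    return found

print(sorted(reachable(adj, "a", 1, 3)))  # ['b', 'c', 'd']
```

Expressing this as a one-line pattern instead of a recursive join is exactly the intuitiveness gap being described.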
                                         
                                         So there are other classes, sets of features that they provide.
                                         
                                         And I go into all of these in my blog post.
                                         
                                         But overall, essentially, the two things to
                                         
                                         really remember are that it's an alternative data model with a SQL-like language with alternative
                                         
                                         syntax. And I actually think people should just know alternatives to data management beyond SQL,
                                         
    
                                         just because there's never going to be a perfect-fit database management system, and every application

                                         has its more natural way of modeling data and asking queries, and different systems are going to be

                                         more suitable for that.

                                         So, Kuzu. Give us the elevator pitch for Kuzu, and maybe tell us where the name comes from as well.

                                         So the elevator pitch for Kuzu is that Kuzu aims to be the DuckDB for graphs.
                                         
                                         And right now, for a lot of people, that means something very specific, because DuckDB is a very, very popular and successful system.
                                         
                                         It is a modern graph database management system built with state-of-the-art principles of how essentially general analytics-oriented or query-oriented, I should say,
                                         
                                         database management systems should be built.
                                         
    
                                         And it follows the usability features of DuckDB.
                                         
                                         It's embeddable just like DuckDB and aims to connect very well
                                         
                                         with sort of a set of higher-level analytics libraries,
                                         
                                         specifically graph analytics libraries that users might use,
                                         
                                         aside from the database engine.
                                         
                                         So, you know, one way to think about this is that we'd like to be the default database management storage for people who would like to model their data as a set of node and relationship records.

                                         But maybe they'll do some higher-level, more advanced analytics, like maybe they'll want to build a graph machine learning model over those records, and they'll need to do lots of things before they could do that. So if you just need the graph database management system for your application,
                                         
                                         we want to offer a lot of value by being a state-of-the-art system in the sense that, you know,
                                         
    
                                         we've done a lot of work on graph database management systems and query processing and scalability in particular.
                                         
                                         So all of those features are getting integrated into Kuzu.
                                         
                                         And that's also what DuckDB does for tabular analytics:

                                         it integrates state-of-the-art techniques, plus it integrates

                                         very well with these higher-level tabular analytics libraries like NumPy, etc. And

                                         we're kind of trying to replicate that.

                                         Awesome. So let's go into the architecture of Kuzu a

                                         little bit more. So you said it tries to bring in all of these state-of-the-art components and the latest research into the system.
                                         
                                         So maybe you can tell us about the high-level picture of it.
                                         
    
                                         And maybe we'll walk through each component at a time.
                                         
                                         Sounds good.
                                         
                                         So the high-level sort of principles of how database management systems are implemented are very well known.
                                         
                                         Like every system pretty much goes through similar sets of phases
                                         
                                         when they take a text query and in the end execute the query
                                         
                                         with some executable code.
                                         
                                         So basically we have a parser, binder, et cetera,
                                         
                                         to compile the text query into some in-memory representation.
                                         
    
                                         We have an initial join optimizer. Actually, we're a little different from some systems in that

                                         our initial join optimizer takes the initial representation of the query

                                         and gives a logical plan that optimizes the joins that will happen in the query. And then we have an optimizer that takes that
                                         
                                         logical plan and does a lot of standard optimization techniques like filter pushdowns,
                                         
                                         removing unnecessary projections, removing unnecessary properties (columns, in relational terms),
                                         
                                         several normalizations, so on and so forth. And then that's mapped into a physical plan.

                                         And that physical plan is given

                                         to a task scheduler, which executes it. So that's the high-level sort of journey, as it's the journey
                                         
    
                                         in Postgres, as I'm sure it's the journey in Neo4j or DuckDB. So that high-level thing is
                                         
                                         pretty standard. So I should maybe just start talking about the query processor because that's where we have done most work.
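The filter-pushdown step mentioned in this pipeline can be illustrated on a toy logical plan. This sketch is my own simplification, not Kuzu's optimizer: it rewrites Filter-over-Project into Project-over-Filter so the predicate runs closer to the scan and fewer tuples flow up the plan:

```python
# Toy logical plan nodes, written as nested tuples:
#   ("scan", table)
#   ("filter", column, value, child)
#   ("project", columns, child)
# The rewrite below pushes a filter beneath a projection, a standard
# optimization, so predicates are evaluated as early as possible.

def push_filters_down(plan):
    """Rewrite Filter(Project(child)) into Project(Filter(child))."""
    if plan[0] == "filter" and plan[3][0] == "project":
        col, val, (_, cols, child) = plan[1], plan[2], plan[3]
        if col in cols:  # safe only if the projection keeps the filtered column
            return ("project", cols, push_filters_down(("filter", col, val, child)))
    if plan[0] == "filter":
        return ("filter", plan[1], plan[2], push_filters_down(plan[3]))
    if plan[0] == "project":
        return ("project", plan[1], push_filters_down(plan[2]))
    return plan  # scan: nothing below to rewrite

plan = ("filter", "age", 30, ("project", ["name", "age"], ("scan", "User")))
optimized = push_filters_down(plan)
print(optimized)
# ('project', ['name', 'age'], ('filter', 'age', 30, ('scan', 'User')))
```

A production optimizer does this over a richer algebra with many more rules, but the shape of the transformation is the same.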
                                         
                                         So Kuzu is essentially the successor of a previous system called GraphFlow DB.
                                         
                                         So GraphFlow started in 2017.
                                         
                                         And the motivation for GraphFlow was when I entered essentially this field
                                         
                                         of graph database management system,
                                         
                                         I had felt that back in my PhD, I'd made a big mistake by not understanding how the software

                                         that I was working on, that I was spending my PhD years on, was actually used.
                                         
    
                                         So I worked on these distributed MapReduce-like systems. Pregel was the original system

                                         from Google.

                                         The version that I built was called GPS, and the open-source clone of Pregel was something called Apache Giraph.
                                         
                                         It's still out there, I think, but I doubt it has any users.
                                         
                                         So I did my PhD on that,
                                         
                                         but I never really sort of understood the user base
                                         
                                         and understood how big, how impactful work in that space could be.
                                         
                                         So in general, if you want your research to be impactful,
                                         
    
                                         you need to bet that the software that you're working on,
                                         
                                         at least in data management, at least,
                                         
                                         we all work on some data management or some data processing software.
                                         
                                         You need to bet that that's going to be successful,
                                         
                                         that that's going to live.
                                         
                                         There'll be lots of applications using it. It'll be popular to some extent. And I had not done any of that work. I
                                         
                                         just blindly went into it. And by the time I was graduating, it had become clear to me that
                                         
                                         really there were no users of Giraph or Pregel or GPS, those kinds of systems. I didn't want to
                                         
    
                                         make that mistake. At the same time in 2015, 16, Mike Stonebraker won the Turing Award and had this Turing lecture
                                         
                                         where he talked about the importance of sort of talking to users throughout his career.
                                         
                                         And, you know, obviously he represents one style of research where he advocates that
                                         
                                         the true validator of good and bad ideas is the market, not reviews or academics.
                                         
                                         And so I did this user survey. I was able to convince my PhD student Siddhartha to

                                         really lead this. It was actually a lot of boring work: if you're a

                                         systems-oriented student and you come into my group and I get you to work nine

                                         months on a survey instead of coding, you know, it's super boring. But he went through it, thankfully.

                                         We did this user survey where we asked 90 people;

                                         it was an online survey. There were 90 sort of participants,

                                         89 I think, in the online survey.
                                         
                                         Plus I think we talked to maybe 15 users to really understand what their
                                         
                                         challenges were.
                                         
And I think a lot of the people were actually coming from Neo4j or other graph database management systems, although we wanted to make the survey a bit broader by talking to users of Draft and NetworkX and other sort of popular graph libraries as well. But I think many of the people were essentially coming from Neo4j because it had the biggest user base. And we asked them, what
                                         
                                         are your problems? And they really told us scalability is a big problem in the field.
                                         
                                         So that's how we positioned our research back in the day around 2017.
                                         
                                         Let's work on scalability, how to make essentially graph querying more scalable,
                                         
                                         graph storage more scalable.
                                         
Amine really led the project. Amine Mhedhbi is one of my PhD students who's graduating this year. He was the main, essentially, researcher on the system. And so a lot of those ideas actually are now getting into Kuzu. And a lot of that work was either in storage or query processing. So maybe in terms of those two, we have integrated most of the state-of-the-art techniques that we know. So what are those?
                                         
                                         So in the query processor, Kuzu is a factorized query processor,
                                         
                                         which means that the intermediate results are represented not as flat tuples,
                                         
                                         but as unions of Cartesian products.
                                         
    
                                         And I can get into this.
                                         
It implements some novel join techniques for many-to-many joins, for growing joins, which is a core problem in graph querying. And it then integrates these very well-known, widely adopted, very well-understood techniques. It's vectorized in the sense that columnar DBMSs, relational systems, use the term. So it means that instead of passing a single tuple between operators in your query plans, batches of 2048 tuples are being passed.
                                         
                                         And that's the term vectorization
                                         
    
                                         in the context of query processing.
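To make the batching idea concrete, here is a small sketch, in illustrative Python rather than Kuzu's actual C++ operators, of a scan-filter pipeline that passes fixed-size vectors of values between operators instead of one tuple at a time:

```python
VECTOR_SIZE = 2048  # batch size passed between operators, as in the episode

def scan(values):
    """Source operator: yields fixed-size vectors instead of single tuples."""
    for i in range(0, len(values), VECTOR_SIZE):
        yield values[i:i + VECTOR_SIZE]

def filter_gt(vectors, threshold):
    """Filter operator: consumes a vector, emits the qualifying subset."""
    for vec in vectors:
        yield [v for v in vec if v > threshold]

# Run the two-operator pipeline over 5000 "age" values.
ages = list(range(5000))
result = [v for vec in filter_gt(scan(ages), 4000) for v in vec]
```

The point is only that each operator call amortizes its overhead over 2048 values rather than paying it per tuple.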
                                         
                                         So it's a vectorized query processor
                                         
                                         and it does Morsel-driven parallelism,
                                         
which is a technique that came, I think, maybe introduced in HyPer, I'm not sure, but it was a Thomas Neumann paper. And DuckDB adopted it too. It's a way of parallelizing
                                         
    
                                         a single query plan
                                         
                                         instead of parallelizing multiple queries running
                                         
                                         in parallel.
                                         
                                         Basically, it's a very simple idea. It just means
                                         
                                         that if there is a pipeline,
                                         
suppose you need to scan and filter something, right? So let's take something very simple. You need to find all your nodes that have an age greater than 50, and there are maybe, you know, a million nodes.
                                         
Morselization just means that you partition the amount of data from disk that needs to be scanned, and different threads are competing to get different morsels, essentially partitions of things, and they run different pipelines, and they coordinate where the pipelines end.
                                         
                                         So that's the way we parallelize queries.
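A toy version of this, in Python for illustration only (Kuzu itself is C++, and the morsel size here is made up): threads compete for morsels from a shared queue, and each runs the same scan-and-filter pipeline on whatever morsel it grabs.

```python
import threading
import queue

MORSEL_SIZE = 1024  # rows per morsel; an illustrative value, not Kuzu's

def run_pipeline(ages, num_threads=4, threshold=50):
    """Threads compete for morsels (partitions of the scan range) and each
    runs the same scan->filter pipeline on the morsel it grabs."""
    morsels = queue.Queue()
    for i in range(0, len(ages), MORSEL_SIZE):
        morsels.put((i, min(i + MORSEL_SIZE, len(ages))))

    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                lo, hi = morsels.get_nowait()  # compete for the next morsel
            except queue.Empty:
                return                         # no morsels left: pipeline ends
            local = [a for a in ages[lo:hi] if a > threshold]
            with lock:
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

matches = run_pipeline(list(range(10000)))
```

The key property is that a single query plan is parallelized: all threads run the same pipeline, and faster threads simply grab more morsels.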
                                         
    
                                         So those are essentially maybe the core sort of ingredients to the query processor.
                                         
                                         Storage is a columnar design, but it's graph optimized.
                                         
So why is it columnar? The motivation for this,
                                         
                                         and this is essentially our position as a research group,
                                         
                                         from having talked to a lot of people
                                         
                                         who use graph database management systems,
                                         
                                         a lot of the applications that graph database management systems
                                         
                                         are very good at supporting,
                                         
    
                                         where they add value,
                                         
                                         is in essentially query heavy workloads.
                                         
                                         So you're looking for something very complex that requires essentially maybe finding a
                                         
                                         path between one node and another node, or even like just going like two, three degrees
                                         
                                         of neighborhoods around nodes to see what they're connected to.
                                         
Again, like classic examples are fraud. You know, take a bank account. Suppose you get a financial transaction network you want to represent as a set of nodes and edges. There are account nodes. There are transfers between accounts. And you just want to know, is this node connected to a suspicious node from, take your pick of country?
                                         
The challenge is that graphs, by their nature, I mean, one core property of the data that applications tend to use graph databases for is that nodes are going to connect to many other nodes. So there's going to be many, many relationships. So if one of your accounts is transferring on average to 100 different accounts and they transfer to 100 different accounts, the part of the database that you need to explore and search just grows for each hop that you do, for each join that you do. And so that ends up requiring scanning and processing large chunks of the database. So that says these systems need to be read optimized, query optimized, versus systems that might just want to be very good at updates, you know, where you're taking lots of transactions,
                                         
    
                                         you want to update them.
                                         
                                         They're not really transactional.
                                         
                                         I mean, as any database management system,
                                         
                                         they need to support transactions
                                         
                                         because people will want atomicity and durability.
                                         
                                         They don't want to lose their data,
                                         
                                         but they're not supposed to offer a lot of value in that space.
                                         
                                         I think that's a mismatch.
                                         
    
                                         So as a design choice,
                                         
                                         many of them try to optimize for this query speed.
                                         
And so our storage is columnar because, well, I mean, it's a very well understood principle that columnar systems are much more scan friendly. They give you opportunities to scan fewer parts of the data than row-oriented systems if the queries ask for fewer parts of the data. Which means that every property that we have on nodes, for example an age property, gets its own sort of set of disk blocks or file, if you will. So it's columnar as a core principle.

And in terms of storing the edges and edge properties, relationship properties, it's still a columnar system. We adopt a disk-based version of the well-known CSR format. That's a very common format to represent graphs in memory, and we have a disk-based version of it.

So that's also columnar. I mean, it's actually called compressed sparse row format.
                                         
                                         So it's a columnar system.
                                         
                                         You can really think of CSR as a compression
                                         
                                         of two sorted columns.
                                         
                                         I won't get into it, but it's another columnar format.
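For readers who haven't seen CSR, here is the in-memory idea in miniature (Kuzu's disk-based variant is more involved): `offsets[v]` and `offsets[v+1]` delimit node `v`'s neighbours in one contiguous array, so reading all edges of a node is a single sequential scan.

```python
# Tiny graph: (source, destination) edge pairs over 3 nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
num_nodes = 3

def build_csr(edges, num_nodes):
    """Build the two CSR arrays: per-node offsets and a packed neighbour list."""
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:          # count each node's out-degree
        offsets[src + 1] += 1
    for i in range(num_nodes):    # prefix-sum the counts into offsets
        offsets[i + 1] += offsets[i]
    neighbours = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in sorted(edges):  # place each destination in its node's slot
        neighbours[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbours

offsets, neighbours = build_csr(edges, num_nodes)

def neighbours_of(v):
    # One contiguous slice: this is the scan-friendliness CSR buys you.
    return neighbours[offsets[v]:offsets[v + 1]]
```

You can see the "compression of two sorted columns" view here: the sorted source column collapses into the offsets array, and the destination column becomes the packed neighbours array.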
                                         
                                         And we have one choice that we made,
                                         
    
                                         which sometimes I question actually,
                                         
                                         but we made it because that's what we've done in GraphFlow too,
                                         
                                         is that we double indexed the edge properties.
                                         
                                         So which means that if you have a connection,
                                         
                                         suppose account A has sent money to account B,
                                         
                                         and there's a property on that edge,
                                         
                                         maybe the amount of money that was sent is 500,
                                         
we store that 500 twice, once for one of the nodes, the source, if you will, and once for the other node. And that's a choice that we made. And I think it's especially good in a disk-based system. I mean, it increases your storage, but by a constant factor of two. And you can't, I mean, you can't do less than that.
                                         
                                         And if you need to essentially read, scan the incoming properties of a specific node,
                                         
                                         and in general, you should always optimize for these very simple queries, because they're going
                                         
                                         to be always most common. Just like in any phenomenon in the world, there's going to be a
                                         
    
                                         skew in terms of the complexity of the query and how frequently you're going to get it in an actual
                                         
                                         system. So for the very simple queries, pinpoint queries where you say,
                                         
                                         give me this node and all edges around it.
                                         
                                         And if you need to read incoming edges,
                                         
                                         then you want to locate them.
                                         
Essentially, this gives us a way to have them all stored in the same page. So that's why we double index it.
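A minimal sketch of the trade-off, in illustrative Python rather than Kuzu's storage layout: the edge property lives in both a forward and a backward adjacency structure, so scanning either direction of a node's edges is one contiguous read, at the cost of keeping each property value twice.

```python
# (source, destination, amount) transfers between accounts.
transfers = [("A", "B", 500), ("C", "B", 120), ("B", "A", 75)]

forward, backward = {}, {}
for src, dst, amount in transfers:
    forward.setdefault(src, []).append((dst, amount))   # copy #1 of the amount
    backward.setdefault(dst, []).append((src, amount))  # copy #2 of the amount

# Incoming edges of B, with their property, in one contiguous list:
# no scan of the forward lists is needed.
incoming_to_b = backward["B"]
```

The alternative discussed next in the episode would store each amount once and pay for the incoming-edge reads with an index plus extra lookups.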
                                         
    
                                         We don't double index node properties.
                                         
                                         So node properties, you can,
                                         
                                         but you'd have to replicate hundreds of times
                                         
                                         if a node is being touched by hundreds of other nodes.
                                         
                                         So that's another thing.
                                         
                                         So we've got CSR format.
                                         
                                         We've got columnar format for node properties.
                                         
                                         CSR format for relationships and relationship properties.
                                         
    
                                         We've got double indexing there.
                                         
                                         Just on that for a second.
                                         
                                         Why do you sometimes regret that decision?
                                         
Well, why do I regret this? So there are alternative ways. I mean, I could get a bit technical, but there are alternative ways that we could have reduced the storage, I think. I don't want to say I regret it. I sometimes question it because we never very properly studied how fast we could have been with the alternative method. The alternative method is what a relational system would do. You'd have a relationship table, no particular order there. All of the edges are stored there, and all the properties are stored once. But you'd need to have an index, essentially. You need to have, for which node, what are its edges.
                                         
                                         Okay.
                                         
    
                                         So that's not storing the property though.
                                         
                                         That's just storing the IDs of the neighbors.
                                         
                                         And maybe using that, you could just store the IDs of the edges as well.
                                         
And then do a full scan with some smart way of filtering so you don't scan the whole table. So that's an alternative way: you could store edge properties only once. And you could still maybe do fine. Actually in GraphFlow, which was an in-memory
                                         
                                         system, we didn't double index. We actually did a single index, but it was in memory. So, you know,
                                         
                                         we'd have some random reads,
                                         
    
                                         but because it's in memory, we didn't care too much.
                                         
We never really tested that. So I was a bit more conservative when we said we're going to do the move. Not just me, back in the day; my student who implemented GraphFlow's storage also implemented the initial version of Kuzu's storage.
                                         
                                         We had this intuition that, you know
                                         
                                         what, let's just double index, let's forget about
                                         
    
                                         this problem. But we never
                                         
                                         properly studied this.
                                         
That's why I say I sometimes still question it: could we just store it once and still be okay?
                                         
                                         We never did that study.
                                         
One for an interesting research paper in the future, I guess.

Yeah, maybe.

Yeah. So that's storage.
                                         
So let me just tell you very quickly about other things. So the join optimizer is a core sort of optimizer in every system, where you spend a lot of time.
                                         
                                         It's a dynamic programming-based optimizer.
                                         
                                         And Thomas Neumann really has done the most work in that space.
                                         
    
                                         So we try to get that.
                                         
We have a buffer manager that's based on another recent University of Munich paper. VM Cache is the paper. So that area is led by Viktor Leis. He was at a different university and moved to the University of Munich, where Thomas Neumann is.

By the way,
                                         
    
                                         for any of the listeners,
                                         
                                         Thomas Neumann is like the god
                                         
                                         in systems,
                                         
                                         in databases.
                                         
                                         That's why his name is coming up a lot.
                                         
I mean, not the only god, one of the gods.
                                         
And we do look at his work. And Viktor is also at the University of Munich, and he's done the most work, and the best work, I think, in buffer management.
                                         
                                         So we read that paper and try to integrate it into Kuzu.
                                         
                                         So we have a buffer manager based on VM cache.
                                         
                                         Let me not get into it.
                                         
                                         But, you know, people can easily find that.
                                         
    
And we have a transaction manager. I think if one day a historian went and asked, what was the simplest correct transaction manager? I'm very confident it's correct, but it's also very simple. It's serializable kind of by design. I'm not super proud of it, but we had to add these features in the simplest way possible.
                                         
I'm not proud of the way we implemented it, but I'm proud of how simply we were able to implement it. It's based on write-ahead logging. So, you know, writes go into different sort of pages first and they're logged. And during checkpointing, they're copied over. And by design, you can only have a single writer in the system. So it's kind of optimized for bulk writes instead of many concurrent writes. And reads and the single write transaction can work concurrently. So it is serializable in that sense.
                                         
                                         So maybe one day we'll make that more optimal if there's any demand,
                                         
                                         but again, these are choices that we made: our engineering efforts are focused on optimizing reads and scalability
                                         
                                         rather than transactions. So transactions — in one sense it's a bit naive and
                                         
                                         simple, but it's correct and, I think, relatively well tested. We also made a
                                         
                                         lot of simplifying assumptions, like only allowing a single writer in the system, and things like that.
                                         
                                         Yeah, cool. So let's dig in a little bit more to the factorized query processor, because this was the focus of the
                                         
                                         CIDR paper, right? So we've been through what factorization is, but maybe you can touch
                                         
                                         on the design principles you used when designing these new join operators, and
                                         
                                         just tell us more about the ones that you've got in Kuzu.
                                         
                                         Yeah, so factorization is, again, a compression technique, but it's not
                                         
                                         a compression technique you use to make your disk files smaller, which is how compression
                                         
                                         is usually used in systems — you've got a file that stores maybe a property
                                         
                                         or a column of a table, and you want to make it small
                                         
                                         because there are lots of repetitions and whatnot.
                                         
                                         Factorization is interesting because it's actually a technique
                                         
                                         that's designed specifically for query processing
                                         
                                         for when the queries contain many-to-many joins,
                                         
                                         meaning that intermediate results can grow.
                                         
    
                                         So the core idea is very simple.
                                         
                                         So think of like the simplest star query.
                                         
                                         You've got a middle node B,
                                         
                                         you've got one outgoing edge to a node A,
                                         
                                         you've got another outgoing edge to C,
                                         
                                         and that's your query pattern.
                                         
                                         So this is essentially what you would write in the match clause.
                                         
                                         And you want to find essentially all incoming
                                         
    
                                         and outgoing neighbors of a particular node B.
                                         
                                         And suppose you get a filter on B,
                                         
                                         and that filter says: give me Jack's outgoing and incoming edges —
                                         
                                         give me Jack's neighborhood, basically.
                                         
                                         So that's the simplest star query you could get.
                                         
                                         Now, if you've got 10,000 outgoing neighbors and 10,000 incoming neighbors,
                                         
                                         and you represent this output in flat format, it's 10,000 times 10,000.
                                         
    
                                         So, you know, 100 million tuples would be represented there.
                                         
                                         But the neighborhood is actually much smaller —
                                         
                                         there are only 20,000 edges there.
                                         
                                         So factorization is a way of representing this output as a Cartesian product instead.
                                         
                                         So it's Jack's node and data,
                                         
                                         Cartesian product with the 10,000 outgoing neighbors,
                                         
                                         Cartesian product with the 10,000 incoming neighbors.
                                         
    
                                         So that Cartesian product is another way to represent
                                         
                                         exactly the same 100 million tuples.
                                         
                                         And that's the way factorization works: whenever possible, we try to represent sets of tuples as Cartesian products in a much more compressed way and pass that between operators.
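To make the arithmetic above concrete, here is a small illustrative Python sketch. This is not Kùzu's internal data structure — the `outgoing`/`incoming` lists are a hypothetical stand-in for the neighborhood in the star-query example:

```python
from itertools import product

# Hypothetical neighborhood of the node matched by the filter on B ("Jack"):
outgoing = [f"a{i}" for i in range(10_000)]   # 10,000 outgoing neighbors (A side)
incoming = [f"c{i}" for i in range(10_000)]   # 10,000 incoming neighbors (C side)

# Flat representation: one materialized tuple per (A, B, C) match.
flat_count = len(outgoing) * len(incoming)    # 10,000 * 10,000 = 100,000,000

# Factorized representation: the same result kept as a Cartesian product
# {Jack} x outgoing x incoming -- stored as three small factors.
factorized = ("Jack", outgoing, incoming)
factorized_size = 1 + len(outgoing) + len(incoming)   # 20,001 stored values

print(flat_count)        # 100000000 tuples if flattened
print(factorized_size)   # 20001 values stored factorized

# Flattening (what a traditional processor materializes) enumerates the full
# product; factorization defers that until some operator actually needs it.
first_flat = next(product(["Jack"], outgoing, incoming))
print(first_flat)        # ('Jack', 'a0', 'c0')
```

The point of the sketch is only the size gap: roughly 20,000 values versus 100 million tuples for the same logical result.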
                                         
                                         So you asked about the design goals.
                                         
                                         So one thing was just be able to implement factorization.
                                         
                                         And we had a particular way of implementing this,
                                         
                                         based on my PhD students' work in GraphFlow,
                                         
                                         that is really directly implemented in Kuzu.
                                         
    
                                         So we knew we wanted to essentially represent
                                         
                                         intermediate relations, not a set of flat tuples,
                                         
                                         like maybe DuckDB does,
                                         
                                         or pretty much any of the other systems do,
                                         
                                         but in a factorized way.
                                         
                                         We also wanted to make sure we don't use index nested loop joins. This is the type of
                                         
                                         operator that we had in GraphFlow, or that Neo4j has, for example: an expand operator, a join
                                         
                                         operator that takes a node record and expands it to its neighbors. If you use that in an in-memory system, maybe that's fine.
                                         
    
                                         But in a disk-based system, if that's the type of join you do —
                                         
                                         an expand operator that takes a set of tuples
                                         
                                         and, for each one, expands it —
                                         
                                         that's the way index nested loop join operators work in the database literature.
                                         
                                         From one side, you read the tuples;
                                         
                                         from the other side, you go and scan and perform the join one tuple at a time. So that's actually
                                         
                                         not going to be robust. And I don't say that's always bad. Actually, in the paper, there's
                                         
                                         many cases where that actually works the best in very, very, very basic cases, like, you know,
                                         
    
                                         you want one node and you want its neighbors,
                                         
                                         expand type of join operators do well,
                                         
                                         but it's not robust when you are getting, you know,
                                         
                                         thousands, tens of thousands, hundreds of thousands,
                                         
                                         tuples that you need to expand.
                                         
                                         It just degrades very quickly because, again, you can't really localize neighborhoods.
                                         
                                         So it degrades to random IOs.
                                         
                                         So we didn't want to do that.
                                         
    
                                         So we want everything to be based on hash joins,
                                         
                                         which is the common kind of wisdom nowadays for doing joins,
                                         
                                         at least equality joins in systems.
                                         
                                         And we also wanted to,
                                         
                                         when possible, behave like index nested loop join.
                                         
                                         Index nested loop join has this one advantage: if all I want is Jack's outgoing neighbors,
                                         
                                         you know, an expand operator — maybe preceded
                                         
                                         by a scan operator that gave Jack's node ID —
                                         
                                         all it does is look up in an index the IDs of the neighbors
                                         
                                         and then do the minimum amount of IO,
                                         
                                         maybe reading only the properties
                                         
                                         of those nodes or edges.
                                         
                                         So that's actually, you know, good because if you got, you know, a 10 gigabyte file,
                                         
                                         you might just end up reading, I don't know, a few kilobytes of things, even though it's random IO,
                                         
    
                                         you're just reading very few parts of the database. So in those cases, index nested loop join has an
                                         
                                         advantage. Again, I don't think it's robust, but we also wanted to behave like that when the queries were very selective — when we needed to join a single node record or a few node records.
                                         
                                         So those are the three goals.
                                         
                                         Factorization, make it hash join based, but also try to avoid scanning essentially files.
                                         
                                         And the reason this is a problem is that hash join always reads entire files.
                                         
                                         So by default, there's a build phase and there's a probe phase.
                                         
                                         There's no mechanism to say: only read me the neighbors of this one particular node.
                                         
                                         So those were the three design goals.
                                         
    
                                         And we have a way of implementing this in the system
                                         
                                         with these specialized join operators, yeah.
                                         
                                         Awesome. Cool, yeah.
                                         
                                         So maybe you can tell us a little bit more about these specialized join operators. There's three
                                         
                                         in the paper; maybe you can just give us the intuition behind how they work, essentially.
                                         
                                         Yeah, so let me just tell you the core idea of how, in a case where you
                                         
                                         are maybe just reading Jack's neighbors, this would work, and how we would still be hash join based
                                         
    
                                         but essentially achieve all the three goals. So it will be difficult to see how we get
                                         
                                         factorization because the query is too simple — we could expand it later to see
                                         
                                         how factorization is achieved. But suppose all the query has to do is read Jack's
                                         
                                         neighbors. So what will happen is that just like in any hash join,
                                         
                                         there's a build phase, which creates a hash table. And then there's going to be a probe phase,
                                         
                                         which essentially scans the left side of the hash join operators, and then does probes into this
                                         
                                         hash table that was built. So we'll do something along the lines of: you're going to read Jack's record, and we will figure out Jack's node ID — maybe it's seven. And then we're going
                                         
                                         to have a hash table that puts seven and Jack there. Okay. Suppose you read only the name of
                                         
    
                                         Jack — maybe that's the primary key — but the internal node ID is seven. Now, the hash join is going to join that
                                         
                                         record with all of its neighbors. At that point, once we create the
                                         
                                         hash table, we know that all we need to scan are edges whose source is seven.
                                         
                                         So we pass that — it's called sideways information passing — as a mask of node IDs to the probe side.
                                         
                                         And that probe side is going to use that mask to know which
                                         
                                         parts of those CSRs to read. So it will only read node seven's outgoing neighbors
                                         
                                         and avoid reading the entire file.
                                         
                                         And so we have this mechanism
                                         
                                         so that only node seven's outgoing neighbors will be read,
                                         
                                         and then they will be probed and joined with seven.
                                         
                                         So that's essentially the core idea of how we avoid scans:
                                         
                                         we pass these node masks to know which parts
                                         
                                         of the relationship tables or relationship indices should be read,
                                         
                                         or which node properties should be read.
                                         
                                         So that's the core principle in these different operators.
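As a rough sketch of the mechanism just described — illustrative Python only, not Kùzu's actual operators; the `offsets`/`edges` CSR and the `build`/`probe_scan` functions are hypothetical toy versions:

```python
# Hypothetical CSR-style adjacency: node v's neighbors are
# edges[offsets[v] : offsets[v + 1]].
offsets = [0, 2, 5, 5, 8]          # 4 nodes; node 1 owns edges[2:5]
edges   = [1, 2, 0, 2, 3, 0, 1, 2]

def build(rows):
    """Build phase: hash the selected node records and emit a node mask."""
    table = {}                      # node_id -> payload (e.g., the name)
    mask = [False] * (len(offsets) - 1)
    for node_id, payload in rows:
        table[node_id] = payload
        mask[node_id] = True        # sideways information passed to the probe side
    return table, mask

def probe_scan(table, mask):
    """Probe side: scan only CSR ranges whose source node is in the mask,
    instead of reading the whole relationship file."""
    for v in range(len(mask)):
        if not mask[v]:
            continue                # skip: no IO for unmasked source nodes
        nbrs = edges[offsets[v]:offsets[v + 1]]
        yield table[v], v, nbrs     # factorized: one tuple per source node

# Suppose the filter selected only node 1 ("Jack"):
table, mask = build([(1, "Jack")])
print(list(probe_scan(table, mask)))   # [('Jack', 1, [0, 2, 3])]
```

Only node 1's CSR range is touched; a real disk-based system would translate the skipped ranges into IO that never happens.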
                                         
    
                                         Now, that doesn't explain how we get factorization.
                                         
                                         The way we get factorization is that,
                                         
                                         in CSR, the edges are localized.
                                         
                                         Suppose you had a hundred outgoing neighbors:
                                         
                                         we don't pass them one by one.
                                         
                                         We pass them as a single sequence of a hundred, together with seven.
                                         
    
                                         So we just represent this as seven
                                         
                                         times those 100 — that's the tuple that's being sent — and we're going to probe on seven, so that
                                         
                                         seven and the hundred outgoing neighbors will be joined with a single probe, not a hundred probes.
                                         
                                         And with that single probe we're going to read the Jack property, which was in the
                                         
                                         hash table. So what's being passed is always these Cartesian products,
                                         
                                         unless some operator needs to flatten
                                         
                                         and break those Cartesian products.
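The single-probe point can be sketched in a few lines of Python — again purely illustrative, with a hypothetical `probe` helper standing in for a hash-join probe:

```python
# Hypothetical hash table from the build phase: node ID -> payload ("Jack").
hash_table = {7: "Jack"}

probe_count = 0

def probe(table, key):
    """One lookup into the hash table; counts probes for illustration."""
    global probe_count
    probe_count += 1
    return table[key]

# A factorized tuple: source node 7 times its 100 outgoing neighbors.
factorized_tuple = (7, [f"n{i}" for i in range(100)])

# Join with ONE probe: attach the payload to the whole neighbor list at once,
# rather than probing separately for each of the 100 flat tuples.
src, neighbors = factorized_tuple
joined = (probe(hash_table, src), src, neighbors)

print(probe_count)       # 1 probe, not 100
print(len(joined[2]))    # still logically represents 100 joined tuples
```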
                                         
                                         Okay, nice.
                                         
    
                                         So I guess the kind of question that falls out of it is kind of
                                         
                                         how much better are these join operators on a graph workload?
                                         
                                         So maybe you could tell us a few kind of numbers,
                                         
                                         some of the experiments you ran.
                                         
                                         So we did several experiments,
                                         
                                         and I'm not a big fan of system-to-system comparisons
                                         
                                         because it's very difficult to get these, quote-unquote,
                                         
                                         competitor systems in their ideal performing sort of configurations.
                                         
    
                                         Yeah.
                                         
                                         So we do try to compare, essentially,
                                         
                                         what an index nested loop join configuration of Kuzu would look like compared with this, you know, node-mask-based Kuzu.
                                         
                                         So we're isolating — we're just trying to study the pros and cons of this passing of masks,
                                         
                                         which is an overhead, basically; that's not something that index nested loop join or hash join does —
                                         
                                         versus the benefits we get from getting rid of index nested loop joins in the system. So one meaningful comparison is Kuzu's current version versus a Kuzu index nested loop join variant,
                                         
                                         where we added an index nested loop join operator to the system
                                         
                                         and do the joins using these expand-like operators.
                                         
    
                                         And the general message of those experiments is that when nodes are very selective,
                                         
                                         this overhead obviously is not worth it:
                                         
                                         expand-type operators are going to perform better
                                         
                                         in those very extreme selectivity cases,
                                         
                                         where only a couple of nodes are being passed. But they degrade
                                         
                                         very quickly compared to these node-mask-based operators. This was tested both
                                         
                                         on some large benchmarks like LDBC and on some microbenchmarks, which are actually my favorite
                                         
                                         experiments in any system, because you can really focus on a particular behavior of the system.
                                         
    
                                         And then, you know,
                                         
                                         when the selectivity is very high —
                                         
                                         that is, when many nodes qualify —
                                         
                                         there's also an overhead, right?
                                         
                                         You're not really getting much benefit
                                         
                                         from passing those masks,
                                         
                                         because those node masks are not selective.
                                         
                                         They're not helping you avoid scans.
                                         
    
                                         So at those two extremes, you know,
                                         
                                         there is an overhead and you could do better,
                                         
                                         but overall it doesn't degrade the way index nested loop join operators degrade, and it's not
                                         
                                         way off like DuckDB is, for example, which doesn't have a node mask and just does hash joins,
                                         
                                         forcing the system to scan all these large files. So, you know, in comparison,
                                         
                                         it's kind of like a trade-off that you make.
                                         
                                         So at those extreme cases,
                                         
                                         there are systems that would outperform this type of processing
                                         
    
                                         or there are techniques.
                                         
                                         But overall, it's, I think,
                                         
                                         a reasonable trade-off to make so that your system
                                         
                                         and your query processing are more robust.
                                         
                                         So that's the type of evaluation that we provided in the paper.
                                         
                                         Yeah, I recommend the listener goes and checks it out as well. But I guess, on that:
                                         
                                         it's possible to support both types of operators in the system, right, and
                                         
                                         have some sort of option that lets you choose a specific
                                         
                                         one based on the selectivity or whatever? So, I mean, you can get the best of both worlds.
                                         
                                         I agree. I agree. It's doable.
                                         
                                         Actually, there are two first authors of the paper,
                                         
                                         Xiyang and Guodong, who contributed the most to the paper,
                                         
                                         so they're listed as first co-authors.
                                         
                                         They have been arguing about when we should avoid semi-masks —
                                         
                                         we call these node masks that I mentioned semi-masks —
                                         
                                         when should we just avoid this overhead completely?
                                         
                                         Because if, at runtime, we see that these node masks are not being
                                         
                                         selective, we might avoid them and do something adaptive.
                                         
                                         Or, if your optimizer is smart enough
                                         
                                         to predict that a node mask will not be selective,
                                         
                                         you can just remove it
                                         
                                         and replace some of those join operators
                                         
                                         with index nested loop joins.
                                         
                                         I agree.
                                         
    
                                         It's always an engineering question of:
                                         
                                         do you want to invest that time?
                                         
                                         Or are you happy with how your query processor
                                         
                                         is performing even with those overheads?
                                         
                                         So it's a matter of that.
                                         
                                         But I do agree that our experiments make it visible
                                         
                                         that there is performance that we leave on the table,
                                         
                                         and that the trade-off is made explicit.
                                         
                                         Awesome stuff.
                                         
                                         Yeah, I'm going to kind of roll the next two questions into one
                                         
                                         because I always ask kind of what are the limitations of whoever I'm interviewing,
                                         
                                         whatever they're working on. But obviously, Kuzu is a very young system,
                                         
                                         so it's not feature complete, right? There's a lot of road to run still. So I
                                         
                                         want to ask you more: where do you go next with Kuzu?
                                         
                                         Yeah, so I have several theses that are being written on Kuzu. One thing that we don't do, or haven't
                                         
    
                                         invested much time in graph flow too there was no work on this i didn't have a student
                                         
                                         to take this topic is recursive queries so which is essentially actually an area that graph database
                                         
                                         is need to shine in certainly their query languages are much better for recursive queries
                                         
                                         because recursive is kind of a first class citizen historically too like even in those systems in 60s you know 80s 90s 2000s in all those graph-based
                                         
                                         systems like rdf object-oriented system even like the 60s system like ids recursive application
                                         
                                         query like applications that wanted to do recursive querying were always because of some of the killer
                                         
                                         apps so so there is a thesis being written on that.
                                         
                                         So we are essentially trying to implement some new query processing techniques
                                         
    
                                         for like shortest paths or for variable length joins and things like that.
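
As a rough illustration of the kind of computation such a recursive, variable-length query asks the engine to perform, here is a minimal Python sketch of an unweighted shortest path found by breadth-first search; the graph and node names are invented for the example, and real engines use far more sophisticated techniques:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Return one shortest path from src to dst in an unweighted
    adjacency-list graph, or None if dst is unreachable."""
    parent = {src: None}       # visited set + back-pointers
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk parent pointers back to reconstruct the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in adj.get(node, []):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_path(adj, "a", "e"))  # ['a', 'b', 'd', 'e']
```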
                                         
                                         A second thing that I'm very interested in,
                                         
                                         and I have a student who's going to start working on this and we have some
                                         
                                         ideas, is how to make property graph database management systems
                                         
                                         more competent in RDF style graph data.
                                         
                                         So that's another sort of popular data model
                                         
                                         that people broadly label as graph-based.
                                         
                                         And I do think, you know, the property graph model,
                                         
    
                                         so this is like a position that I try to articulate.
                                         
                                         So I think out of all the graph-based systems,
                                         
                                         graph-based models, property graph model
                                         
                                         is the most generic one in which you could build
                                         
                                         the most sort of applications.
                                         
                                         It's more generally applicable,
                                         
                                         just as the relational model as a model
                                         
                                         is very generically applicable. RDF, I
                                         
                                         don't think, is. RDF is really suitable for a very specific
                                         
                                         application domain as a model. I'm not saying it's a small application domain,
                                         
                                         but it's a specific one. Property graphs are not: property graphs, just as the relational model is
                                         
                                         quite generic, can represent lots of application data. And actually, this is one of the other things that we mentioned in our survey paper back in 2017, when we asked people what kind of applications they build on graph databases. And I think you could add to property graph systems some techniques to maybe do some OWL-based reasoning and those things.
                                         
                                         So I don't think vice versa is possible.
                                         
                                         So I don't think you are going to manage your transactions with RDF systems or like financial transactions and frauds.
                                         
                                         I don't think that's a very good fit as a model,
                                         
                                         but I think property graph as a model is a good fit.
                                         
    
                                         So there will be a thesis written on how to essentially make
                                         
                                         property graph-based systems, and specifically this will obviously be done
                                         
                                         on Kuzu, more efficient in RDF-style
                                         
                                         knowledge management, and maybe some level of reasoning. Very related to that is a topic that's
                                         
                                         going to sound very boring to maybe to people, but I still don't know what's the right way to
                                         
                                         manage strings. We don't know this. There's actually no authoritative paper in the field
                                         
                                         on this.
                                         
                                         There are some state-of-the-art compression techniques, but there's no authoritative paper that gives advice: if you assume these things about your data set,
                                         
                                         maybe the properties of the strings, their lengths, et cetera, then here's the right way to
                                         
                                         store them on disk, and here's the right way to manage the overflows of those strings,
                                         
                                         because that's a variable-length data type.
                                         
                                         And here's the right way to represent it in memory.
                                         
                                         So we do these things where we kind of
                                         
                                         blindly copy from Umbra or DuckDB,
                                         
                                         but, you know,
                                         
                                         I don't think it's very well understood.
                                         
                                         I have this dream of finding a student
                                         
                                         to actually just do this investigation
                                         
                                         for about two years: an experimental study of, here are all the techniques
                                         
                                         in the field for how to store strings in memory, how to manage them in memory,
                                         
                                         how to process them in memory, and how to store them on disk. I don't know if we'll get a chance to
                                         
                                         do that, but that's something I hope I can find a student to do on Kuzu as well.
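
As one example of a point in that design space, here is a toy Python sketch of dictionary encoding, a standard string-compression technique for columns: each distinct string is stored once and rows hold small integer codes. This is purely illustrative and is not a description of how Kuzu actually stores strings:

```python
class DictEncodedColumn:
    """Toy dictionary-encoded string column: store each distinct
    string once and keep a per-row integer code."""

    def __init__(self):
        self.dictionary = []   # unique strings, in first-seen order
        self.code_of = {}      # string -> integer code
        self.codes = []        # one code per row

    def append(self, s):
        code = self.code_of.get(s)
        if code is None:
            code = len(self.dictionary)
            self.code_of[s] = code
            self.dictionary.append(s)
        self.codes.append(code)

    def get(self, row):
        # Decode a row back to its string value.
        return self.dictionary[self.codes[row]]

col = DictEncodedColumn()
for s in ["Canada", "Germany", "Canada", "Canada", "Germany"]:
    col.append(s)

print(col.codes)   # [0, 1, 0, 0, 1]
print(col.get(3))  # Canada
```

Dictionary encoding pays off on low-cardinality columns; for long or mostly unique strings, the disk layout and overflow handling questions raised above dominate instead.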
                                         
                                         Yeah, it's one of those things on the face of it,
                                         
    
                                         it sounds quite a hard sell,
                                         
                                         but I imagine that when you get into it,
                                         
                                         it's a fascinating, really interesting project
                                         
                                         once you've kind of got going with it.
                                         
                                         So hopefully a listener listening to this now thinks,
                                         
                                         okay, I'm going to go work on strings for three years.
                                         
                                         Yeah, maybe.
                                         
                                         Yeah, but I mean, so Hannes,
                                         
                                         who is the CEO of DuckDB Labs and one of the main
                                         
                                         engineers there, also mentioned this actually in his keynote at CIDR this year, where the Kuzu paper
                                         
                                         got published. He said he's very surprised that so few people work on strings, despite the fact
                                         
                                         that the majority of the data in databases is strings, and that there needs to be one authoritative paper, I
                                         
                                         think, that gives this advice: this is how you should implement strings if these are your assumptions.
                                         
                                         And I want to see that chart of like whatever those assumptions are, you know, maybe some workload assumptions, maybe some, you know, data set assumptions.
                                         
                                         And then says, here's a good thing that we found that's robust and easy to implement and whatnot.
                                         
                                         Yeah, that would be a really good contribution. I guess, yeah, so the next sort of set of questions I have,
                                         
                                         I tend to ask them to everyone with various different flavors.
                                         
                                         But the first one is kind of how as a software engineer,
                                         
                                         data engineer, can I kind of leverage your work?
                                         
                                         I guess the answer to that is just kind of go and play with Kuzu, right,
                                         
                                         and try and integrate it into your workflow.
                                         
                                         I agree, yeah.
                                         
                                         See if you can kind of fit it in.
                                         
                                         So I guess the follow-up question to that
                                         
    
                                         would be like kind of longer term,
                                         
                                         what sort of impact do you think Kuzu can have?
                                         
                                         So let me also just maybe on the previous question,
                                         
                                         I wanted to say like,
                                         
                                         I just kind of focused on the research
                                         
                                         and things that are happening.
                                         
                                         There's also like, as you said,
                                         
                                         Kuzu is a young system.
                                         
    
                                         So there's a lot of engineering that's happening to add a lot of features and
                                         
                                         we are trying,
                                         
                                         we're getting excited about sort of making Kuzu more usable in the graph
                                         
                                         learning space and trying to connect more with users.
                                         
                                         So we are adding some,
                                         
                                         some features specific to these graph learning libraries.
                                         
                                         So there's going to be all those integrations that are happening.
                                         
                                         So for that question, yeah, so I think playing around with Kuzu
                                         
    
                                         is probably the right message for developers
                                         
                                         or general application developers.
                                         
                                         I mean, if you're a developer of graph database management systems,
                                         
                                         like, say, you're an engineer at Neo4j,
                                         
                                         I think there's some core messages
                                         
                                         that we think some of these core
                                         
                                         join techniques should get into these systems.
                                         
                                         And there's a lot to benefit.
                                         
                                         For other application developers,
                                         
                                         I think, you know,
                                         
                                         there's a set of application domains
                                         
                                         where really relational model
                                         
                                         and SQL is not going to be a great fit,
                                         
                                         nor are the systems that implement SQL
                                         
    
                                         are implementing the right sort of techniques
                                         
                                         to be very performant.
                                         
                                         Even the most state-of-the-art systems,
                                         
                                         like Umbra, are not going to do
                                         
                                         really, really well in many-to-many joins,
                                         
                                         for example, or very, very well in recursive joins.
                                         
                                         Like recursion is something
                                         
                                         that's always a second thought
                                         
    
                                         in relational systems.
                                         
                                         It's first class in graph systems.
                                         
                                         So, you know, if you've got applications that benefit from this set of features, and there's more in my blog post, I think I've listed about five or six of these application domains.
                                         
                                         I think just being knowledgeable about these alternative database management systems is going to make people better engineers, because trying to
                                         
                                         retrofit every application to the relational model is, I think, not a very good idea.
                                         
                                         So people should be educated more on graph database management systems and these
                                         
                                         alternative database management systems, their models, their query languages. It's going to
                                         
                                         make them, I think, better engineers. Yeah, for sure. Definitely agree. Yeah, next question. I think you touched on this a little bit earlier on
                                         
    
                                         when you were talking about the journey and GraphFlow and whatnot.
                                         
                                         But I guess on this journey you've been on,
                                         
                                         what was probably the most interesting thing that you've kind of learned
                                         
                                         and that was maybe quite unexpected as well?
                                         
                                         Yeah, good question.
                                         
                                         Actually, this was the question that I thought most about.
                                         
                                         I don't have a great answer to this.
                                         
                                         I asked Xiyang and Guodong,
                                         
    
                                         the two authors,
                                         
                                         what were their sort of,
                                         
                                         what was the thing that they found most sort of unexpected or interesting?
                                         
                                         I guess like,
                                         
                                         I think we learned a little bit more about ourselves.
                                         
                                         So we haven't done this before.
                                         
                                         So like I am trying to do this in Canada.
                                         
                                         So,
                                         
    
                                         you know,
                                         
                                         by following DuckDB or systems like Umbra or Hyper.
                                         
                                         You know, this is a specific sort of research style that's quite popular in Europe.
                                         
                                         Both DuckDB and the Umbra/Hyper type of systems are being developed in academic settings
                                         
                                         in Europe, and my group is trying to replicate that in Canada with Kuzu.
                                         
                                         So, you know,
                                         
                                         we are doing a lot of engineering and less sort of paper writing than maybe
                                         
                                         normally researchers in academic settings do.
                                         
    
                                         And we are doing this for the first time.
                                         
                                         So one thing that we are noticing about ourselves,
                                         
                                         and it's more a psychological thing
                                         
                                         that I'm finding interesting, is how often our intuitions
                                         
                                         are wrong about what is important and what is not. It just becomes very difficult to judge these
                                         
                                         things in the absence of a large user base. Let me give you an example. We did
                                         
                                         start the project and spent a lot of effort trying to implement these secondary storage structures in the system to store arbitrary key
                                         
                                         value properties. And, you know, that just slowed us down a lot. And that was a lot of code. And right
                                         
    
                                         before we released back in November, we said, you know what, let's just see if anybody is actually
                                         
                                         interested in this, or are people really structuring their graphs, and then we'll add this back in.
                                         
                                         But I mean, when the project started, you know, maybe a year and a half,
                                         
                                         two years before that, nobody ever questioned
                                         
                                         whether we should have started this at all.
                                         
                                         And the mistake that we make often, I'm not saying that we over-optimize,
                                         
                                         is that we get too optimistic: you know, this is going to be really
                                         
                                         important, let's certainly do it, instead of
                                         
                                         trading that off against simplicity. And we kind of implement
                                         
                                         more than we should, and then we throw it away. So that's a mistake
                                         
                                         we consistently make. I think we're getting better at it, but that's
                                         
                                         something that we started seeing more concretely
                                         
                                         by the amount of code that we're also throwing away from the system.
                                         
                                         So that's something to keep in mind.
                                         
                                         That's a very important lesson, I think,
                                         
                                         that if we were to do this again,
                                         
    
                                         many of the features would just start much smaller
                                         
                                         and try to get it complete and optimized and evolved,
                                         
                                         just like text and writing papers: as they say,
                                         
                                         the best way to write 12 pages
                                         
                                         is to write 24 pages and throw half out.
                                         
                                         Start simple, add more, and
                                         
                                         optimize it. It's just going to evolve
                                         
                                         over time.
                                         
                                         Don't overdo it
                                         
                                         is one thing we've realized ourselves; it's a mistake
                                         
                                         we made. But it's in the quote, right?
                                         
                                         Was it, premature optimization is the root of all evil, or something like that?
                                         
                                         Yeah, maybe sometimes.
                                         
                                         It's very real.
                                         
                                         Yeah, we've done several of those for sure.
                                         
    
                                         Cool.
                                         
                                         I guess, kind of while we're talking about the journey as well:
                                         
                                         were there things along the way
                                         
                                         that you tried and that failed,
                                         
                                         that you think might be interesting for people to know about?
                                         
                                         Maybe this is a good example of that anyway.
                                         
                                         So I could tell you ideas we didn't take from GraphFlow
                                         
    
                                         because we were able to write good papers on them,
                                         
                                         but I don't, in the end, advise people to implement them.
                                         
                                         So for example, we have this paper on A-plus indices.
                                         
                                         It's a way of bringing materialized views into graph database management systems.
                                         
                                         I mean, I think they're a fine idea, and, you know, we published this in ICDE,
                                         
                                         which is a top conference and whatnot.
                                         
                                         It's a good interpretation of how materialized views could be brought into a system
                                         
                                         as these indices that support essentially subsets of your edges, which are effectively results of
                                         
    
                                         some queries you may have run on the edges. And we found sort of fine applications of it,
                                         
                                         but it's also not very practical, in the sense that it's a lot of effort. And that's when we sit down
                                         
                                         and we say, okay, we've got a budget of this much engineering we can afford.
                                         
                                         Should we do it in A-plus indices or should we do it like...
                                         
                                         It was like clearly one of the things we said,
                                         
                                         no, we don't have to do this.
                                         
                                         It's not really worth it.
                                         
                                         So in some sense, if you want to interpret it,
                                         
    
                                         it's a failure in that sense
                                         
                                         because it doesn't offer something very practical, I think.
                                         
                                         Or the amount of effort is something we can't afford right now.
                                         
                                         So there are ideas like that that we brought in.
                                         
                                         And there was a lot more work on factorization that we did that's in Amin's thesis that we
                                         
                                         are not adopting.
                                         
                                         Because again, I find it's good to do research
                                         
projects because without those courageous PhD students doing that research, you can't really
                                         
    
                                         distill the practical ideas, I think, but I don't think we really got it right yet. So if you looked
                                         
at the theory of factorization, there are excellent ideas there. So the form of

factorization that we do, again, this is for

interested students who might look at this, is called f-representations. There's another,

even more compressed version that you could keep, called d-representations, which stands for

factorized representations with definitions. The D comes from the definitions. And we don't have a way to essentially adopt that. Again, Amin has something in his thesis, but again, I am not courageous
                                         
enough to actually say yes, because, you know, Kuzu is trying to be a very practical

system. So I'm very selective about what gets in and what doesn't get in. But I think it's good work
                                         
    
                                         that Amin has done. But I'm saying that there needs to be more work until we actually feel confident that, okay, the representations can now be integrated into the system.
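For listeners who want a concrete feel for the idea discussed above, here is a toy sketch of why f-representation-style factorization helps. This is plain illustrative Python under assumed node counts, not Kuzu's internals: a two-hop pattern through a high-degree node can be kept as an unexpanded Cartesian product of adjacency lists instead of being flattened into all combinations.

```python
# Toy sketch of the idea behind factorized (f-)representations, not Kuzu's
# internals. Consider the 2-hop pattern (a)->(b)->(c) where b = 0 is a
# high-degree node with many in- and out-neighbors.

a_neighbors = list(range(100))  # nodes with an edge into b
c_neighbors = list(range(100))  # nodes with an edge out of b

# Flat representation: enumerate every (a, b, c) combination.
flat = [(a, 0, c) for a in a_neighbors for c in c_neighbors]
flat_size = len(flat)

# Factorized representation: keep the Cartesian product unexpanded,
# storing each adjacency list exactly once under b.
factorized = {"b": 0, "a": a_neighbors, "c": c_neighbors}
factorized_size = len(factorized["a"]) + len(factorized["c"])

print(flat_size, factorized_size)  # 10000 200
```

The factorized form stores sums of list lengths where the flat form stores products, which is the source of the query-speed benefits discussed earlier in the episode.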
                                         
                                         Yeah.
                                         
                                         So I don't know if I answered your question, but there are tons of failures.
                                         
                                         That's in the very nature of research, I think, that a lot of the ideas, I think, we just get wrong.
                                         
We publish them, and then somebody else does something better, I think.
                                         
                                         And then hopefully some core sort of principles merge that we know how to do that in a more practical way.
                                         
                                         Yeah. Out of interest, how big is the team at the minute that's working on Kuzu? How many engineers?
                                         
So there's, I think, five of us, including myself, and three master's students. Out of the five, many of them are

doing mainly engineering. They're in the university and whatnot, but because they

got their degrees, they're staying as research associates, or they're postdocs, so they're less

worried about publications.
                                         
                                         And they're just very inspired by sort of trying to be like DuckDB and get a user base.
                                         
                                         So they're doing more engineering, although, you know, obviously we are going to continue writing papers.
                                         
And then there's a couple of people who are new, whom I've started training recently.
                                         
And they'll write some of these theses that I mentioned over the next two, three years. But, you

know, however you count it, including me, maybe seven or eight. Yeah, some

are very junior and not doing any engineering, just research, and some are

doing more research.

Yeah. Is Kuzu kind of your whole sort of vehicle

for research, or do you have other sort of non-Kuzu related projects going as well?

Yeah, so we recently published a paper. Actually, Chang, who is also an author on Kuzu, it was

Chang's master's thesis. Chang was doing a project on open government datasets.

That was a research project that I'm actually very excited about as well.
                                         
So this is an area I would say led by Renée Miller.

I mean,

one of the main people in the

space is Renée Miller, who has a long history of research in open

government datasets.
                                         
                                         So these are the data sets like,
                                         
                                         you know,
                                         
                                         UK has one,
                                         
                                         US has one,
                                         
data.gov, Canada has one, open.canada.ca.
                                         
    
These are ways governments are publishing, essentially, lots of datasets about how the

government runs, with the long-term goal of transparency, of more,

essentially, democratic access to how the governmental institutions

are running, for journalists to keep an eye on, I don't know, inefficiencies

and major problems in society. These datasets are not very usable, though. It's not very easy to find them.
                                         
                                         It's not very easy to integrate them.
                                         
                                         So I have a research project.
                                         
                                         It's unfortunately gotten a bit smaller
                                         
    
                                         since Chang decided he'd rather work on Kuzu
                                         
                                         to essentially make these data sets more usable.
                                         
                                         So I still have, actually, I'm currently this week writing a paper on that.
                                         
                                         So I have that going on.
                                         
                                         And I have a PhD student who's graduating also this year who has been doing work on
                                         
                                         differential data flow and applications of differential data flow to graph processing.
                                         
                                         So for those of you who may not know what differential data flow is,
                                         
                                         differential data flow is one of the technologies behind Materialize Inc.
                                         
    
                                         It's a popular startup in the data space.
                                         
They do essentially streaming SQL and materialize your queries,
                                         
                                         although I might be wrong, so check what they do.
                                         
                                         But they're basically using a technology called timely data flow and differential data flow, which Frank McSherry was very influential in popularizing.
                                         
                                         So differential data flow is the most generic incremental view maintenance technique that I'm
                                         
                                         aware of. It can maintain arbitrary data flows. And one of their best applications is on recursive,
                                         
                                         essentially, data flows and how to maintain the outputs of recursive flows.
                                         
                                         And it relates directly to graphs because a lot of sort of graph analytics is recursive computations.
                                         
    
Like if you want to find shortest paths, if you want to find connected components, those types of graph analytics are often represented as recursive queries.
                                         
And if you represent that recursive, essentially, computation as a data flow and give it to differential data flow, and your input, your graph, is changing,
                                         
                                         it's going to maintain your connected components
                                         
                                         or it's going to maintain your shortest paths, which is an amazing technology.
                                         
                                         It's actually unique in the sense I don't think there is another technology
                                         
                                         that's able to do this.
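As a loose illustration of the incremental-maintenance idea being described, here is a minimal Python sketch. To be clear, this is union-find, not differential dataflow, which also handles deletions and arbitrary dataflows; it only shows the flavor of maintaining a recursive result (connected components) as edges stream in, instead of recomputing from scratch.

```python
# A narrow illustration of incremental maintenance (union-find under edge
# insertions only, not differential dataflow): connected components are kept
# up to date as edges arrive, without recomputing after every change.

parent = {}  # union-find forest: node -> parent

def find(x):
    """Return the representative of x's component (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # shortcut grandparent links
        x = parent[x]
    return x

def add_edge(u, v):
    """Incrementally merge the components of u and v."""
    ru, rv = find(u), find(v)
    if ru != rv:
        parent[ru] = rv

for u, v in [(1, 2), (3, 4)]:
    add_edge(u, v)
print(find(1) == find(3))  # False: {1, 2} and {3, 4} are separate components

add_edge(2, 3)  # a new edge arrives; the two components merge incrementally
print(find(1) == find(4))  # True
```

Differential dataflow generalizes far beyond this insert-only special case: it maintains arbitrary, including recursive, dataflows as inputs change in both directions.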
                                         
                                         So he has been, my PhD student Siddhartha has been working on building
                                         
                                         essentially graph layers over
                                         
    
differential data flow. So I have that going on on the side, although, again, these have gotten smaller

because the students are leaving, uh, to either move on to other things or to move to Kuzu
                                         
and things like that.

Yeah. Great. So I've got, um, three more questions. The first one

is one I really like to ask people. It's really interesting to see the divergent answers I get for it, and I'm sure you'll have a totally new

perspective on it as well. So it's like, how do you go about approaching idea generation and then

determining what to work on? I just think I'd love to know more about your creative process.

I see. Um, so it's a very difficult question to answer. Some of these are student driven, obviously, at this point. And back in the day, it was a bit advisor driven. So a lot of the very productive work that I did back in my PhD actually came from Jeff Ullman. Jeff is a Turing Award winner.
                                         
                                         I'm sure people have read Jeff's books.
                                         
    
                                         So Jeff kind of, I did a lot of work with Jeff during PhD,
                                         
                                         some theoretical work.
                                         
And they were really driven by Jeff and Foto Afrati back then.

She was a collaborator of Jeff's.
                                         
                                         So back then it was driven by that and they would give me specific problems.
                                         
                                         They would be very well defined and I'd work on that.
                                         
                                         So that was the style.
                                         
                                         Now some of the problems are also reversed where some students are going to bring ideas.
                                         
    
But the way I work nowadays is I start with very, very naive, simple things.
                                         
                                         I implement, I make sure that the most naive versions
                                         
                                         of things get implemented.
                                         
                                         So like I'll pose a problem like, you know,
                                         
                                         how should shortest paths be implemented,
                                         
                                         which is an ongoing project right now?
                                         
                                         What can we do in that space?
                                         
                                         And the first thing that we're going to do is
                                         
    
                                         we will do the most basic ways of implementing this.
                                         
                                         And frankly, most of my best ideas come
                                         
                                         when I implement those things
                                         
                                         or I see my students implement those things
                                         
                                         and I look at the queries.
                                         
                                         I find it difficult to come up with
                                         
                                         very good systems research ideas by just,
                                         
                                         I can come up with high level directions,
                                         
    
                                         but then in the end,
                                         
                                         I can't often predict what the papers
                                         
                                         are going to be about beforehand.
                                         
                                         And that comes only when we implement something.
                                         
                                         And then we see, OK, these are the challenges in the implementation.
                                         
                                         So I highly advise all of my students right now.
                                         
                                         And this is the sort of style that I've come to adopt in the last three, four years.
                                         
                                         It's extremely sort of code driven.
                                         
    
                                         Everybody needs to write a lot of code.
                                         
                                         I think that's good for training as well.
                                         
                                         And everybody needs to work in the context of a system.
                                         
                                         So one thing that I'm very against is publishing ideas that are not implemented in serious systems.
                                         
                                         Because they're kind of destined to be, I think, out of touch with the reality of how things get integrated into a system,
                                         
                                         which means they're going to be dismissed.
                                         
                                         So everything needs to be implemented in a serious system.
                                         
                                         Like Kuzu at this point is becoming quite serious that if you do research on Kuzu, you
                                         
    
                                         will get to interact with a lot of components of the system or you need to take, like I'm
                                         
                                         advising a PhD student, not really advising, but I'm talking to a PhD student who works on Neo4j at the University of Waterloo.
                                         
                                         And, you know, she does work inside Neo4j, which is good.
                                         
                                         And that's the style of, I think, research should be done.
                                         
And when you do that, I think, you get good practical ideas.
                                         
                                         Now, that has one risk, which is that when you do that, your ideas are also more humble because you're kind of highly limited by what can be done versus if you were to kind of feel freer and do something on the side on your own scratch code.
                                         
                                         So those papers are more easily rejected.
                                         
                                         So I do get a lot of rejections. My students and I, you know, a lot of the work that we do gets quite a bit of rejection,
                                         
    
unfortunately, because some of the ideas end up more like, you know, here are some sort of core

engineering principles, and it's a bit hard to make them look

fancy from a research perspective. But that's the style that I work in, I think. So write the code and see what doesn't work.
                                         
                                         And then after enough failures, good ideas come, I think.
                                         
                                         And they always come, by the way.
                                         
                                         So no need to worry about that.
                                         
Some idea always comes along that you can publish, I think.
                                         
                                         Awesome.
                                         
    
                                         Yeah, that's a great answer to add to my collection of answers for that question.
                                         
                                         Great stuff.
                                         
                                         Cool.
                                         
So just two more questions now. So the first

one is, um, what do you think is the biggest challenge facing us now in database research?

Maybe specifically, um, we can focus this towards graph data management. And there was a recent,

um, debate on graph databases published in The Register. I'd like to get your take on that

as well, if possible.

Uh, actually, I didn't read those pieces, because I find those things a bit distracting.
                                         
    
And so some of those are more around personalities,

I think, or around some companies,

and companies have essentially energies around them.
                                         
                                         So I actually, I genuinely did not read them.
                                         
                                         Although I know the general topic that was discussed
                                         
                                         and I obviously have my position.
                                         
                                         So like anybody I think who claims graph databases
                                         
                                         will take over relational systems
                                         
    
                                         is way off in their predictions.
                                         
                                         I don't think that will happen.
                                         
                                         I think you gotta be,
                                         
                                         I mean, the relational market is humongous
                                         
                                         and very well established.
                                         
And I don't think, you know, people are going to switch to

graph technology for tons of applications. So that's not going to happen. But

anybody who says graph databases are going away is way off as well. I don't think they're

going anywhere. Actually, as I have tried to say, this style, the one that Neo4j kind of brought, the

data model and the Cypher-like sort of query languages
                                         
                                         are going to stay.
                                         
                                         And I think they're actually going,
                                         
                                         they're likely to be the most successful
                                         
                                         of every wave that has come so far.
                                         
                                         And there's more buzz around them too,
                                         
                                         but I don't see them going away like XML databases.
                                         
    
                                         I think they're here to stay.
                                         
                                         I think they've got a relatively healthy
                                         
                                         sort of application base
                                         
                                         and more can be done in the graphs.
                                         
                                         I mean, one reason that convinces me of this is that there's really no other data model other than a graph and tables that is very generic to represent application data.
                                         
And there will always be appeals to choose one over the other. And some of this is also just a sociological question, that people just prefer thinking of
                                         
                                         certain sort of records as nodes versus tables. So they've done a good job. I think property
                                         
graph model is a good model. So I do see it remaining around. But obviously, I don't think graph databases are going to take over
                                         
                                         relational systems or something. That's also way off.
                                         
                                         That doesn't demonstrate, I think, a good understanding
                                         
                                         of the core sort of
                                         
                                         power around relational
                                         
systems. But graph databases are here. They're going to get better.
                                         
                                         The market is going to grow healthily, I think.
                                         
                                         There will always be many applications that use those systems. What is a challenge in databases?
                                         
    
So, I mean, I don't see myself as a visionary, frankly. So I can't sort of
                                         
foresee what kind of big problems people are going to, uh, work on. But one thing that

I am kind of inspired about, and I'm just putting it out there so that maybe some

PhD students who might run into this podcast might want to get curious about it, is that there has been an idea, way back, that
                                         
                                         actually stemmed from AI, artificial intelligence,
                                         
                                         from very early days historically of building information systems that can do deductions.
                                         
                                         And that has its own niche sort of communities with like automatic proofs
                                         
                                         and things like that, where it connected very well with the database community back in the day.
                                         
    
                                         I don't think the communities talk as much,
                                         
but there are these logic-based systems like Datalog, which is essentially a part of first-order

logic that really captures the relational calculus.
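As a small illustration of the recursion that Datalog adds on top of that fragment, here is the classic transitive-closure program evaluated with semi-naive fixpoint iteration. This is a textbook sketch in Python over a made-up three-edge graph, not any particular engine.

```python
# Textbook sketch (not any particular engine) of the recursive Datalog program
#   path(x, y) :- edge(x, y).
#   path(x, y) :- path(x, z), edge(z, y).
# evaluated with semi-naive fixpoint iteration, the kind of recursion that
# plain relational algebra lacks.

edge = {(1, 2), (2, 3), (3, 4)}

path = set(edge)   # facts from the non-recursive rule
delta = set(edge)  # facts newly derived in the previous round
while delta:
    # Join only the newly derived facts against edge.
    new = {(x, w) for (x, y) in delta for (z, w) in edge if y == z}
    delta = new - path  # only genuinely new facts feed the next round
    path |= delta

print(sorted(path))  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The semi-naive trick of joining only the delta facts from the previous round against the base relation is what keeps the fixpoint from redoing all the work in every iteration.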
                                         
                                         But it's a very limited part of first-order logic. And the logic-based sort
                                         
                                         of AI community, the symbolic AI community, has done a lot of work on finding essentially
                                         
algorithms that can do deductions. So let me give you an example. Actually, this is an example I

wrote a blog post on that will appear in SIGMOD Record, but, for some reason,

they said it would appear six months from now.
                                         
But here's an example that you'll find if you read a book

on those sort of core AI topics of knowledge representation

and sort of applications of first-order logic

to essentially deductions and whatnot.
                                         
                                         So one example that they will give is:
                                         
                                         think of a database, a table of essentially items,
                                         
                                         some elements, maybe some balls, and they have colors.
                                         
                                         So maybe the colors are red or black.
                                         
                                         And then there are some nulls.
                                         
                                         There are some balls whose colors are not known,
                                         
                                         but there's a constraint that says that every ball has to be red or black, which you can express in first-order logic,
                                         
                                         and which you can express in some logic-based systems, and in database systems as well, I think.
                                         
                                         Right.
                                         
                                         So far, this all just fits relational algebra.
                                         
    
                                         And my question is: what is the maximum number of balls of any one color?
                                         
                                         Out of all the color buckets that you could have, what's the maximum number of reds or blacks?
                                         
                                         And suppose, again, there are three records.
                                         
                                         One is red, one is black, one is null.
                                         
                                         But there is a constraint that colors are either red or black.
                                         
                                         Now, the answer is two, but no relational system can compute this, right? Because
                                         
                                         no relational system is going to do that; you need some other algorithms to
                                         
                                         do this deduction, to use the constraint and be smart. You need
                                         
                                         to do a little bit of a proof there, right? You need to say that, okay, there are two cases: either
                                         
                                         this null is red or this null is black. Let's do the query. When it's red, you get number two,
                                         
                                         because now there are two reds.
                                         
                                         Now, if it's black, you get two as well.
                                         
                                         So the maximum has to be two.
                                         
                                         So that's the correct answer.
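[Editor's note: Semih's case analysis is exactly a "possible worlds" argument, and it can be sketched in a few lines of Python. This is an illustrative sketch, using red and black as the two allowed colors for concreteness: enumerate each way the null could be filled in under the constraint, run the query in each resulting world, and check that all worlds agree on the answer.]

```python
# Possible-worlds check of the balls example (illustrative sketch).
from itertools import product

balls = ["red", "black", None]   # None = a ball whose color is unknown (a null)
DOMAIN = ("red", "black")        # constraint: every ball is red or black

answers = set()
for fill in product(DOMAIN, repeat=balls.count(None)):
    colors = iter(fill)
    world = [c if c is not None else next(colors) for c in balls]
    # the query: maximum number of balls sharing one color
    answers.add(max(world.count(c) for c in DOMAIN))

print(answers)  # → {2}: every possible world agrees, so 2 is the certain answer
```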
                                         
    
                                         And obviously, humans, we can answer this very quickly.
                                         
                                         But no information system right now,
                                         
                                         no database system will do this,
                                         
                                         because we're really restricted
                                         
                                         to a very small part of first-order logic.
                                         
                                         I think a good thing to think about nowadays is, well, first of all,
                                         
                                         I don't even know how well motivated it is by these current AI applications,
                                         
                                         but it's an ideal.
    
                                         I think that's important to remember as people think about reasoning,
                                         
                                         as people think about sort of what can we do
                                         
                                         to extend AI applications with some logic-based deductions, because people are talking about
                                         
                                         neuro-symbolic AI. And I don't fully understand, again, the applications to see if this sort of
                                         
                                         gut feeling that I have that this might be a good topic to look at is important. But at least as an
                                         
                                         ideal, it has existed for over 50 years, maybe,
                                         
                                         of how can we build more intelligent
                                         
                                         database management systems
                                         
    
                                         by extending the part of first-order logic
                                         
                                         that we capture.
                                         
                                         So, you know, if somebody does the work of seeing
                                         
                                         if there are applications of this,
                                         
                                         that more deductions could be good
                                         
                                         for certain applications,
                                         
                                         I think it's a good time to start.
                                         
                                         And maybe 10 years from now, those systems might be really useful
                                         
    
                                         because this logic AI, logic community has done a lot of work
                                         
                                         to find a lot of algorithms.
                                         
                                         And it's the right time to think about how many of those
                                         
                                         can actually be put back into a database management system
                                         
                                         and made practical.
                                         
                                         So that's an area that I think is important.
                                         
                                         Then there are some classic areas, like nowadays heterogeneous computing,
                                         
                                         GPUs, DPUs: how do we bring those kinds of hardware into database techniques?
                                         
    
                                         Those are clearly very timely things to do, I think, and to write a PhD thesis on too.
                                         
                                         So, yeah, those two immediately come to my mind. If I were to be a PhD
                                         
                                         student now, I'd explore those types of things. And, you know, as usual, as every decade,
                                         
                                         it's very exciting. There are tons of things to work on.
                                         
                                         Yeah, well, there you go, listener, you've been told: those are the things you need to go work on.
                                         
                                         I see. Yeah, maybe. It's at least good to explore, I think.
                                         
                                         Yeah, definitely really interesting directions. Cool.
                                         
                                         So it's time for the last word now.
                                         
                                         What's the one thing you want the listener to take away
                                         
                                         from your work on Kuzu and from this podcast today?
                                         
                                         So, I'd like them to, first of all, understand that
                                         
                                         Kuzu is being developed seriously,
                                         
                                         and I'd like to see more people use Kuzu.
                                         
                                         We've got our sort of first set of users who are trying and playing around with the system, which is exciting to see.
                                         
                                         But it is a system that aims to be highly scalable and sort of fast on graph queries.
                                         
                                         These, you know, join-heavy queries, when you need to find connections between nodes.
                                         
                                         And, you know, it is adopting the usability style of DuckDB.
                                         
                                         So you can think of it in one sentence as DuckDB for graphs.
                                         
                                         So that's one thing to take away.
                                         
    
                                         And, you know, that's about it.
                                         
                                         Did you want me to comment about something else?
                                         
                                         That's if you wanted to.
                                         
                                         And the other thing is I do think, you know,
                                         
                                         educating yourself a bit more on other data models is going to help you.
                                         
                                         You know, knowing more about RDF,
                                         
                                         knowing more about property graph database management in general is healthy.
                                         
                                         And I think for anybody who's teaching courses on database management systems, I think I do this
                                         
    
                                         all the time: spend a couple of lectures on these things. I think we need to educate people more
                                         
                                         on these alternative data models, and on the sort of variants of
                                         
                                         SQL like Cypher and SPARQL.
                                         
                                         And those languages are easy to pick up.
                                         
                                         And it's good to know because, you know, you want to be able to use your skills on those technologies easily.
                                         
                                         I think it's going to make people better engineers.
                                         
                                         So that's another thing that I want to advocate very, very strongly for.
                                         
                                         Fantastic. Well, that's a great thing to end it on, so
                                         
                                         we'll end it there. Thanks so much, Semih, it's been an absolute pleasure talking to you. Thank you so
                                         
                                         much for coming on the show. And if the listener is interested in knowing more about all of Semih's
                                         
                                         work and Kuzu, we'll put all the links to everything and all the blog posts in the show
                                         
                                         notes. And if you enjoy the show, please consider supporting the podcast through Buy Me a Coffee,
                                         
                                         and we will see you all next time for some more awesome computer science
                                         
                                         research.
                                         
                                         Music
                                         
                                         
