Disseminate: The Computer Science Research Podcast - Semih Salihoğlu | Kùzu Graph Database Management System | #29
Episode Date: April 3, 2023
Summary: In this episode Semih Salihoğlu tells us about Kùzu, an in-process property graph database management system built for query speed and scalability. Listen to hear the vision for Kùzu and to learn more about Kùzu's factorized query processor!
Links: Kùzu GitHub repo, CIDR paper, contact@kuzudb.com, Kùzu Slack, Kùzu Twitter, Kùzu Website (the blog posts Semih mentioned can be found there), Semih's Homepage, Semih's Twitter
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
Today, I'm delighted to say I'm joined by Semih Salihoğlu,
who will be talking about a new graph database management system, Kuzu.
Semi is an associate professor at the University of Waterloo
and is also the current holder of the David R. Cheriton Faculty Fellowship.
Welcome to the show, Semih.
Thank you very much. Thanks for having me.
Great stuff. Well, let's jump straight in then. So maybe you can tell us a little bit more about
yourself and how you became interested in research and databases. Yeah, your story so far.
I see. Good. Well, okay. So you should interrupt me because I could go on here. So I got introduced to data management back in undergrad when I took the introduction course.
And it was given by, I went to Yale as an undergrad, and it was given by Avi Silberschatz.
So Avi is very well known in the field, and he is actually the author of one of the main books.
Well, the course was overall, I thought, very boring.
Actually, we have a problem as a field to teach that course in an interesting way.
I find it very challenging to myself to engage students because a lot of people know SQL
and sort of core database technology.
But Avi had a way of explaining a lot of stories because he was very well integrated with the
industry.
So he had led the AT&T labs for a long time
and he had very close connections at Google.
So he would inspire us a lot
with these major challenges in data management.
So that's where it started.
But I did my senior thesis in networks.
I was always a bit systems oriented,
maybe a bit more theoretical bent,
but picking topics from systems.
Anyway, so I did have a three-year industry experience before starting my PhD.
And when I applied to PhD, I applied as broadly as systems, not necessarily just data management.
It was networks or data management.
And when I went there, I had a few sort of rotations with people.
And believe it or not,
I ended up in a very, very different field.
I ended up in DNA computing, nothing to do with systems,
nothing to do with data management whatsoever,
but that lasted for about a year and a half.
And towards the end of my second year,
I came back to where I felt more comfortable aligned with Jennifer Widom,
who was my PhD advisor.
And yeah, so that's how I sort of ended up doing a PhD in graph analytics,
not really data management.
It was the Hadoop MapReduce era.
And systems like Spark were coming out,
and everybody was trying to do some distributed sort of data management, data processing.
And I picked the graph vertical because I just read this
Pregel paper back in the day. We could get into that if you're interested. And then, so that was
basically my PhD. And then when I took my position at the University of Waterloo, I wanted to do
something a bit more in data management. And there was a wave around property graph data
management with Neo4j, and I wanted to get into that space. And that's what I've been doing for the last seven, eight years. The rest is history, I guess.
Yeah, the rest is history. Fantastic. So graph databases, that's what we're going to be talking about today. Maybe before we start talking about Kuzu, for the uninitiated, could you tell us a little bit more about graph databases: the background,
what they are, what they're used for, and how they differ from relational databases.
Yeah.
So I'll actually unashamedly put a plug for a blog post that I wrote
that really articulates my opinions in this field in more detail
than I'll probably be able to do right now because of time limits. It's a blog post I wrote on Kuzu's website called "What Every Competent Graph Database Management System Should Do," where I also go into the history of graph database management systems.
So, so graph database management systems, you know,
the term might mean several different things.
So nowadays a lot of the time people just think of systems like Neo4j, because the currently most popular model is what's called the property graph database model, and Neo4j really kind of pioneered and popularized this. But systems based on graph-structured data as a data modeling choice have existed even before relational systems.
So the very first database management system in history was based on a data model called the network model.
And network is just another term for a graph.
So it was essentially a way you would represent your records and you would link them to each other,
just like you link nodes through each other in the current property graph database manager.
So that's even in the early 60s, you know, before the field was emerging,
there were systems we could call graph database management systems.
And every wave has essentially seen, with different levels of success, some database management systems that adopt a graph-based model instead of what is at this point the de facto tabular model, or what you could refer to as the relational model. So that came in the 60s. The first relational systems were built in the 80s. And again, in the 80s, 90s, 2000s, there were always database management systems built on a graph model: XML, RDF, now property graph database management systems, object-oriented systems.
So what should I say about them in the current era?
So graph database management systems, they adopt a graph model
where instead of representing your records as a set of tables, you represent them as a set of nodes and connections or relationships between your set of nodes.
And in the property graph data model, you can also add some properties, key value properties on your nodes and edges.
Ultimately, the query language is a high-level query language, just like SQL, and it's very much like SQL. So Cypher is the one that Kuzu implements; it's Neo4j's language, and we try to follow openCypher. But other systems that adopt the property graph model also have high-level languages that are very much SQL-like, and that's not very surprising, because in the end the core principle of how data is managed in a database management system is always relational, in the sense that there's only one way that humans know how to develop declarative languages, and that's really to compile them down to relational operators that process sets of tuples and output sets of tuples. So at their core, just like any other database management system ever built in history, they're all, in a sense, relational.
But the data model and the query language have these graph-specific syntax or graph-specific features,
which makes it suitable for several classes of applications in particular.
So like fraud and recommendations, whenever you want to
do essentially some recursive computations, I think they're much better than relational systems
to program because they have these special syntax where you can say, you know, I want to find
connections between this record and that record, which you may have represented as nodes. And
that's very intuitive in a graph model. You're essentially looking for paths.
So paths is, for example, a very first-class citizen notion,
which doesn't exist in tables.
And in tables you need to think of join paths and whatnot, which is not very intuitive.
So there's a mismatch there.
So there's other classes, sets of features that they provide.
And I go into all of these in my blog post.
But overall, essentially, the two things to
really remember is that it's an alternative data model with a SQL-like language with alternative
syntax. And I actually think people should just know alternatives to data management beyond SQL,
just because there's never going to be a perfect-fit database management system, and every application has its more natural way of modeling data and asking queries, and different systems are going to be more suitable for that.
So, Kuzu. Give us the elevator pitch for Kuzu, and maybe tell us where the name comes from as well.
So the elevator pitch for Kuzu is that Kuzu aims to be the DuckDB for graphs.
And right now, for a lot of people, that means something very specific, because DuckDB is a very, very popular and successful system.
It is a modern graph database management system built with state-of-the-art principles of how essentially general analytics-oriented or query-oriented, I should say,
database management systems should be built.
And it follows the usability features of DuckDB.
It's embeddable just like DuckDB and aims to connect very well
with sort of a set of higher-level analytics libraries,
specifically graph analytics libraries that users might use,
aside from the database engine.
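(For readers who want to see what that embedded, DuckDB-style usage looks like in practice, here is a minimal sketch using Kuzu's Python API. The schema and data are invented for illustration, and the exact method names may differ slightly from the current API, so check the Kuzu documentation.)

import kuzu  # pip install kuzu

db = kuzu.Database("./fraud_demo")   # on-disk, in-process database, like DuckDB
conn = kuzu.Connection(db)

# Hypothetical schema: accounts and money transfers between them.
conn.execute("CREATE NODE TABLE Account(name STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE Transfer(FROM Account TO Account, amount INT64)")
conn.execute("CREATE (:Account {name: 'Jack'})")
conn.execute("CREATE (:Account {name: 'Semih'})")
conn.execute("MATCH (a:Account), (b:Account) WHERE a.name = 'Jack' AND b.name = 'Semih' "
             "CREATE (a)-[:Transfer {amount: 500}]->(b)")

# Cypher query over the embedded database.
result = conn.execute(
    "MATCH (a:Account)-[t:Transfer]->(b:Account) RETURN a.name, t.amount, b.name")
while result.has_next():
    print(result.get_next())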
So, you know, one way to think about this is that we'd like to be the default database management storage for people who would like to model their data as a set of node and relationship records. But maybe they'll do some higher-level, more advanced analytics — like maybe they'll want to build a graph machine learning model over those records — and they'll need to do lots of things before they can get there. Or, if you just need the graph database management system for your application,
we want to offer a lot of value by being a state-of-the-art system in the sense that, you know,
we've done a lot of work on graph database management systems and query processing and scalability in particular.
So all of those features are getting integrated into Kuzu.
And that's also what DuckDB does, or aims to be, for tabular analytics: it integrates state-of-the-art techniques, plus it integrates very well with these higher-level tabular analytics libraries like NumPy, etc. And we're kind of trying to replicate that.
Awesome. So let's go into the architecture of Kuzu a little bit more. So you said it tries to bring in all of these state-of-the-art components and the latest research into the system.
So maybe you can tell us about the high-level picture of it.
And maybe we'll walk through each component at a time.
Sounds good.
So the high-level sort of principles of how database management systems are implemented are very well known.
Like every system pretty much goes through similar sets of phases
when they take a text query and in the end execute the query
with some executable code.
So basically we have a parser binder, et cetera,
to compile the text query into some in-memory representation.
We have an initial join optimizer — here we're a little different from some systems — that takes the initial representation of the query and gives a logical plan that optimizes the joins that will happen in the query. And then we have an optimizer that takes that
logical plan and does a lot of standard optimization techniques like filter pushdowns,
removing of projections, removing of unnecessary properties slash columns in relational terms,
several normalizations, so on and so forth. And then that's mapped into a physical plan. And that physical plan is given to a task scheduler, which executes it. So that's the high-level journey, as it's the journey in Postgres, as it is, I'm sure, in Neo4j or DuckDB. So that high-level thing is
pretty standard. So I should maybe just start talking about the query processor because that's where we have done most work.
So Kuzu is essentially the successor of a previous system called GraphFlow DB.
So GraphFlow started in 2017.
And the motivation for GraphFlow was when I entered essentially this field
of graph database management system,
I had felt that back in my PhD, I'd made a big mistake by not understanding how the software
that I was using was, that I was spending my years on doing a PhD, was used.
So I worked on these distributed MapReduce-like systems called, Pregle was the original system
from Google.
The version that I have was called GPS, and the open source clone of Pregle was something called Apache Giraffe.
It's still out there, I think, but I doubt it has any users.
So I did my PhD on that,
but I never really sort of understood the user base
and understood how big, how impactful work in that space could be.
So in general, if you want your research to be impactful,
you need to bet that the software that you're working on,
at least in data management, at least,
we all work on some data management or some data processing software.
You need to bet that that's going to be successful,
that that's going to live.
There'll be lots of applications using it. It'll be popular to some extent. And I had not done any of that work. I just blindly went into it. And by the time I was graduating, it had become clear to me that really there were no users of Giraph or Pregel or GPS, those kinds of systems. I didn't want to
make that mistake. At the same time in 2015, 16, Mike Stonebraker won the Turing Award and had this Turing lecture
where he talked about the importance of sort of talking to users throughout his career.
And, you know, obviously he represents one style of research where he advocates that
the true validator of good and bad ideas is the market, not reviews or academics.
And so I did this user survey. I was able to convince my PhD student Siddhartha to really lead this. It was actually a lot of boring work to do if you're a systems-oriented student: you come into my group and I get you to work nine months on a survey instead of coding — it's super boring. But he went through it, thankfully. We did this user survey where we asked 90 people — it was an online survey; there were 89, I think, participants.
Plus I think we talked to maybe 15 users to really understand what their
challenges were.
And I think a lot of the people were actually coming from Neo4j or other graph database management systems, although we wanted to do the survey a bit
broader by talking to users of Giraph and NetworkX and other
sort of popular graph libraries as well. But I think many of the people were essentially
coming from Neo4j because it had the biggest user base. And we asked them, what
are your problems? And they really told us scalability is a big problem in the field.
So that's how we positioned our research back in the day around 2017.
Let's work on scalability, how to make essentially graph querying more scalable,
graph storage more scalable.
Amine really led the project. Amine Mhedhbi is one of my PhD students who's graduating this year. He was the main, essentially, researcher on the system. And so a lot of those ideas actually are
now getting into Kuzu. And a lot of that work was either in storage or query processing. So maybe in terms of those two, we have integrated most of the state-of-the-art techniques that we know. So what are those?
So in the query processor, Kuzu is a factorized query processor,
which means that the intermediate results are represented not as flat tuples,
but as unions of Cartesian products.
And I can get into this.
It implements some of these novel techniques — join techniques for many-to-many joins, for growing joins, which is a core problem in graph querying. And it then integrates
these very well-known and adopted, very well-understood techniques, like it's vectorized
in the sense that columnar DBMSs, relational systems, use the term. So it means that instead of passing a single tuple between operators in your query plans, we pass around 2048 tuples at a time. And that's the term vectorization
in the context of query processing.
So it's a vectorized query processor
and it does morsel-driven parallelism, which is a technique that came, I think, maybe introduced in HyPer, I'm not sure, but it was a Thomas Neumann paper. And DuckDB adopted it too. It's a way of parallelizing
a single query plan
instead of parallelizing multiple queries running
in parallel.
Basically, it's a very simple idea. It just means
that if there is a pipeline,
suppose you need to scan and
filter something, right?
So let's take something very simple.
You need to find all your nodes that have an age greater than 50,
and there's maybe, you know, a million nodes.
Morselization just means that you partition the amount of data from disk that needs to be scanned,
and different threads are competing to get different morsels,
essentially partitions of things, and they run different pipelines,
and they coordinate that where the pipelines end.
So that's the way we parallelize queries.
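(A rough illustration of morsel-driven parallelism as described above — a toy scan-and-filter pipeline where threads compete for morsels. The names, the morsel size, and the in-memory list standing in for disk pages are all made up for illustration.)

import threading

NUM_THREADS = 4
MORSEL_SIZE = 2048  # roughly one "vector" of tuples per morsel

ages = list(range(1_000_000))          # stand-in for the scanned "age" column on disk
morsels = [(i, min(i + MORSEL_SIZE, len(ages)))
           for i in range(0, len(ages), MORSEL_SIZE)]
next_morsel = 0
lock = threading.Lock()
results = [[] for _ in range(NUM_THREADS)]

def worker(tid):
    global next_morsel
    while True:
        # Threads compete to claim the next unprocessed morsel.
        with lock:
            if next_morsel >= len(morsels):
                return
            start, end = morsels[next_morsel]
            next_morsel += 1
        # Each thread runs the same scan -> filter pipeline on its morsel.
        results[tid].extend(a for a in ages[start:end] if a > 50)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(len(r) for r in results))    # number of nodes with age > 50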
So those are essentially maybe the core sort of ingredients to the query processor.
Storage is a columnar design, but it's graph optimized.
So why is it columnar? The motivation for this,
and this is essentially our position as a research group,
from having talked to a lot of people
who use graph database management systems,
a lot of the applications that graph database management systems
are very good at supporting,
where they add value,
is in essentially query heavy workloads.
So you're looking for something very complex that requires essentially maybe finding a
path between one node and another node, or even like just going like two, three degrees
of neighborhoods around nodes to see what they're connected to.
Again, like classic examples are fraud.
You know, is this a bank account?
Suppose you get a financial transaction network
you want to represent as a set of nodes and edges.
There are account nodes.
There are transfers between accounts.
And you just want to know,
is this node connected to the suspicious node
from take your country?
The challenge is that the graphs by their nature,
I mean, a lot of one core property of the data
that applications tend to use graph database for is that nodes are going to connect to many other nodes.
So there's going to be many, many relationships.
So if one of your accounts is transferring on average to 100 different accounts and they transfer to 100 different accounts,
the amount of essentially part of the database that you need to explore and search just grows for each hop that you do,
for each join that you do. And so that ends up requiring scanning and sort of processing large
chunks of the database. So then that says these systems need to be read-optimized, query-optimized, versus systems that might just want to be very good at sort of updates — you know, you're taking lots of transactions,
you want to update them.
They're not really transactional.
I mean, as any database management system,
they need to support transactions
because people will want atomicity and durability.
They don't want to lose their data,
but they're not supposed to offer a lot of value in that space.
I think that's a mismatch.
So as a design choice,
many of them try to optimize for this query speed.
And so our storage is columnar because that's the core.
Well, I mean, that's a very well understood sort of principle that columnar systems are much more scan friendly.
They give you opportunities to scan fewer parts of the data, if the queries ask for fewer parts of the data, than row-oriented systems. Which means that every property that we have on nodes, for example an age property, gets its own sort of set of disk blocks or file, if you will. So, it's columnar
as a core principle. And in terms of storing the edges and edge properties,
relationship properties, it's still a columnar system. We adopt a disk-based version of the
well-known CSR format. That's a very common format to represent sort of graphs both in memory and we
have a disk-based version of it.
So that's also columnar. I mean, it's actually called compressed sparse row format.
So it's a columnar system.
You can really think of CSR as a compression
of two sorted columns.
I won't get into it, but it's another columnar format.
And we have one choice that we made,
which sometimes I question actually,
but we made it because that's what we've done in GraphFlow too,
is that we double indexed the edge properties.
So which means that if you have a connection,
suppose account A has sent money to account B,
and there's a property on that edge,
maybe the amount of money that was sent is 500,
that we stored that 500 twice, one for one of the nodes,
the source, if you will, one for the other node.
So that if, and that's a choice that we made.
And I think it's especially good in a disk-based system.
I mean, it increases your storage, but by a constant factor of two.
And you can't, I mean, you can't do less than that.
And if you need to essentially read, scan the incoming properties of a specific node,
and in general, you should always optimize for these very simple queries, because they're going
to be always most common. Just like in any phenomenon in the world, there's going to be a
skew in terms of the complexity of the query and how frequently you're going to get it in an actual
system. So for the very simple queries, pinpoint queries where you say,
give me this node and all edges around it.
And if you need to read incoming edges,
then you want to locate them.
Essentially, this gives us a way
that they're all stored in the same page.
So that's why we double index it.
We don't double index node properties.
So node properties, you can,
but you'd have to replicate hundreds of times
if a node is being touched by hundreds of other nodes.
So that's another thing.
So we've got CSR format.
We've got columnar format for node properties.
CSR format for relationships and relationship properties.
We've got double indexing there.
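(A toy illustration of the layout being described: a disk-based CSR is essentially an offsets column plus a neighbours column, and keeping one such structure per direction means an edge property like the transfer amount is stored twice — once under the source node's forward CSR and once under the destination node's backward CSR. The numbers and names below are made up.)

# Three accounts 0, 1, 2 and two transfers: 0 -(500)-> 1 and 0 -(75)-> 2.

# Forward CSR: for each source node, where its outgoing edges start.
fwd_offsets   = [0, 2, 2, 2]      # node 0 owns neighbour slots [0, 2)
fwd_neighbors = [1, 2]            # destinations of node 0's edges
fwd_amounts   = [500, 75]         # edge property stored alongside, in the same order

# Backward CSR: the same edges indexed by destination, so the amounts appear again.
bwd_offsets   = [0, 0, 1, 2]
bwd_neighbors = [0, 0]
bwd_amounts   = [500, 75]

def outgoing(node):
    s, e = fwd_offsets[node], fwd_offsets[node + 1]
    return list(zip(fwd_neighbors[s:e], fwd_amounts[s:e]))

def incoming(node):
    s, e = bwd_offsets[node], bwd_offsets[node + 1]
    return list(zip(bwd_neighbors[s:e], bwd_amounts[s:e]))

print(outgoing(0))  # [(1, 500), (2, 75)]
print(incoming(1))  # [(0, 500)] -- one localized read, no scan of the whole relationship table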
Just on that for a second.
Why do you sometimes regret that decision?
Well, because it's – why do I regret this?
So there is alternative ways. I mean, I could get a bit technical, but there is alternative ways that we could have reduced the storage, I think.
I don't want to say I regret it. I sometimes question it, because we didn't very properly study how fast we could have been if we had chosen the alternative method. The alternative method is what a relational system would do. You'd have a relationship table, no particular order there, and all of the edges are stored there and all the properties are stored once. But you'd have an index.
You need to have for which node, what are its edges.
Okay.
So that's not storing the property though.
That's just storing the IDs of the neighbors.
And maybe using that, you could just store the IDs of the edges as well.
And then do a full scan with some smart way of filtering so you don't scan
the whole tables. So that's an alternative way that you could store only once, edge properties
only once. And you could still maybe do fine. Actually in GraphFlow, which was an in-memory
system, we didn't double index. We actually did a single index, but it was in memory. So, you know,
we'd have some random reads,
but because it's in memory, we didn't care too much.
We never really tested that.
So I was a bit more conservative when we made the move — not just me; my student who implemented GraphFlow's storage back in the day also implemented the initial version of Kuzu's storage.
We had this intuition that, you know
what, let's just double index, let's forget about
this problem. But we never
properly studied this.
That's why I say, I sometimes
still question it, that could a system,
could we just store it once and
still be okay?
We never did that study.
One for an interesting research paper in the future, I guess.
Yeah, maybe.
So that's storage.
So let me just tell you just very quickly about other things. So the Join Optimizer is a core sort of optimizer in every system where you spend a lot of time.
It's a dynamic programming-based optimizer.
And Thomas Neumann really has done the most work in that space.
So we try to get that.
We have a buffer manager that's based on another paper, right now a University of Munich-based paper. vmcache is the paper. So that area is led by Viktor Leis. He was at a different university and moved to the University of Munich, where Thomas Neumann is.
By the way,
for any of the listeners,
Thomas Neumann is like the god
in systems,
in databases.
That's why his name is coming up a lot.
I mean, not the only god — one of the gods.
And we do look at his work
and Victor is also at the University of Munich
and he's done most work
and the best of the work,
I think, in buffer management.
So we read that paper and try to integrate it into Kuzu.
So we have a buffer manager based on VM cache.
Let me not get into it.
But, you know, people can easily find that.
And we have a transaction manager. I think if one day a historian went and asked what was the simplest and correct transaction manager — I'm very confident it's correct, but it's also very simple. It's serializable kind of by design.
I'm not super proud of it,
but we had to add these features in the simplest way possible.
I'm not proud of the way we implemented it, but I'm proud of the way of how simple we were able
to implement it. It's based on write-ahead logging. So, you know, writes go into different
sort of pages first and they're logged. And during checkpointing, they're sort of copied over.
And by design, you can only have a single writer in the system.
So it's kind of optimized for bulk writes instead of sort of many concurrent writes. Reads and the single write transaction can work concurrently. So it is serializable in that sense.
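(A very rough sketch of that general shape of design — writes buffered in shadow pages and logged ahead of time, copied into the main database file at checkpoint, with only one writer allowed at a time. This is just the idea as described, with invented names, not Kuzu's actual implementation.)

import threading

class ToyTransactionManager:
    def __init__(self, num_pages, page_size=4096):
        self.main_pages = [bytes(page_size)] * num_pages  # the checkpointed database file
        self.shadow_pages = {}                            # page_id -> pending new contents
        self.wal = []                                     # append-only write-ahead log
        self.writer_lock = threading.Lock()               # only one write transaction at a time

    def begin_write(self):
        self.writer_lock.acquire()                        # single writer by design

    def write_page(self, page_id, data):
        self.wal.append((page_id, data))                  # log first (write-ahead)
        self.shadow_pages[page_id] = data                 # main file stays untouched

    def read_page(self, page_id, in_write_txn=False):
        # Concurrent read transactions see the last checkpointed state;
        # only the write transaction sees its own shadow pages.
        if in_write_txn and page_id in self.shadow_pages:
            return self.shadow_pages[page_id]
        return self.main_pages[page_id]

    def commit_and_checkpoint(self):
        for page_id, data in self.shadow_pages.items():   # copy shadow pages into the main file
            self.main_pages[page_id] = data
        self.shadow_pages.clear()
        self.wal.clear()
        self.writer_lock.release()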
So maybe one day we're going to make that more optimal if there's any demand. But again, these are choices that we made: our engineering efforts are based on optimizing reads and scalability instead of transactions. So transactions, I mean, in one sense it's a bit naive and simple. It's correct, I think, and relatively well tested, but we also made a lot of simplifying assumptions, like that you only allow a single writer in the system, and things like that.
Yeah, cool. So let's dig into the factorized query processor a little bit more, because this was the focus of the CIDR paper, right? So we've been through what factorization is, but maybe you can touch on the design principles that you used when designing these new join operators, and just tell us more about the ones that you've got in Kuzu.
Yeah. So factorization is, again, a compression technique, but it's not a compression technique you use to essentially make your disk files smaller, which is how compression is usually used in systems — you know, you've got a file that stores maybe a property or a column of a table, and you want to make it small because there's lots of repetitions and whatnot.
Factorization is interesting because it's actually a technique
that's designed specifically for query processing
for when the queries contain many-to-many joins,
meaning that they can grow.
So the core idea is very simple.
So think of like the simplest star query.
You've got a middle node B,
you've got one outgoing edge to a node A,
you've got another outgoing edge to C,
and that's your query pattern.
So this is essentially what you would write in the match clause.
And you want to find essentially all incoming
and outgoing neighbors of a particular node B.
And suppose you get a filter on B.
And suppose that filter says, you know, Jack's essentially outgoing,
essentially transactions and incoming transactions.
Give me Jack's neighborhood, basically.
So that's the simplest essentially star query you could get.
Now, if you've got essentially 10,000 outgoing neighbors and 10,000 incoming neighbors, and you represent this output in flat format, it's 10,000 times 10,000.
So, you know, I think it's 100 million essentially tuples would be represented there.
But the neighborhood is actually much smaller. There's only 20,000 edges there. So factorization is a way of representing this output as a Cartesian product instead. So it's Jack's node and its data, Cartesian product with the 10,000 outgoing neighbors, Cartesian product with the 10,000 incoming neighbors.
So that Cartesian product is another way to represent
exactly the same 100 million tuples.
And it's essentially, you know, that's the way factorization works, that whenever possible, we try to essentially represent tuples, sets of tuples as Cartesian products in a much more compressed way and pass that between operators.
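(A back-of-the-envelope version of that example in code, with made-up names, just to show the arithmetic: the same star-query result around Jack, flat versus factorized.)

from itertools import product

# Hypothetical star query around Jack, e.g. in Cypher:
#   MATCH (a)-[:Transfer]->(b)-[:Transfer]->(c) WHERE b.name = 'Jack'
incoming = [f"in_{i}" for i in range(10_000)]    # 10,000 incoming neighbours of Jack
outgoing = [f"out_{i}" for i in range(10_000)]   # 10,000 outgoing neighbours of Jack

# Flat representation: one tuple per (a, b, c) combination.
flat_count = len(incoming) * 1 * len(outgoing)           # 100,000,000 tuples

# Factorized representation: the same result as a Cartesian product of three
# small sets; only about 20,001 values are actually stored and passed around.
factorized = (incoming, ["Jack"], outgoing)
stored_values = sum(len(part) for part in factorized)    # 20,001

# Enumerating the product on demand recovers exactly the same 100 million tuples.
first_tuple = next(product(*factorized))                 # ('in_0', 'Jack', 'out_0')
print(flat_count, stored_values, first_tuple)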
So you asked about the design goals.
So one thing was just be able to implement factorization.
And we had a particular way of implementing this, based on my PhD student Amine's work in GraphFlow, that is really directly implemented in Kuzu.
So we knew we wanted to essentially represent
intermediate relations, not a set of flat tuples,
like maybe DuckDB does,
or pretty much any of the other systems do,
but in a factorized way.
We also wanted to make sure we don't use index nested loop joins. So this is the type of
operators that we in GraphFlow had or Neo4j has, for example, an expand node, which is a join
operator that takes a node record and expands it to its neighbors, for example. So if you use that in an in-memory system, maybe that's fine.
But in a disk system, if that's the type of joins that you do,
that is that there's an expand operator that takes a set of tuples,
and for each one, it expands it.
So that's what in database literature is the way index-nested loop join operators would work.
So from one side, you read the tuples.
From the other side, you go and scan and perform the join one tuple at a time, so for each join. So that's actually
not going to be robust. And I don't say that's always bad. Actually, in the paper, there's
many cases where that actually works the best in very, very, very basic cases, like, you know,
you want one node and you want its neighbors,
expand type of join operators do well,
but it's not robust when you are getting, you know,
thousands, tens of thousands, hundreds of thousands,
tuples and you need to do that.
It's just degrades very quickly because it leads to, you know, again, you can't really localize neighborhoods.
So it degrades to random IOs.
So we didn't want to do that.
So we want everything to be based on hash joins,
which is the common kind of wisdom nowadays for doing joins,
at least equality joins in systems.
And we also wanted to make sure we don't,
when possible, behave like index-nested loop joins. Index-nested loop join has this one advantage: if all I want is Jack's outgoing neighbors,
you know, an expand operator that just goes
and takes the, you know, maybe it was preceded
by a scan operator that gave Jack's node ID.
And all it does is that, you know,
just looks up in an index the IDs of the neighbors
and then does the minimum amount of IO
and maybe read only the properties
of those nodes or edges.
So that's actually, you know, good because if you got, you know, a 10 gigabyte file,
you might just end up reading, I don't know, a few kilobytes of things, even though it's random IO,
you're just reading very few parts of the database. So in those cases, index-nested loop join has an
advantage. Again, I don't think it's robust, but we also wanted to behave like that when the queries were very selective, when we needed to join a single node record or a few node records.
So those are the three goals.
Factorization, make it hash join based, but also try to avoid scanning essentially files.
And the reason this is a problem is that hash join always reads entire files.
So by default, there's a build phase and there's a probe phase.
There's no mechanism to say only read me the neighbors of this one particular.
So those were the three design goals.
And we have a way of implementing this in the system.
Awesome.
With these specialized join operators, yeah.
Cool, yeah.
So maybe you can tell us a little bit more about these specialized join operators. There's three in the paper; maybe you can just give us the intuition behind how they work, essentially.
Yeah. So let me just tell you maybe the core idea of how this would work in a case where you are maybe just reading Jack's neighbors, where we would still be hash-join based but essentially achieve all three goals. It will be difficult to see how we get factorization, because the query is too simple — we could expand it later to see how factorization is achieved. But suppose all we have to do is read Jack's neighbors.
neighbors. So what will happen is that just like in any hash join,
there's a build phase, which creates a hash table. And then there's going to be a probe phase,
which essentially scans the left side of the hash join operators, and then does probes into this
hash table that was built. So we'll do something along the lines of you're going to read Jack's record and we will figure out Jack's node ID. So maybe it's like seven. Okay. And then we're going
to have a hash table that puts seven and Jack there. Okay. Suppose you read only the name of
Jack. Maybe that's the primary key, but the internal node ID is seven. Now all we need to know in the hash join — the hash join is going to join that
record with all of its neighbors. All we need at that point, once we create the
hash table, we know that all that we need to scan are edges whose sources are seven.
So we pass that — we do sideways, it's called sideways information passing. So we pass that information as a mask of node IDs to the left side, the probe side. And that probe side is going to essentially use that mask to know which parts of those CSRs to read. So it will only read node seven's outgoing neighbors and avoid reading the entire file. We have this mechanism so that only node seven's outgoing neighbors will be read, and then they will be probed and joined with seven.
So that's essentially the core idea
of how we avoid scans is that we pass these node
masks to know which parts of these either relationship tables or relationship sort of
indices should be read or some node properties should be read.
So that's the core principle in these different operators.
Now, that doesn't explain how we get factorization. The way we get factorization is that in CSR the edges are localized. So suppose you had a hundred outgoing neighbors. We don't pass them one by one; we pass them as a sequence of a hundred, along with seven. So we just represent this as seven times those 100 — that's the tuple that's being sent — and we're going to probe on seven, so that seven and the hundred outgoing neighbors will be joined with a single probe, not a hundred probes. And with that single probe we're going to read the Jack property, which was in the hash table. So what's being passed is essentially always these Cartesian products,
unless some operator needs to essentially flatten
and essentially break those Cartesian products.
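(A toy version of that flow, with invented numbers: build a hash table on Jack's node ID, pass a node-ID mask sideways to the probe side so that only Jack's CSR slice is scanned, and probe once per factorized chunk rather than once per neighbour. This is only an illustration of the idea, not Kuzu's actual operators.)

# Toy forward CSR: offsets + neighbour IDs per source node (8 nodes, IDs 0..7).
fwd_offsets   = [0, 3, 3, 3, 3, 3, 3, 3, 103]      # node 7 ("Jack") owns slots [3, 103)
fwd_neighbors = [1, 2, 3] + list(range(100))       # 3 edges for node 0, 100 edges for node 7

# Build side: scan the filtered node and hash it on its internal node ID.
jack_id = 7
hash_table = {jack_id: {"name": "Jack"}}

# Sideways information passing: the build side hands the probe side a mask of
# node IDs, so the scan only touches those nodes' CSR slices, not the whole file.
semi_mask = set(hash_table.keys())

def masked_scan(mask):
    for node_id in sorted(mask):
        start, end = fwd_offsets[node_id], fwd_offsets[node_id + 1]
        # One factorized chunk: (node_id, [all its neighbours]) rather than 100 flat tuples.
        yield node_id, fwd_neighbors[start:end]

# Probe side: one probe per factorized chunk, not one probe per neighbour.
for node_id, neighbors in masked_scan(semi_mask):
    build_payload = hash_table[node_id]              # a single probe joins Jack with 100 neighbours
    result = (build_payload["name"], neighbors)      # still factorized: "Jack" x its neighbours
    print(result[0], len(result[1]))                 # Jack 100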
Okay, nice.
So I guess the kind of question that falls out of it is: how much better are these join operators on a graph workload?
So maybe you could tell us a few kind of numbers,
some of the experiments you ran.
So we did several experiments,
and I'm not a big fan of system-to-system comparisons
because it's very difficult to get these, quote-unquote,
competitor systems in their ideal performing sort of configurations.
Yeah.
So we do try to compare, essentially, what an index-nested loop join configuration of Kuzu would look like against this, you know, node-mask-based Kuzu.
So, you know, we're isolating — we're just trying to study the pros and cons of this passing of masks.
Yeah.
Which is an overhead, basically. You know, that's not something that index-nested loop join or hash join does, versus the benefits we get from essentially getting rid of index-nested loop joins in the system. So one meaningful comparison is Kuzu's current version versus Kuzu index-nested loop join,
which is essentially where we added an index-nested loop join operator to the system,
where we do the joins using these expand-like operators.
And the general message of those experiments is that when nodes are very selective,
this overhead obviously is not worth it.
Its expand type of operators are going to be better performing
at those very high extreme selectivity cases.
A couple nodes are being passed, but it degrades
very quickly compared to these node-mask-based operators. So this was tested both on some large benchmarks like LDBC plus some micro-benchmarks, which are actually my favorite experiments in systems papers, because you can really focus on a particular behavior of the system.
And then, you know,
when the selectivity is very high,
there's also an overhead, right?
So when the selectivity is very high,
you're not really getting much benefit
from those passing those masks
because those node masks are not selective.
They're not helping you avoid scans.
So at those two extremes, you know,
there is an overhead and you could do better,
but overall it doesn't degrade the way index-nested loop join operators degrade, or it's not
way off like DuckDB is, for example, which doesn't have a node mask, just does hash joins
by forcing the system to scan all these large files. So, you know, so in comparison,
it's kind of like a trade-off that you do, you know.
So at those extreme cases,
there are systems that would outperform this type of processing
or there are techniques.
But overall, it's like, I think,
a reasonable trade-off to make that your system
and your core query processing is more robust.
So that's the type of evaluation that we provided in the paper.
Yeah, I reckon the listeners should go and check it out as well. But I guess on that, I guess it's possible to support both types of operators in the system, right, and have some sort of option that lets you choose a specific one based on the selectivity or whatever, right? So, I mean, you can get the best of both worlds.
I agree. I agree. It's doable.
Actually, there's two first authors of the paper, Xiyang and Guodong; they contributed the most to the paper, so they're listed as first co-authors. They have been sort of arguing about when we should avoid semi-masks — that's what we call these node masks that I mentioned — you know, when should we just avoid this overhead completely?
Because if we kind of, during runtime, see that these node masks are not being selective, we might maybe avoid it and do something adaptive.
Or, or if you're smart enough,
your optimizer is smart enough to predict that node mask will not be selective,
you can just remove it
and replace some of those join operators
with indexness loop join.
I agree.
It's a question of like,
it's always an engineering question of like,
do you want to invest that time?
Or are you happy with how your sort of query processor
is performing even with those overheads?
So it's a matter of that.
But I do agree that, you know,
and our experiments make it visible that there is performance that we leave on the table, and that our trade-off is explicit there.
Awesome stuff.
Yeah, I'm going to kind of roll the next two questions into one
because I always ask kind of what are the limitations of whoever I'm interviewing,
whatever they're working on. But obviously, Kuzu is a very young system. So I mean, it's not feature complete, right? There's a lot of road to run still. So I kind
of want to ask you more like, where do you go next with Kuzu? Yeah, so I have several sort of
theses that are being written on Kuzu. One thing that we don't do, and that we haven't invested much time in — in GraphFlow, too, there was no work on this; I didn't have a student to take this topic — is recursive queries, which is essentially an area that graph databases need to shine in. Certainly their query languages are much better for recursive queries, because recursion is kind of a first-class citizen, historically too: even in those systems in the 60s, and in the 80s, 90s, 2000s, in all those graph-based systems like RDF and object-oriented systems, even the 60s systems like IDS, applications that wanted to do recursive querying were always among the killer apps. So there is a thesis being written on that.
So we are essentially trying to implement some new query processing techniques
for like shortest paths or for variable length joins and things like that.
A second thing that I'm very interested in,
and I have a student who's going to start working on this and we have some
ideas, is how to make property graph database management systems
more competent in RDF style graph data.
So that's another sort of popular data model
that people broadly label as graph-based.
And I do think, you know, the property graph model,
so this is like a position that I try to articulate.
So I think out of all the graph-based systems,
graph-based models, property graph model
is the most generic one in which you could build
the most sort of applications.
It's more generally applicable,
just like relational model as a model
is very generically applicable. RDF, I don't think, is — you know, RDF is for a very specific application domain. I'm not saying it's a small application domain, but as a model it's really suitable for a specific sort of application domain. But property graphs are not; property graphs, just like the relational model, are quite generic — you could represent lots of application data. And actually, this is one of the other things that we mentioned in our survey paper back in 2017, when we asked people what kind of applications they build. So the idea is to bring into property graph systems some of that RDF-style knowledge management, and some techniques to maybe do some OWL-based reasoning and those things.
So I don't think vice versa is possible.
So I don't think you are going to manage your transactions with RDF systems or like financial transactions and frauds.
I don't think that's a very good fit as a model,
but I think property graph as a model is a good fit.
So there will be a thesis written on how to essentially make property graph-based systems — and specifically this will be done obviously on Kuzu — more competent at RDF-style knowledge management, with techniques to make the system more efficient there, and maybe some level of reasoning. Very related to that is a topic that's
going to sound very boring to maybe to people, but I still don't know what's the right way to
manage strings. We don't know this. There's actually no authoritative paper in the field on this. There's some state-of-the-art compression techniques, but there's no authoritative paper that gives advice: if you assume this thing about your data set,
here's maybe the properties of the strings, their length, et cetera. Here's the right way to
sort of store it on disk. Here's the right way to store, like manage the overflows maybe of those strings
or because that's a variable length data type.
And here's the right way to represent it in memory.
So we do these things where we kind of
find and blindly copy from Umbra or DuckDB,
these things, but, you know,
I don't think it's very well understood.
I had this, have this dream of finding a student
to actually just do this investigation
for about two years to sort of do this experimental study of like, here is all the techniques that's
in the field of how to store things in memory, how strings in memory, how to manage them in memory,
how to process them in memory, how to store them on disk. I don't know if they'll get a chance to
do that, but that's something I hope I can find a student to do on Kuzu as well.
Yeah, it's one of those things on the face of it,
it sounds quite a hard sell,
but I imagine that when you get into it,
it's a fascinating, really interesting project
once you've kind of got going with it.
So hopefully a listener listening to this now thinks,
okay, I'm going to go work on strings for three years.
Yeah, maybe.
Yeah, but I mean, so Hannes, who is the CEO of DuckDB Labs and one of the main engineers there, also mentioned this actually in his keynote at CIDR this year, where the Kuzu paper got published. He said he's very surprised that so few people work on strings, despite the fact that the majority of the data in databases is strings, and that there needs to be one authoritative paper, I think, that makes this advice: this is how you should implement strings if these are your assumptions.
And I want to see that chart of like whatever those assumptions are, you know, maybe some workload assumptions, maybe some, you know, data set assumptions.
And then says, here's a good thing that we found that's robust and easy to implement and whatnot.
Yeah, that would be a really good contribution. I guess, yeah, so the next sort of set of questions to have,
I tend to sort of ask them to everyone with various different flavors.
But the first one is kind of how as a software engineer,
data engineer, can I kind of leverage your work?
I guess the answer to that is just kind of go and play with Kuzu, right,
and try and integrate it into your workflow.
I agree, yeah.
See if you can kind of fit it in.
So I guess the follow-up question to that
would be like kind of longer term,
what sort of impact do you think Kuzu can have?
So let me also just maybe on the previous question,
I wanted to say like,
I just kind of focused on the research
and things that are happening.
There's also like, as you said,
Kuzu is a young system.
So there's a lot of engineering that's happening to add a lot of features and
we are trying,
we're getting excited about sort of making Kuzu more usable in the graph
learning space and trying to connect more with users.
So we are adding some,
some features specific to these graph learning libraries.
So there's going to be all those integrations that are happening.
So for that question, yeah, so I think playing around with Kuzu
is probably the right message for developers
or general application developers.
I mean, if you're a developer of graph database management systems — say you're an engineer at Neo4j — I think there's some core messages
that we think some of these core
join techniques should get into these systems.
And there's a lot to benefit.
For other application developers,
I think, you know,
there's a set of application domains
where really relational model
and SQL is not going to be a great fit,
nor are the systems that implement SQL
are implementing the right sort of techniques
to be very performant.
Even the most state-of-the-art systems like Umbra are not going to do really, really well in many-to-many joins, for example, or very, very well in recursive joins.
Like recursion is something
that's always a second thought
in relational systems.
It's a first-class in graph systems. It's a first class in graph systems.
So, you know, if you've got applications that benefit from these set of features, and there's more in my blog post, I think I've listed about five, six of these application domains.
I think just being sort of knowledgeable about these alternative database management systems is going to make them better engineers, because trying to retrofit every application to the relational model is, I think, not a very good idea. So people should be educated more on graph database management systems and these alternative database management systems, their models, their query languages. It's going to make them, I think, better engineers.
Yeah, for sure. Definitely agree. Next question — I think you touched on this a little bit earlier on
when you were talking about the journey and GraphFlow and whatnot.
But I guess on this journey you've been on,
what was probably the most interesting thing that you've kind of learned
and that was maybe quite unexpected as well?
Yeah, good question.
Actually, this was the question that I thought most about.
I don't have a great answer to this.
I asked Xiyang and Guodong,
the two authors,
what were their sort of,
what was the thing that they found most sort of unexpected or interesting?
I guess like,
I think we learned a little bit more about ourselves.
So we haven't done this before.
So like I am trying to do this in Canada.
So,
you know,
by following DuckDB or systems like Umbra or HyPer — you know, this is a specific sort of research style that's quite popular in Europe. Both DuckDB and these Umbra/HyPer type of systems are being developed in academic settings in Europe. And my group is trying to replicate that in Canada with Kuzu.
So, you know,
we are doing a lot of engineering and less sort of paper writing than maybe
normally researchers in academic settings do.
And we are doing this for the first time.
So one thing that we are noticing about ourselves — it's more a psychological thing that I'm finding interesting — is how often we get it wrong, how often our intuitions are wrong about what is important and what is not. It just becomes very difficult to judge these things in the absence of a large user base. Let me give you an example. We did
start the project and spent a lot of effort trying to implement this secondary, second storage sort of structures in the system to store arbitrary key
value properties. And, you know, that just slowed us a lot. And that was a lot of code. And right
before we released back in November, we said, you know what, let's just see if anybody is actually
interested in this or are people really structuring their graphs and we're going to add this back on.
But I mean, when the project started, you know, maybe a year and a half, two years before that, nobody ever questioned whether we should have started this at all.
And the mistake that we do often, we're not saying that we're over-optimizing.
We get too sort of optimistic about, you know, this is going to be really
important, let's certainly do it, versus, and we kind of
trade that off with simplicity, and we kind of implement
more than we should, and we throw away. So that's a mistake
we consistently do. I think we're getting it better, but that's
something that we started seeing more concretely
by the amount of code that we're also throwing away from the system.
So that's something to keep in mind.
That's a very important lesson, I think,
that if we were to do this again,
many of the features would just start much smaller
and try to get it complete and optimized and evolved,
just like text and writing papers, you know — the best way to write 12 pages, as they say, is to write 24 pages and throw half out. Start simple, add more and optimize it; it's just going to evolve over time. Don't overdo it is one thing we've realized ourselves; it's a mistake we made.
But it's in the quote, right? Was it "premature optimization is the root of all evil," or something like that?
Yeah, maybe sometimes.
It's very real.
Yeah, we've done several of those for sure.
Cool.
I guess, kind of while we're talking about the journey as well: were there things along the way that you tried that failed, that you think might be interesting for people to know about? Maybe what you just described is an example of that anyway.
So I could tell you ideas we didn't take from GraphFlow
because we were able to write good papers on them,
but I don't, in the end, advise people to implement them.
So for example, we have this paper on A-plus indices.
It's a way of bringing materialized views into graph database management systems.
I mean, I think they're a fine idea, and, you know, we published this in ICDE,
which is a top conference and whatnot.
It's a good interpretation of how materialized views could be brought into a system
as these indices that support essentially subsets of your edges, which are effectively results of
some queries you may have run on the edges. And we found sort of fine applications of it,
but it's also not very practical in the sense that it's a lot of effort. And that's when we sit down
and we say, okay, we've got a budget of this much engineering we can afford.
Should we do it in A-plus indices or should we do it like...
It was like clearly one of the things we said,
no, we don't have to do this.
It's not really worth it.
So in some sense, if you want to interpret it,
it's a failure in that sense
because it doesn't offer something very practical, I think.
Or the amount of effort is something we can't afford right now.
So there are ideas like that that we brought in.
And there was a lot more work on factorization that we did that's in Amine's thesis that we
are not adopting.
Because again, I find it's good to do research
projects because without those courageous PhD students doing those research, you can't really
distill the practical ideas, I think, but I don't think we really got it right yet. So if you looked
at the theory of factorization, there is excellent ideas there. Like there's, so the form of
factorization that we do, again, this is for
interested students who might look at this, is called F representations. There's another
even more compressed versions that you could keep, called d-representations. It basically stands for factorized representations with definitions; the D comes from the definitions. And we don't have a way to essentially adopt that yet. Again, Amine has something in his thesis, but again, I am not courageous
enough to actually say, because, you know, the way Kuzu is trying to be is like to be very practical
system. So I'm very selective on what gets in and what doesn't get in. But I think it's good work
that Amine has done. But I'm saying that there needs to be more work until we actually feel confident that, okay, d-representations can now be integrated into the system.
Yeah.
So I don't know if I answered your question, but there are tons of failures.
That's in the very nature of research, I think, that a lot of the ideas we just get wrong. We publish them, and then somebody else does something better, I think.
And then hopefully some core sort of principles emerge, and we know how to do that in a more practical way.
Yeah. Out of interest, how big is the team at the minute that's working on Kuzu? How many engineers?
So there's, I think, five of us, including myself, and three master's students. Of the five, many of them are doing mainly engineering. They're in the university and whatnot, but because they got their degrees, they're staying as research associates or they're postdocs, so they're less worried about publications. And they're just very inspired by sort of trying to be like DuckDB and get a user base. So they're doing more engineering, although, you know, obviously we are going to continue writing papers. And then there's a couple of people who are new, whom I've started training recently. And they'll write some of these theses that I mentioned over the next two, three years. So, however you count it, including me, maybe seven, eight. Some are very junior and not doing any engineering, just research, and some are doing more engineering.
Yeah. Is Kuzu kind of your whole vehicle for research, or do you have other sort of non-Kuzu-related projects going on as well?
Yeah, so we recently published a paper. Actually, Chang, who is also an author on Kuzu (it was
Chang's master's thesis), was doing a project on open government data sets.
That was a research project that I'm actually very excited about as well.
This is an area I would say is led by Renée Miller;
one of the main people in this space is Renée Miller, who has a long history of research on open
government data sets.
These are the data sets like, you know,
the UK has one, the US has one, data.gov, Canada has one, opencanada.ca.
These are ways governments are publishing lots of data sets about how the
government runs, with the long-term goal of transparency and more
democratic access to how governmental institutions
are running, so that journalists can keep an eye on, I don't know, inefficiencies
and major problems in society. These data sets are not very usable, though; it's not very easy for people to find them.
It's not very easy to integrate them.
So I have a research project to essentially make these data sets more usable.
It's unfortunately gotten a bit smaller
since Chang decided he'd rather work on Kuzu.
But I still have it going on; actually, I'm currently writing a paper on that this week.
And I have a PhD student who's also graduating this year, who has been doing work on
differential dataflow and applications of differential dataflow to graph processing.
For those of you who may not know what differential dataflow is,
it is one of the technologies behind Materialize Inc.,
a popular startup in the data space.
They essentially do streaming SQL and materialize your queries,
although I might be wrong, so check what they do.
But they're basically built on technologies called timely dataflow and differential dataflow, which Frank McSherry was very influential in popularizing.
So differential dataflow is the most generic incremental view maintenance technique that I'm
aware of. It can maintain arbitrary dataflows. And one of its best applications is on recursive
dataflows and how to maintain the outputs of recursive flows.
It relates directly to graphs, because a lot of graph analytics is recursive computation.
If you want to find shortest paths, if you want to find connected components, those types of graph analytics are often represented as recursive queries.
And if you represent that recursive computation as a dataflow and give it to differential dataflow, then as your input, your graph, changes,
it's going to maintain your connected components
or maintain your shortest paths, which is amazing technology.
It's actually unique, in the sense that I don't think there is another technology
that's able to do this.
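As a rough picture of the kind of recursive computation being described, here is a plain Python fixpoint for reachability over made-up edges. This is not differential dataflow itself; it is just the recursive, Datalog-style computation that a system like differential dataflow would then maintain incrementally as edges are added or removed.

```python
# Semi-naive evaluation of reachability from a source node: the classic
# recursive (Datalog-style) computation that systems like differential
# dataflow aim to maintain incrementally when the edge set changes.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

def reachable(edges, source):
    reached = {source}
    frontier = {source}
    while frontier:
        # Only expand newly discovered nodes (the "delta"), not everything.
        frontier = {dst for (src, dst) in edges if src in frontier} - reached
        reached |= frontier
    return reached

print(reachable(edges, "a"))   # {'a', 'b', 'c', 'd'}

# A naive "update" here is full recomputation from scratch; differential
# dataflow's contribution is maintaining such results without starting over.
edges.add(("d", "e"))
print(reachable(edges, "a"))   # {'a', 'b', 'c', 'd', 'e'}
```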
So my PhD student Siddhartha has been working on building essentially graph layers over
differential dataflow. So I have that going on on the side, although again, these projects have gotten smaller
because the students are leaving, either to move on to other things or to move to Kuzu, and things like that.
Great. So I've got three more questions. The first one I really like to ask people, because it's really interesting to see the divergent answers I get for it, and I'm sure you'll have a totally new
perspective on it as well. How do you go about approaching idea generation and then
determining what to work on? I'd just love to know more about your creative process.
I see. It's a very difficult question to answer. Some of these are student-driven, obviously, at this point. And back in the day, it was a bit advisor-driven. So back in my PhD, a lot of the very productive work that I did actually came from Jeff Ullman. Jeff is a Turing Award winner;
I'm sure people have read Jeff's books.
I did a lot of work with Jeff during my PhD,
some theoretical work.
And those projects were really driven by Jeff and Foto Afrati back then;
she was a collaborator of Jeff's.
So back then it was driven by that: they would give me specific problems,
they would be very well defined, and I'd work on them.
So that was the style.
Now it's sometimes reversed, where students bring the ideas.
But often, the way I work nowadays is that I start with very, very naive, simple things.
I implement, or make sure that the most naive versions
of things get implemented.
So I'll pose a problem like, you know,
how should shortest paths be implemented,
which is an ongoing project right now?
What can we do in that space?
And the first thing that we're going to do is
implement the most basic ways of doing this.
And frankly, most of my best ideas come
when I implement those things,
or I see my students implement those things
and I look at the queries.
I find it difficult to come up with
very good systems research ideas in the abstract.
I can come up with high-level directions,
but then in the end,
I often can't predict beforehand what the papers
are going to be about.
That comes only when we implement something,
and then we see, okay, these are the challenges in the implementation.
So this is what I highly advise all of my students right now,
and it's the sort of style that I've come to adopt in the last three or four years:
it's extremely code-driven.
Everybody needs to write a lot of code.
I think that's good for training as well.
And everybody needs to work in the context of a system.
So one thing that I'm very against is publishing ideas that are not implemented in serious systems.
Because they're kind of destined to be, I think, out of touch with the reality of how things get integrated into a system,
which means they're going to be dismissed.
So everything needs to be implemented in a serious system.
Kuzu at this point is becoming serious enough that if you do research on Kuzu, you
will get to interact with a lot of components of the system. Or you need to take another serious system; for example, I'm
advising, well, not really advising, but I'm talking to a PhD student at the University of Waterloo who works on Neo4j.
And, you know, she does her work inside Neo4j, which is good.
That's the style in which, I think, research should be done.
And when you do that, I think, you get good practical ideas.
Now, that has one risk, which is that when you do that, your ideas are also more humble, because you're highly limited by what can be done, versus if you felt freer and did something on the side in your own scratch code.
So those papers are more easily rejected.
I do get a lot of rejections. My students and I, a lot of the work that we do, gets quite a bit of rejection,
unfortunately, because some of the ideas end up more like, you know, here are some core
engineering principles, and it's a bit hard to make them look
fancy from a research perspective. But that's the style in which I work, I think. So write the code and see what doesn't work.
And then after enough failures, good ideas come, I think.
And they always come, by the way.
So no need to worry about that.
Some idea always comes along that you can publish, I think.
Awesome.
Yeah, that's a great answer to add to my collection of answers for that question.
Great stuff.
Cool.
So just two more questions now. The first
one is: what do you think is the biggest challenge facing us now in database research?
Maybe specifically, we can focus this towards graph data management. And there was a recent
debate on graph databases published in The Register; I'd like to get your take on that
as well, if possible.
Actually, I didn't read those, because I find those things a bit distracting.
Some of them are more around personalities,
I think, or around certain companies,
and companies have essentially their own energies around them.
So I genuinely did not read them,
although I know the general topic that was discussed,
and I obviously have my position.
Anybody, I think, who claims graph databases
will take over relational systems
is way off in their predictions.
I don't think that will happen.
I mean, the relational market is humongous
and very well established,
and I don't think people are going to switch to
graph technology for tons of applications. So that's not going to happen. But
anybody who says graph databases are going away is also way off. I don't think they're
going away. Actually, as I have tried to say, this style, the one that Neo4j kind of brought,
the data model and the Cypher-like query languages,
is going to stay.
And I think they're likely to be the most successful
of every wave that has come so far.
There's more buzz around them too,
but I don't see them going away like XML databases did.
I think they're here to stay.
They've got a relatively healthy
application base,
and more can be done with graphs.
One reason that convinces me of this is that there's really no other data model, other than graphs and tables, that is generic enough to represent application data.
And there will always be an appeal to choosing one over the other. Some of this is also just a sociological question, in that people simply prefer thinking of
certain kinds of records as nodes versus tables. So they've done a good job. I think the property
graph model is a good model, and I do see it remaining around.
But obviously, I don't think graph databases are going to take over
relational systems or anything like that; that's also way off.
That doesn't demonstrate, I think, a good understanding
of the core power of relational
systems. But graph databases are here. They're going to get better.
The market is going to grow healthily, I think,
and there will always be many applications that use those systems. Now, what is a big challenge in databases?
So I don't see myself as a visionary, frankly. I can't really
foresee what kind of big problems people are going to work on. But one thing that
I am kind of inspired about, and I'm just putting it out there so that maybe some
PhD students who run into this podcast might get curious about it, is this: there has been an ideal, going way back,
that actually stemmed from AI, artificial intelligence,
from its very early days historically, of building information systems that can do deductions.
That has its own niche communities, with automated proofs
and things like that, and it connected very well with the database community back in the day.
I don't think the communities talk as much any more,
but these logic-based systems, like Datalog, are essentially based on a part of first-order
logic that really captures the relational calculus.
But it's a very limited part of first-order logic. And the logic-based
AI community, the symbolic AI community, has done a lot of work on finding
algorithms that can do deductions. So let me give you an example.
I wrote a blog post on this that will appear in SIGMOD Record,
although, for some reason, they said it would appear about six months from now.
But here's an example that you'll find if you read a book
on those core AI topics of knowledge representation
and the applications of first-order logic
to knowledge representation
and deductions and whatnot.
So one example that they will give is:
think of a database, a table of essentially items,
some elements, maybe some balls, and they have colors.
Maybe the colors are red or black.
And then there are some nulls:
there are some balls whose colors are not known.
But there's a constraint that says that every ball has to be red or black, which you can express in first-order logic, which you can express in some logic-based systems as well,
and in database systems too, I think.
Right.
This fits within relational algebra.
And my question is: what is the maximum number of balls of any one color?
What's the maximum number of reds or blacks?
Okay, so out of all the buckets that you could have in terms of colors, what's the maximum count?
And suppose, again, there are three records.
One is red, one is black, one is null.
And there is the constraint that colors are either red or black.
Now, the answer is two, but no relational system can compute this, right? Because
no relational system is going to do this for you; you need some other algorithm to
do this deduction, use the constraint, and be smart about it. You need
to do a little bit of a proof there, right? You need to say, okay, there are two cases: either
this null is red or this null is black. Let's do the query. When it's red, you get two,
because now there are two reds.
If it's black, you get two as well.
So the maximum has to be two,
and that's the correct answer.
And obviously, humans, we can answer this very quickly.
But no information system right now,
no database system will do this,
because we're really restricted
to a very small part of first-order logic.
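To spell the case analysis out, here is a brute-force Python sketch over the example's made-up table. It enumerates the possible worlds explicitly, which is exactly what current relational engines do not do: fill in the nulls in every way consistent with the constraint, run the aggregate in each world, and keep only the answer that is certain across all of them.

```python
from itertools import product
from collections import Counter

# Table of balls; None is a null whose colour is unknown.
balls = ["red", "black", None]
# Integrity constraint: every ball is red or black.
domain = ["red", "black"]

def max_count_of_one_color(completed):
    return max(Counter(completed).values())

# Enumerate all "possible worlds": every way of filling in the nulls
# that is consistent with the constraint.
null_positions = [i for i, c in enumerate(balls) if c is None]
answers = set()
for filling in product(domain, repeat=len(null_positions)):
    world = list(balls)
    for pos, colour in zip(null_positions, filling):
        world[pos] = colour
    answers.add(max_count_of_one_color(world))

# The answer is certain only if it is the same in every possible world.
print(answers)  # {2} -> the maximum is 2 in every world, so the answer is 2.
```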
I think a good thing to think about nowadays is this. First of all,
I don't even know how well motivated it is by the current AI applications,
but it's an ideal,
and I think that's important to remember as people think about reasoning,
as people think about what we can do
to extend AI applications with some logic-based deductions, because people are talking about
neuro-symbolic AI. And I don't fully understand the applications well enough to know whether this
gut feeling I have, that this might be a good topic to look at, is right. But at least as an
ideal, it has existed for over 50 years, maybe:
how can we build more intelligent
database management systems
by extending the part of first-order logic
that we capture?
So, you know, if somebody does the work of seeing
whether there are applications of this,
whether more deduction could be good
for certain applications,
I think it's a good time to start.
And maybe 10 years from now, those systems might be really useful,
because this logic-based AI community has done a lot of work
and found a lot of algorithms.
And it's the right time to think about how many of those
can actually be put back into a database management system,
made practical, and whatnot.
So that's an area that I think is important.
Then there are some classic areas, like, nowadays, heterogeneous computing,
GPUs, DPUs: how do we bring that kind of hardware into database techniques?
Those are clearly very timely things to write PhD theses on too.
So, yeah, those two immediately come to mind. If I were a PhD
student now, I'd explore those types of
things. And, you know, as usual, as in every decade, it's very exciting; there are tons of
things to work on.
Yeah, well, there you go, listener: you've been told those are the things you need to go and work on.
I see, yeah. Maybe. It's at least good to explore them, I think.
Yeah, definitely really interesting directions. Cool.
So it's time for the last word now.
What's the one thing you want the listener to take away
from your work on Kuzu and from
this podcast today?
So I'd like them to, first of all, understand that Kuzu is being developed seriously,
and I'd like to see more people use Kuzu.
We've got our first set of users who are trying out and playing around with the system, which is exciting to see.
But it is a system that aims to be highly scalable and fast on graph queries:
these join-heavy queries where you need to find connections between nodes.
And, you know, it is adopting the usability style of DuckDB.
So you can think of it in one sentence as DuckDB for graphs.
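For anyone who wants to try it, a minimal sketch of in-process usage from Python might look like the following. This assumes the Kuzu Python API roughly as documented around this time (a Database and a Connection object, with Cypher passed to execute), so check the current documentation for the exact calls and syntax.

```python
import kuzu

# Kuzu is embedded: the database is just a local directory, no server to run.
db = kuzu.Database("./people_db")
conn = kuzu.Connection(db)

# Define a small property graph schema and insert some data with Cypher.
conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE REL TABLE Knows(FROM Person TO Person)")
conn.execute("CREATE (:Person {name: 'Alice'})")
conn.execute("CREATE (:Person {name: 'Bob'})")
conn.execute(
    "MATCH (a:Person), (b:Person) WHERE a.name = 'Alice' AND b.name = 'Bob' "
    "CREATE (a)-[:Knows]->(b)"
)

# A join-heavy graph query: who knows whom?
result = conn.execute("MATCH (a:Person)-[:Knows]->(b:Person) RETURN a.name, b.name")
while result.has_next():
    print(result.get_next())
```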
So that's one thing to take away.
And, you know, that's about it.
Did you want me to comment about something else?
That's if you wanted to.
And the other thing I do think is that
educating yourself a bit more on other data models is going to help you.
Knowing more about RDF,
knowing more about property graph database management in general, is healthy.
And for anybody who's teaching courses on database management systems, I do this
all the time: spend a couple of lectures on these things. I think we need to educate people more
on these alternative data models and on the query languages around them, beyond SQL, like Cypher and
SPARQL.
And those languages are easy to pick up.
And it's good to know because, you know, you want to be able to use your skills on those technologies easily.
I think it's going to make people better engineers.
So that's another thing that I want to advocate very, very strongly for.
Fantastic. Well, that's a great thing to end on, so
we'll end it there. Thanks so much, Semih; it's been an absolute pleasure talking to you. Thank you so
much for coming on the show. And if the listener is interested in knowing more about all of Semih's
work and Kuzu, we'll put all the links to everything and all the blog posts and whatnot in the show
notes. And if you enjoy the show, please consider supporting the podcast through Buy Me a Coffee,
and we will see you all next time for some more awesome computer science
research.