Disseminate: The Computer Science Research Podcast - Per Fuchs | Sortledton: a Universal, Transactional Graph Data Structure | #13
Episode Date: November 28, 2022

Summary (VLDB abstract): Despite the wide adoption of graph processing across many different application domains, there is no underlying data structure that can serve a variety of graph workloads (analytics, traversals, and pattern matching) on dynamic graphs with transactional updates. In this episode, Per talks about Sortledton, a universal graph data structure that addresses this open problem by being carefully optimized for the most relevant data access patterns used by graph computation kernels. It can support millions of transactional updates per second, while providing competitive performance (1.22x on average) for the most common graph workloads compared to the best-known baseline for static graphs, CSR. With this, we improve the ingestion throughput over state-of-the-art dynamic graph data structures, while supporting a wider range of graph computations under transactional guarantees, with a much simpler design and a significantly smaller memory footprint (2.1x that of CSR).

Links: Paper | Per's LinkedIn | Graph Framework Evaluation | Implementation
Transcript
Hello and welcome to Disseminate, a computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by Per Fuchs, who'll be talking about his VLDB'22 paper,
Sortledton: a Universal, Transactional Graph Data Structure. And I believe this was,
you can correct me if I'm wrong here, Per, but this was nominated for Best Paper, right?
So this is a runner-up. Yeah, fantastic. So Per is, or was I should say, a PhD student
at the Technical University of Munich and he's now working at Tableau and his research interests
are graph database systems and dynamic graphs. So welcome to the show Per. Yeah thanks a lot for the
introduction I'm happy to be here. Fantastic. So can we start off by explaining to the listener what a graph database system is, what graph processing is, and just sort of give us some background to the paper?
Yes, so basically for me, graph processing or graph database systems are really about handling at least three different graph workloads.
These are graph pattern matching, analytics, that would be PageRank,
and traversals like single-source shortest path. And all of these have really different requirements
towards the system, different access patterns towards the underlying data structure, which is
where this connects to my work. For me, it's really important to say it's universal.
You need to have all three of these workloads. Nobody wants one system for the first workload,
then another system in their company to support the second workload,
and then maybe a third one for the third workload.
And as well, you want updates on the systems.
You don't want to reload the graph every time.
It takes longer to reload a graph than to run PageRank. So a lot of the static graph processing systems out there are a bit... it's
difficult to put them into industry, I would say. So there's a connection to what can we really
use in industry here. Fantastic. So just kind of, I guess, going back a little bit, how did you get
interested in graphs in the first place? What was the attraction here? I started really early.
My bachelor thesis was already connected to graph database systems.
My supervisor had me use them to detect denial-of-service attacks
in programs, using graphs which represent programs.
And I always found them fascinating
because they concentrate more on the connection between things.
So relational database systems have a lot of entities in them,
and then you join them to get results.
But graph database systems put this idea, that there are very important
connections between many things, as a first-class citizen.
And it makes it really easy to express these ideas naturally in an interface.
So that's another big challenge here, which I didn't touch on in my paper,
but I would have liked to in my PhD.
Awesome. So I know in your paper you say that it's an open problem to build a system that can efficiently process this whole range of graph algorithms over a dynamically changing graph dataset.
So it's still an open problem. But why is this? Can you maybe elaborate on that a little bit more?
Yeah, the challenge is really you have different access patterns for each workload.
So there are intersections, or generally set operations, for graph pattern matching,
which requires you to have a sorted data structure or a hash set.
You want very, very fast inserts, up to 2 million edges a second
over the whole graph.
That's the known industry need of Twitter and Alibaba.
And finally, you want this to be really easy to scan
because this is what graph analytics
and traversals mostly depend on.
And you want this to have random access towards each neighborhood, which is super important
for traversals.
And each of these things is solved on its own.
So there are data structures out there which we can just put together to make a graph data
structure.
And this was kind of an observation I had really early on. So my feeling was there is a nice and easy solution
reusing existing things out there
to solve all these challenges in one.
And then the next big challenge is to do that
with utmost efficiency
because a single edge is a super small entity.
It's 32 or 64 bytes.
So that's just super small,
and every overhead matters.
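To make the pattern-matching access pattern concrete, here is a minimal sketch (my illustration, not code from the paper) of why sorted neighborhoods matter: with sorted adjacency lists, counting the common neighbors of two vertices, the core step of triangle counting, is a single merge-style intersection.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: a neighborhood as a sorted vector of 64-bit vertex IDs.
using Neighborhood = std::vector<uint64_t>;

// Count common neighbors of two vertices by merging their sorted
// neighborhoods -- the set operation graph pattern matching relies on.
std::size_t intersection_size(const Neighborhood& a, const Neighborhood& b) {
    std::size_t count = 0, i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j])     { ++count; ++i; ++j; }
        else if (a[i] < b[j]) { ++i; }
        else                  { ++j; }
    }
    return count;
}
```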
Cool.
So I guess that segues nicely
into, what is Sortledton?
And I struggle to pronounce this,
so I'm probably not pronouncing it correctly,
because I think when I first read it,
I was like, is it Sortled-on?
Yeah, it's confusing. So how do I pronounce it, I guess, and what is it?
Yeah, Sortledton.
It comes from being sorted,
and it's basically a simpleton idea,
you know, it's very simple.
And that's the combination,
Sortledton, in the end.
Maybe a bad joke on my side.
I think nobody ever got it.
No, I didn't realize that was it, but no.
Cool, yeah, so I guess what is it?
Yeah, so can you give us a high-level overview of what it is,
and then we can dig into the details.
Yeah, so Sortledton is the data structure which this paper is about,
and it's very simple indeed.
It's basically a very well implemented
adjacency list.
So everybody knows the adjacency list, and then
you dig into all the requirements I mentioned before,
and you find out that you can reuse existing data structures to build something.
And that is Sortledton: a very highly efficient adjacency list.
But I think the paper has much more to offer.
The discussion is all arranged around this data structure,
but we touch on each design decision we took in coming up with it.
So you can see the trade-offs we made, you can see which trade-offs are there,
and the process of designing a well-working graph data structure.
So my hope is that, after you've read the paper,
you have a much deeper understanding of what it takes to build such a data structure.
And then you can take Sortledton, which is so simple that it's easy to understand
and explain, and build your own data structure on top of it. Because you understood it
so well, you will know which parts you would need to change to make it work really well
for your specific workload. And it's also, due to its universality, a good base to start
with for a graph database system.
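As a rough mental model of that "very well implemented adjacency list" (this layout is my simplification, not the actual Sortledton code), think of a vertex index whose entries point at independently growable, sorted neighborhoods; Sortledton's contribution is implementing this shape very efficiently, with blocked skip lists as the neighborhoods and versioning for transactions.

```cpp
#include <cstdint>
#include <vector>

// Simplified mental model, not the real layout: one indirection from a
// vertex ID to its own sorted, independently updatable neighborhood.
struct Neighborhood;  // e.g. a sorted vector, or a blocked skip list

struct VertexEntry {
    Neighborhood* adjacency_set;  // where this vertex's edges live
    uint64_t degree;              // cached size, convenient for analytics
};

// The vertex index: indexed by vertex ID, scanned sequentially by analytics.
using VertexIndex = std::vector<VertexEntry>;
```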
You touched on it there, but can we talk a little bit more about what the design goals were when you were developing Sortledton?
The design goals, I think, were on two sides.
One, it should be very dynamic.
Talking, as I said, 2 million updates per second.
We actually made 4 million in the end.
And this is fully transactional, serializable updates of single edges.
And the second design goal was to be as close as possible to the static gold standard data structure for graphs,
namely the compressed sparse row representation (CSR).
And we wanted to show how close you could get while being dynamic and transactional.
With that, there was a third design
goal I think that should be in every data structure
you design, it's space efficiency.
It should be very space efficient.
These, I think, were the three design goals we gave ourselves,
and we wanted to press them to the bare minimum of what's possible
and explain why we believe this to be the minimum.
I guess my next question is, and I guess now I know what the idea behind the name is,
what were the key ideas underpinning Sortledton?
The key ideas underpinning Sortledton, well, first of all,
there are a few principles underpinning it.
It's simple and it should be explainable.
It should reuse existing things.
And then it should be a very, very performant solution in the end.
So those were the principles we were aiming for.
And the key idea to get there is to analyze the access patterns of each workload,
clearly write them down, and then say we need that,
where did we get that, and find out that there are five access patterns in the end. And out of these five,
one turns out to be much less important than all the others, which makes it easy to
design something in the end. So we kicked out the one property which a lot of people believed, in the
beginning, to be very important for a strong graph data structure. And it turned out,
experimentally, it's not that important; you don't gain that much from it. And by removing it,
you can show that an adjacency list is basically as strong as a CSR. And that makes it so much
simpler.
Cool. So can you maybe go through these five properties
and talk about how, because I know you did some initial experiments,
micro-benchmarks, to explore the design space here.
Can you maybe run through those and talk through your initial methodology
for working out, and arriving at the point that, this one property
everyone thinks is really important is actually not that important?
Yeah, sure.
I can do that. So basically
CSR has five access
patterns covered really well.
The first one is, you have a vertex and you want to find out where
the neighborhood is; that's a single
indirection lookup.
That's the first access pattern or
property. I think access pattern is probably the better
word here.
The second thing is, once you have a neighborhood,
it's all sequential memory,
so you can scan it as efficiently as you can with modern hardware.
The third thing is that it's sorted;
each neighborhood is sorted internally.
The fourth is that all vertices are sequential,
which is important for some algorithms as well.
They want to access all vertices in the same order.
And the final fifth access pattern is if you want to access all neighborhoods in the order they are stored.
And this is happening a lot with analytics.
And this is the only difference, if you think about it, between a CSR and an adjacency list.
An adjacency list can cover all the other four with ease,
but this one, having all the neighborhoods sequentially in memory,
is what makes the CSR really different from an adjacency list.
And it's very hard to get right if you want to have it dynamic
because you have this gigantic data structure
holding all edges of the graph.
And this is just a very hard data structure to get right.
There are existing approaches to do that out there,
but they get quite complex.
And then you can run a simple micro-experiment
of building the simplest adjacency list you can think of,
out of standard vectors, and a CSR,
and then comparing them across a wide range of workloads,
so the Graphalytics benchmark we used. And it shows that there's very, very little difference
between a very simple adjacency list and a CSR.
The difference is 1.1 or 1.2x in performance, meaning latency of all the workloads.
And yeah, by this experiment you can just say, okay, we leave this one property out,
and then we build an adjacency list which is a bit more complex than the one we just used,
and we can expect very good results. To give a little more context here, we also ran
micro-experiments for each of the other access patterns; leaving out the next most important access pattern would mean a slowdown
of 3x over all the workloads. So one can clearly see that there's a huge difference between the
importance of these access patterns. And the most important access pattern in this benchmark and
the micro-benchmarks is actually being sorted, because you simply can't run graph
pattern matching without being at least sorted.
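For reference, here is a small CSR sketch (illustrative, not taken from the paper) with the five access patterns marked; the fifth one, all neighborhoods living back-to-back in one array, is exactly the property an adjacency list gives up.

```cpp
#include <cstdint>
#include <vector>

// Illustrative CSR (compressed sparse row) layout. The neighborhood of
// vertex v is edges[offsets[v] .. offsets[v + 1]).
struct CSR {
    std::vector<uint64_t> offsets;  // |V| + 1 entries
    std::vector<uint64_t> edges;    // all neighborhoods, back to back

    // (1) single indirection from a vertex to its neighborhood,
    // (2) each neighborhood is one sequential run of memory,
    // (3) each neighborhood can be kept sorted,
    // (4) vertices 0 .. |V|-1 are laid out sequentially,
    // (5) *all* neighborhoods are contiguous in one array -- the one
    //     pattern a dynamic adjacency list gives up.
    template <typename F>
    void for_each_neighbor(uint64_t v, F&& f) const {
        for (uint64_t i = offsets[v]; i < offsets[v + 1]; ++i) f(edges[i]);
    }
};
```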
Awesome. So you touched on it a little bit there, that there are other existing data structures out there.
How does Sortledton compare with these?
Maybe can you give a brief overview of what's out there
and then kind of how you contrast with those?
Yeah, let me run through that.
Basically, the other graph data structures we looked at
were graph data structures which supported single-edge updates. We believe single-edge updates to be really important for graph database systems. However, this is something we kind of reconsidered after writing this paper, so we can maybe touch on that later again. So Stinger, LLAMA, and GraphOne, they can support analytics and traversals quite well.
There are really recent papers on that.
And Stinger is one of the old ones, and LLAMA as well, which everybody refers back to again and again.
Then we have LiveGraph, which added transactional support to the picture.
So again, really important for graph database systems.
And finally, we have Teseo.
And Teseo is the only one
which has sorted neighborhoods.
So it's the only one supporting
graph pattern matching so far.
Teseo was only out the year before us
at the same conference.
And basically,
they were finishing up
while we were writing our paper.
And so the main difference
to Teseo is we are much simpler.
Teseo has this last access pattern of the CSR.
They have it, but this makes their design so complex that, yeah,
it was nice to show that you could do the same,
or maybe have better performance, with a simpler system.
So that's the difference here.
There's one system we didn't compare to, but I think it's really notable.
It's Aspen.
And this is something one should have an eye on.
We didn't compare to it because it only supports batched updates.
But again, as I said, we reconsidered this towards the end of our work, I think.
And they do really strong work as well.
How did you go about implementing Sortledton?
How was it implemented?
How difficult was it to implement?
I'm guessing with the fact that it was designed to be simple,
it's much simpler to implement than something like Teseo would be.
Yeah, I would say so.
So the implementation is in C++ and from scratch.
It's basically an implementation of a vector for the index,
which we actually used from the building blocks.
And then we have the implementation of a blocked skip list.
So a blocked skip list is a skip list where each node holds multiple elements. Yeah, that's mainly it. And over that, we built transactional support by
well, we redesigned the transactional support
from relational systems, took the
paper from our chair, and again made this simpler,
realizing two things about graph workloads.
The first thing is mostly it's very, very small updates, like a single edge or two or three edges.
So updates tend to be very, very small.
And second, you know normally what you might touch before you do the update.
You know you want to insert, say, five edges, and that means I need to lock the vertices on each side,
so at most 10 vertices will be touched.
So you know the write set beforehand,
and by that you can save on things; for example,
you don't need an undo list anymore.
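For intuition, here is a very rough sketch of what a blocked (unrolled) skip list node could look like; block size, ordering invariants, versioning, and concurrency are all omitted, and the real implementation differs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a blocked skip list node: each node carries a small
// sorted block of edges instead of a single element, so neighborhood scans
// stay mostly sequential while inserts only shift entries within one block.
constexpr std::size_t kBlockCapacity = 256;  // illustrative block size
constexpr int kMaxLevel = 8;                 // illustrative skip list height

struct EdgeBlock {
    std::array<uint64_t, kBlockCapacity> edges{};  // sorted destination IDs
    std::size_t size = 0;                          // valid entries in this block
    std::array<EdgeBlock*, kMaxLevel> next{};      // forward pointers per level
};
```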
Cool. So how did you arrive at that assumption
that there are small updates
and that you know the size of them a priori?
Yeah, I mean it's one of the biggest challenges to actually get really good graph workloads. We found
this very hard. So what we did is we just looked at all the benchmarks which are out there, mainly the SNB
benchmark. And we found that the SNB benchmark focuses on these really
small updates, plus one thing.
They have deletions in there which can be
cascading.
So there is actually one workload
out there which we are aware of which is
not falling into this area
of really small updates,
cascading deletes.
But we also came to the conclusion that cascading deletes
is something totally new, which so far no transactional system
we are aware of could do well.
Any transactional protocols we know of wouldn't work so well
with something that can basically touch a whole graph,
and you have no idea how much of the graph it will touch before.
If you want to run this with really heavy workloads like PageRank in concurrency,
that would be very hard to get right.
So we excluded this workload.
So this workload is something we couldn't do.
And I believe it would be a paper in its own right
to actually come up with good transactional support
for cascading deletes or other things
that can start at one vertex,
go to the whole graph,
and then do reads and writes.
How long was the implementation effort?
Are we talking the magnitude of months, weeks, years?
So the implementation effort was maybe around, I think, two months.
Kind of started in January to March.
So roughly around that.
And then there was a prolonged thing of actually getting this evaluated.
We were really happy. Dean De Leo from CWI had just published Teseo,
kind of our main competitor there,
and he published a test harness with that.
So we could just reuse that.
So that was just amazing for us,
made it so much easier,
and we had great support from him.
Yeah, however, we had to iron out a few bugs or, let's say,
consistency issues between our implementation and how he saw things, and get the numbers right such
that when we measured on our system, we got the same measurements with his test harness,
and this took quite a long while. I think it was maybe another month or two.
And we then finally got the numbers we wanted.
Fantastic.
So let's talk about the numbers then.
So how did you go about evaluating Sortledton?
And I guess what were the questions you were trying to answer?
So how were the, and I guess, what did you compare against as well, right?
So there's a few questions here.
You can take them one at a time.
Yeah, so throughout the paper we had a lot of micro-benchmarks to just give the meat behind our decisions, right, to be like, this is why we do that, and here you can clearly see why we did this in the micro-benchmarks.
For those, I think, just read the paper. And then the second part is the evaluation of the paper, of course. And here we touched first on how many updates we can do, throughput per second. And we
always compared against all the competitors I named in the related work. So we measured throughput
per second. We compared how much slower are we in any of the others compared to the static gold standard CSR. We compared how much
do we scale up for inserts, and then we looked at HTAP. So what happens if we run the
graphalytics benchmark extremely read heavy while inserting as many edges as we still could at the
same time. So it's basically combining the two first experiments
and then show how much there was a slowdown on either side.
And yeah, that's our experiment.
So that's a question we try to answer.
Cool.
So space, of course.
We look at space efficiency as well.
Yeah.
Yes.
So what was the experimental setup in terms of the hardware
and the variety of benchmarks you used?
So we basically used one benchmark, which is known,
the Graphalytics benchmark.
And for all systems, we used the same implementation of the algorithms.
Mostly they were based on the benchmark suite from SCOPE. So the implementation of the algorithm
is the same for each data structure. You only see the difference in the data structures there.
And the other benchmark is, I mean, it's not a clear benchmark,
it's simply you take a lot of graphs
and you try to insert all the edges
as fast as possible.
And in our case, I think we
inserted 80%
of the edges first and then the last 20
and measured the last 20 because
it's not that interesting
inserting it into an empty graph.
So that's that.
We have a slight variation of this experiment
of inserting everything as fast as possible.
And the HTAP experiment is really a combination of the two before.
We ran them at the same time on the same machine.
You asked about the machine.
We evaluated on an Intel CPU, which I think had 48 physical cores.
So I guess now, what were the key results?
What were the headlines?
What's the interesting findings from all of these experiments?
Yeah, so how many updates can we support?
It's around 4 million, and this is averaged over a wide range of graphs.
So the smallest would be Dota League
with 50 million edges, super small,
because you also need to be fast on small graphs.
We found out that at least one competitor
wasn't that fast on small graphs.
And then on the other hand,
you need to be, of course, good on big graphs.
And we took the Friendster graph there, with slightly over 2 billion edges, as our biggest graph.
On average, and this is relatively stable over all graph sizes, we can do 4 million updates per second.
So industry needs 2 million, as far as we are aware.
And then the second thing is how fast are we running read-heavy workloads? And we are roughly 85% of CSR.
So we are very close to the static data structure here.
And finally, we are two times as big as the CSR, which is more or less two times
the theoretical minimum, because you need to leave some space for edges to
insert and then you need some space for versions.
Cool. So that's like 85% of the CSR; do you think it's possible to get closer with some more work,
or is there theoretically a bound there that you're going to
bump up against, no matter how good the implementation is
or whatever, you're never going to quite get there?
That's a very interesting question.
To be honest, I have no clear answer to that.
I think we left very many optimizations in our data structure on the
way because towards the end of this paper, we really needed to get it published because
Teseo came out and they were really close to what we wanted to have. And then, yeah,
we really felt the pressure to put it out there now before somebody comes in front of
us. Just in the same time, we got ours in submission.
Terrace came out.
This is in Sigmund, covering quite similar ideas
and also the idea that you can use an adjacency list
to have different set implementations
depending on the characteristics of the neighborhoods,
which was an idea we were probably aiming for as well.
So things came out
and then we were pressured to finish up. I forgot the question, what were we going to?
Whether it's possible to close the gap on the CSR.
So yeah, we left many optimizations out,
so there might be 5% to gain even in our implementation still.
But no, there's no theoretical limit,
but there's a limit of, something needs to cost, right?
There's no free lunch.
If you want updates, you won't get to 100%,
and the last few percent will be harder and harder,
and then it also becomes a question of diminishing returns there, for sure.
So are there any situations when
Sortledton's performance is suboptimal?
And I guess this is just kind of getting at
what are the general limitations of this data structure?
There are two general limitations I'm aware of.
The one is fixable.
So we aim for single-edge updates.
As I said, we believe that to be very important for graph database systems
in the beginning of our work.
And then later on, when I look back at it, being finished writing it,
I was like, okay, in the end, I didn't find a workload
which really, really needed single-edge updates.
Many workloads could be done with batched updates.
It's basically just a question of
freshness in between.
And there are papers out there that show
you can get really close to the freshness
with batched updates as you could do
with single-edged updates.
So looking back at it, what I would have
done is batched updates. You can get
much higher throughput on updates,
make the system design
of an HTAP system
much simpler.
And so that's one thing,
which I think is a limitation,
but it's not so much
in the design.
We could change
the transactional system
behind it,
and then we could support
batch updates
on the same data.
And the second thing
is that
there's a current trend to going away from in-memory systems towards systems which use SSDs really well.
I think Victor Lais is really leading there, and a lot of his PhD students have great work towards that.
And our data structure is designed for in-memory and if you want to go off
memory with that you would have a harder path than if you choose to say which is the other
system I talked about multiple times before CWI, Dean De Leon and Peter Bongs
are the authors great system great work and I think if you wanted to go SSD with it,
they have a simpler path to boards,
but therefore a much more complex in-memory implementation.
And there might be a combination of these two
where you can get the best of both worlds.
Interesting. That's really interesting.
Has there been, when you were working,
did you ever think about doing some form of a distributed data structure,
and how might that look with Sortledton?
Distributed, yeah.
So if you wanted a distributed in-memory system,
again, I think Sortledton would do really well,
because it can so easily be pulled apart.
It's an index, and then it points to places where the neighborhood is.
So in this regard, I don't see any issue with Sortledton
being pulled apart across multiple systems.
But the real challenge in distributed graph processing
is, of course, somewhere else.
It's about getting the partitioning right.
And this is a whole different profession,
a whole different line of research,
and I think it's a very, very hard line of research,
where I didn't see any results yet
where I read the paper and was like, that's it.
The only result I saw which was
really convincing is
from Alibaba
which basically showed that
depending on your workload,
different partitionings will do you well.
And this just makes it harder because they clearly show there's a very clear connection
between partitioning and workload.
And this makes it very, very hard.
Complete moving target.
Yeah.
Yeah.
Yeah.
Interesting.
That's cool.
So Sortledton is publicly available, right?
There's a repo there.
You can go and pull it down and have a play with it, if the listener wishes to.
Yeah, it's publicly available.
It's open source.
I think, though, the implementation is probably more of a guide or proof of concept.
So if I wanted to use that in an industry system,
I would probably re-implement it.
And this is very easily done, because it's clearly set out
which data structures are used in each place,
a skip list or a blocked skip list.
And lock-free vectors are well known.
It's not hard to implement them yourself.
You'd maybe get a much cleaner implementation
than a PhD student trying to wrap up a paper
and doing his first C++ project in a few years.
So, yeah.
Yeah.
Cool.
So what's the API like of the data structure?
How, as the end user of it, how do I interact with it?
I think exactly as you expect.
It has insert edge, delete edge for the updates.
And then you can ask it to scan neighborhoods for you.
And it will basically return you an iterator over the
neighborhood, but the iterator is batched, so you don't have to call an operation for each new edge;
you call an operation for each batch of edges. And this is basically one
of two ways of doing it. The other way would be to offer you a scan function and you give a lambda in, which is executed on each edge.
But the important thing you need to get right here is
don't call a function on every edge.
These two things work,
and I think they have really similar performance.
But if you call a function, so an iterator call, on every edge,
you will have a huge performance penalty.
And this is also something
at least one competitor got wrong.
It's an
implementation mistake you shouldn't make.
You really need to
think about how you access the edges in the end.
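To pin this down, here is a sketch of the interface shape just described; the names and signatures are my assumptions, not the exact API of the released code, and a plain sorted vector stands in for the real versioned structure. The scan takes a lambda that runs per edge but, being a template parameter, gets inlined, so there is no opaque function call for every edge.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the interface described above.
class Graph {
public:
    explicit Graph(std::size_t vertex_count) : adj_(vertex_count) {}

    void insert_edge(uint64_t src, uint64_t dst) {
        auto& nb = adj_[src];
        auto it = std::lower_bound(nb.begin(), nb.end(), dst);
        if (it == nb.end() || *it != dst) nb.insert(it, dst);  // keep sorted
    }

    void delete_edge(uint64_t src, uint64_t dst) {
        auto& nb = adj_[src];
        auto it = std::lower_bound(nb.begin(), nb.end(), dst);
        if (it != nb.end() && *it == dst) nb.erase(it);
    }

    // Batched access: hand out the whole neighborhood (or a block of it) at
    // once, so the caller never pays an opaque call per edge.
    const std::vector<uint64_t>& neighborhood(uint64_t v) const { return adj_[v]; }

    // Lambda-based scan: f runs per edge, but as a template parameter it is
    // inlined -- the property that matters for performance.
    template <typename F>
    void scan_neighborhood(uint64_t v, F&& f) const {
        for (uint64_t dst : adj_[v]) f(dst);
    }

private:
    std::vector<std::vector<uint64_t>> adj_;
};
```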
Okay, cool.
Does it support
properties? What's the graph model here?
Is it just edges, an undirected graph,
or can I have labels on edges?
What I'm getting at is, can I support a labeled property graph model in this?
It's meant to be a topology data structure,
so it's mostly vertices and edges.
It supports edge properties.
You can have as many as you want,
and you could make them as big as you want,
but if you make them very big, it won't work anymore.
So you can have a few small edge properties that will work really well.
Vertex properties, we didn't implement them,
but basically there would be a vector next to the vertex vector, right?
It's kind of a mirroring data structure.
And then the last question
was labels. Labels are a really
interesting thing, right?
It really depends how many you have.
And
if you want to partition a graph
totally on your labels, then you could just
use Sortledton multiple times.
So that's one extreme.
And the other idea is to have labels inline with the
edges. That's kind of the other extreme; then you don't have much filtering power over them,
you still need to scan all edges and filter them by the labels. But then you can use as many
labels as you want; properties kind of become labels. So these are the two extremes of the design space you can look at.
And both are kind of supportable in Sortledton,
but what you maybe want to do is something in between,
and this is something you would need to build yourself.
And I think the best paper here is the one on A+ indexes,
with Semih Salihoglu as one of its main authors.
I think they have the deepest treatment of labels and properties
within graph database systems out there.
Very great paper, but it's a static graph data structure.
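As a tiny illustration of the two extremes just mentioned (my sketch, not something Sortledton implements): either store a label inline with every edge and filter while scanning, or keep one separate topology per label.

```cpp
#include <cstdint>
#include <vector>

// Extreme 1: the label travels inline with each edge; filtering by label
// means scanning the whole neighborhood and skipping non-matching edges.
struct LabeledEdge {
    uint64_t dst;    // destination vertex
    uint32_t label;  // inline edge label
};
using InlineLabeledGraph = std::vector<std::vector<LabeledEdge>>;  // [vertex] -> edges

// Extreme 2: one adjacency structure per label, so a label-restricted
// traversal only ever touches the matching topology.
using PerLabelGraph = std::vector<std::vector<std::vector<uint64_t>>>;  // [label][vertex] -> edges
```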
Cool, cool.
Yeah, so when you were working on Sortledton,
what was the most interesting thing that you learned or maybe also the most unexpected lesson that kind of came out of working on this?
Under, I guess, such a high pressure environment as well when you're racing to publish because it feels like everyone else is publishing the same thing at the same time.
Yeah.
I think the most unexpected lesson wasn't in the technicalities.
It was in paper writing. And I think this is a lesson wasn't in the technicalities. It was in paper writing.
And I think this is a lesson for every PhD student.
In the very beginning of this paper, I had the idea,
and it didn't change much towards the end.
And I told this idea to a very well-known professor in the area, and they said,
well, there is no paper left in this area.
No way you can publish in this area.
It doesn't work anymore.
It's full.
And, yeah, it turned out to be a best paper runner-up, so
apparently it wasn't
full.
So I think this is a very interesting
lesson for every PhD student. If you have a great
idea and you're convinced that you can
convince a few other people of it,
it might be very well
publishable, although some people say it's unpublishable.
Fantastic, that's a great lesson to take away. Awesome. So my next question is, progress in research is very bumpy, right?
There's ups and downs. So from the conception of the initial idea for Sortledton to that final publication,
what were the things that you tried along the way that failed,
that maybe the listeners would benefit from knowing about?
So they don't make the same mistakes, I guess.
The only thing that really failed was my attempt to implement this fast,
serializable concurrency control in a very short amount of time, trying to get it right.
It's harder to get it right than one thinks.
And then, yeah, I switched to a simpler implementation,
which turned out actually to work super well
and be kind of interesting
because nobody used it before in terms of graph processing.
So that was one thing that didn't play out.
I think all the other ideas played out.
Yeah, it just worked really well, actually.
There's one thing I also learned during writing papers.
It's a lot about the story.
Like, I came up with the idea for Sortledton very early on,
and then implemented that and measured that,
and then had the numbers I wanted.
And then Teseo came out and became a much stronger paper
over the course of the submission.
A very, very strong paper by then.
So the last part of getting this paper out
was really storytelling, explaining
why a data structure which is just simpler
is still really worthwhile to read about,
although it has some weaknesses compared to Teseo,
and some strong points compared to Teseo,
but in the end was really similar,
with the main difference being that it is simpler.
And then, yeah, there were a lot of points to make
and very nice different stories to tell about it.
The next question is maybe not as relevant for you now,
now that you've moved on to Tableau, but what's the next step?
What's the future research for Sortledton?
Where does it go from here?
Yeah, so basically my future research would have been
to integrate this in some kind of system
to make it a much better graph database system.
There are two ways going about this.
One way is to integrate it as an index,
one-to-many index basically,
into an existing relational system.
And I think many people tried that,
and I'm aware of at least two ongoing research projects
at very good chairs trying to do that.
And I think it's a very worthwhile approach.
However, the one weakness with this approach, I think,
is that it kind of relegates the graph to a
second-class citizen, right.
You put it in a relational
system and then it kind of needs to fit the
relational model, and
the best would be if you could run SQL
on top of it, and yes,
you can do page rank in SQL, but
it's kind of unnatural,
and it gets harder for things
like shortest paths.
So there is a bit of an idea
from my supervisor, Jana Giceva,
where they say, well, what should we do instead?
Instead of bringing everything to SQL,
like graph processing, machine learning,
you can put it all in a relational system.
But what we can also do is use relational techniques
to build the system,
but break down the operators into smaller parts.
So an operator of a relational system is, for example, a join,
but you can actually break down a join into much smaller operators
and build the join out of those,
and then expose these smaller operators,
which are, for example, map,
or they could be reduce or filter.
You can expose these operators as the interface of the database
and then have a much more natural way of expressing machine learning,
graph processing, and relational
workloads in the same system. I think this is a great chance for graph processing or graph
database systems because it gives us a way to be a first-class citizen in these systems.
Fantastic. So can you maybe tell us a little bit about some of the other research you did?
Yeah, so mostly I've been looking
at this work, of course,
and then started
looking into how you would use
these small operators, they're called sub-operators,
for graph workloads
and came up with some initial
work of how you can express different
graph workloads in a wide range of graph
workloads as these sub-operators. So this would have been my next work
kind of like setting up
a processing system on top or an execution
layer on top of the graph storage I had.
What do you think is the biggest challenge
in the graph space at the minute?
The biggest challenge is integration.
All the ideas, all the basic ideas for really strong graph database systems,
I think by now are out there.
And it's time to pick them up and bundle them in one system
and show how they all belong together,
how they can work all in one system.
And yeah, just cover a wide range of topics with these.
And then we can convince people that it's worthwhile using and building these systems in
industry.
Fantastic.
And last question now.
What's the one key thing you want the listeners to take away
from your research and from this episode?
The one key thing is that once you want to build
an in-memory graph processing system
and want it to be really strong,
like nearly as strong as a static data structure
and have updates,
a very simple and well-implemented adjacency list can do that.
And the reason for that is that there are five access patterns
in graph workloads,
and one of these five is so much less important than all the others.
It's the only thing that makes a difference between a CSR,
or a complex data structure, and an adjacency list, which is a much simpler data structure.
Fantastic. And so let's end it there. Thank you so much, Per.
That was fantastic.
If the listener is more interested to know about Per's work,
I'll put links to the paper and everything in the show notes.
And we will see you all next time for some more awesome computer science research.