Disseminate: The Computer Science Research Podcast - Gábor Szárnyas | The LDBC Social Network Benchmark: Business Intelligence Workload | #44
Episode Date: December 4, 2023
Summary: In this episode, Gábor Szárnyas takes us on a journey through the LDBC Social Network Benchmark's Business Intelligence workload (SNB BI). Developed through collaboration between academia and industry, the SNB BI is a comprehensive graph OLAP benchmark. It pushes the boundaries of synthetic and scalable analytical database benchmarks, featuring a sophisticated data generator and a temporal graph with small-world phenomena. The benchmark's query workload, rooted in LDBC's innovative design methodology, aims to drive future technical advancements in graph database systems. Gábor highlights SNB BI's unique features, including the adoption of "parameter curation" for stable query runtimes across diverse parameters. Join us for a succinct yet insightful exploration of SNB BI, where Gábor Szárnyas unveils the intricacies shaping the forefront of analytical data systems and graph workloads.
Links: VLDB'23 Paper, Gábor's Homepage, LDBC Homepage, LDBC GitHub
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host,
Jack Wardby. A reminder, if you enjoy the show, please do consider supporting us through
Buy Me a Coffee. It really helps us to keep making the show. Today, it's my great pleasure
to say I'm joined by one of my good friends, Gábor Szárnyas, who will be telling us everything
we need to know about the LDBC Social Network Benchmark's Business Intelligence workload. Gabor is a dev rel at
DuckDB. And yeah, that's a little brief intro for you there, Gabor. So can you tell us more
about yourself and how you became interested in database management research? Sure. So thanks for
having me. And I had a lengthy stint in academia. So I became fascinated with databases as an undergraduate student.
Back in 2010, I had a really nice undergraduate database course, and it was quite theoretical.
And as usual at the time, it was about normal forms and calculus and making theorems on
transactions and proving them, but I really enjoyed it. So I chose my undergraduate topic
on this, and I did a thesis
on bitmap indexing, which is more sort of a systems work. And then I joined a group which
worked on a distributed graph query engine. This was challenging because we wanted to make it
incremental. So my master's project was prototyping this. And of course, my master's project was also
discovering that this is not only
a master's project; this is something that you can do for a more extended period of time.
So I joined the same research group as a PhD student, and I kept building this system.
And as I was building it, I realized that we need a really good benchmark.
And there was already a benchmark that other researchers in the group had started. They had been struggling with getting it published for a couple of years.
So I joined that project and got it to publication.
And this was my first serious attempt at benchmark framework building.
And it was very interesting.
We did incremental graph queries.
We had dozens of systems from various spaces like RDF and graph databases and SQL.
So I already played with Neo4j, MySQL, SQLite, and many other systems.
So this was a very good introduction.
And then I finished my PhD.
And around the end of my PhD, I started to work on the LDBC benchmark, which is the topic
of today's conversation. So I did my work on that as a postdoc and now I moved to industry. I'm at
DuckDB Labs, which is making the DuckDB embeddable database product. Fantastic. I'm learning more
about you every day, Gabor. I didn't know all that stuff. So yeah, and the benchmark you
referenced there, was that the Train Benchmark? It was the Train Benchmark.
So this is something that's set in the railway domain.
So it is capturing a safety-critical domain,
where you build a railway system model with some sort of dedicated design software.
And as you build it,
ideally you would want to have timely feedback on any error or any issue that you may insert into the system.
And it turns out that if you translate this to the language of databases,
what you basically want is an incremental query engine that also understands graphs,
so it can do reachability and similar queries.
So that was the benchmark.
And, of course, it was very difficult to tackle for systems
because it has big global queries on a complex graph that
you can think about as a property graph or you can think about it as an object graph.
Nice, nice. Well, let's talk about another benchmark then. So before we do that,
I should probably start with some background for the listeners. So you can maybe tell us
what LDBC is, what it stands for. And I should probably say at this point as well,
as a bit of a disclaimer, I'm also a co-author on this paper, but I'm going to try and remain impartial all the way through
this. So yeah, the floor is yours, Gabor. Tell us about LDBC. All right. So LDBC originally started
as a European Union research project. It was two and a half years long and started in late 2012
with the initial goal of creating a consortium where
industry and academia can collaborate. They can define standards in this space, which were much
needed. And these standards included benchmarks where they agree on a given workload that this
is important and this is worth competing on. And it also included things like graph schema
and graph query languages.
And of course, two and a half years
is nowhere near long enough to complete all this.
So once the EU project concluded successfully,
they created a nonprofit company.
And this nonprofit company had many of the same members
that the European Union project had, as well as new members,
and they kept working on all of these research objectives that I just said.
So this was from 2015, and this is alive to this day.
So the LDBC is now roughly 10, 11 years old, and it has more than 20 members who collaborate on various issues in the form of
benchmark task forces and working groups on query language and schema questions.
Fantastic. And this spans industry and academia, right? There's people from all over the shop
involved in this collaboration, which is really good fun. So we're going to be talking specifically
today about the Business Intelligence workload. And obviously, you mentioned earlier on that a benchmark comes with a framework and a suite, so maybe you can start by
giving the listeners a high-level overview of the various components that make up the Business Intelligence workload?
All right. So as the name suggests, the Business Intelligence workload is
an analytical benchmark in nature. So the main goal of it is to simulate a scenario where you have lots of data and you would like to extract some sort of value from the data, find some observations that may help influence your business decisions.
It can be thought of as a decision support benchmark, in other words.
So the workload basically consists of a couple of components, the first being the dataset.
The name says social network benchmark, and this is indeed modeled after a social network.
When we did this benchmark, we dug into the literature on social networks.
This is actually something you and I did together, Jack.
So we read a bunch of papers on how the social networks work in practice,
how people are added to the network, how they leave the network. And this forms the data set that we use in our benchmark.
And there is an important disclaimer here. We don't use a social network because we believe
this is the only use case for graph databases or even the most important use case. We use the social network because it's easy to understand.
We had accessible real-life data, or at least information on real-life data.
And this is really easy to explain to users.
So we have this social network.
And in this imaginary scenario that's captured by the benchmark,
we run different queries on this workload.
And the queries are quite complex. I think we will touch more on that later,
but suffice to say, these are not microsecond queries.
Probably a query will take 10 seconds or half a minute.
These are really heavy-hitting queries,
like go through all the people in India and do some aggregation
or find all the triangles of people living in China. Obviously,
these are quite difficult to compute and also incur a lot of IO because they need to access
a lot of data. So we have the data set, we have the queries. The data set is not static. So we
have a stream of updates, which is a mix of insert operations with new content and new people added
to the network, and also a bunch of deletes with people deleting themselves or retracting a like or
retracting a message that they made, which makes the workload quite challenging for systems.
And we have a set of parameters that we use to run the queries with. And finally, we have something called scheduling.
So obviously there are many ways to run all these queries
and all these updates,
but we have scheduling rules
on how the queries must follow each other
when the updates may or may not happen,
how many threads must be used, and so on.
Nice. That's a really nice overview there of everything.
Yeah, I know when we were actually designing it, trying to model it on some of the real world, we had some fun. I did things like making a Facebook account and seeing what happens if I join a group, leave a group, and if I delete this and delete that. It was good fun. But anyway, I digress. So let's talk about how we actually go about generating this data then. You've spoken about the data set there, and before we did this paper there was already some pre-existing software called the data generator, Datagen for short, that actually produced this data set. So maybe you can tell us more about Datagen, its background, and the adjustments we had to make to get it to support or generate the data set for this benchmark?
Sure. So we were lucky enough to inherit a functioning Datagen.
And it turns out that these things are not easy to make.
So the official Datagen, back when I joined the project, was built on Apache Hadoop,
and it ran MapReduce jobs for scalability.
And the idea of it was that it tried to create a network that's just realistic enough to
actually stress important aspects of query processing in the systems and actually have
interesting distributions.
And the way they achieved this is that, of course, they didn't want to sacrifice the
scalability.
So they had a couple of tricks.
One trick was that they generated the core graph,
which is the person-knows-person graph,
in a way where the knows edge may be created along three dimensions.
And these dimensions are that the two people went to the same university.
This increases the probability of becoming friends,
as it does in real life. Another dimension is interest. So if they are interested in the same
topic, they are more likely to become friends. And the third dimension is just random. So people
sometimes just bump into other people from all walks of life. And these three dimensions are
layered upon each other. And this is how we get
a person-knows-person graph, which is actually quite realistic. So this is how the core of the graph is created. And this is what gives the graph the computational complexity that the queries will exploit.
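To make the layering idea concrete, here is a toy sketch in Python. The probabilities and field names are made up for illustration; the actual LDBC Datagen is far more sophisticated (it also controls degree distributions and scales out on Spark).

```python
import random

# Toy sketch of the three-dimensional knows-edge generation idea described above:
# a shared university or shared interests raise the connection probability, and a
# small random layer connects people from all walks of life.
def generate_knows_edges(people, p_university=0.3, p_interest=0.1, p_random=0.001):
    """people: list of dicts with 'id', 'university', and 'interests' (a set)."""
    edges = set()
    for i, a in enumerate(people):
        for b in people[i + 1:]:
            p = p_random
            if a["university"] == b["university"]:
                p = max(p, p_university)
            if a["interests"] & b["interests"]:
                p = max(p, p_interest)
            if random.random() < p:
                edges.add((a["id"], b["id"]))
    return edges

people = [
    {"id": 1, "university": "BME", "interests": {"graphs"}},
    {"id": 2, "university": "BME", "interests": {"music"}},
    {"id": 3, "university": "TUM", "interests": {"graphs", "music"}},
]
print(generate_knows_edges(people))
```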
But what gives the weight of the graph, meaning the heavy part of the graph, is actually
the content that the people produce in the social network. If you think about it, Facebook only has maybe a few
billion people, but it has a lot more posts and a lot more comments and a lot more likes edges.
And this is reflected in our data generator. So about 80% of our data is forums and comments.
And these were generated based on dictionaries and data from DBpedia.
So what the data generator originally did is that it just stitched together random words and random topics and slapped on a number of tags on them.
You can think of the tags as hashtags in Twitter and other social networks. So this was
great. This was something that we had, but it had a couple of drawbacks. The first drawback was that
it did not scale beyond scale factor 1000. The scale factor here means the size of the data set
in a CSV format, in an uncompressed textual representation. And 1,000 means roughly one terabyte of CSV,
which is actually not that big of a deal for modern systems. And especially in the territory
of business intelligence where we wanted to position this benchmark, it's not really big
data anymore. So we wanted to crank this up to 10,000 or possibly even larger. The other problem was that the original
data generator did not have any delete operations. The network just kept growing. And this allows for
some shortcuts in the systems. For example, you can use append-only data structures. You can use
algorithms which compute shortest paths, but they do not take deletes into account upon updates. So you can
cut corners and we don't want that. So we ended up reworking this data generator quite thoroughly.
It was a gradual process. So we migrated away from Hadoop to Spark, which was the more modern stack
at the time of the migration. And we also added the deletes. This is actually quite interesting because I'm
from Hungary and there was a Hungarian social network, iWiW, the International Who Is Who.
And luckily for us, once they went defunct, they published their dataset, anonymized, of course,
but the dataset had something called the time of the last login. And this was picked up by researchers at CEU,
the Central European University,
and ELTE, another Hungarian university.
And they analyzed how the network
is basically getting disintegrated.
Who leaves the network first?
And they tried to come up with theories on why they leave first.
So it was incorporated into the updated data
generator so now we have something that is more realistic in terms of how the influx of people
are coming in and how they then leave the network.
Yeah, that's a fantastic answer to that question. All of this makes for a really realistic data set, right, to be the bedrock of the benchmark, and it really allows us to express some really interesting queries on top of it. And with that, it's a nice little segue into how we go about actually designing the workload and the queries, and what our methodology was for doing that. That is called the choke point methodology, right? So can you explain to the listener what the choke point methodology is and how it was used to design the read queries in the Business Intelligence workload?
Absolutely. So you can, of course, design a benchmark by just coming up with ideas
that you would want to ask on a social network. And that's a good basis, but it is not very principled and it will likely
emphasize some aspects a lot more than others. What the choke points-based methodology does
is that it tries to extract the core difficulty of the queries. And by core difficulty, you can
think about aggregation performance, certain types of joins, like outer joins or
semi-joins. You can think of graph workloads like traversal and cheapest path computation.
And then we have this matrix of choke points on one dimension and queries on the other dimension.
And what we try to achieve is to have sufficient coverage for all of the choke points by the queries,
but we try not to overstress any of the choke points.
So we are looking for a balanced coverage of all the choke points.
And that is, of course, a very iterative process.
We keep refining the queries.
We bump into new choke points as we do so.
But ultimately, we got a matrix of 20 queries and roughly 30 choke points that we deemed to be sufficient.
And we were happy with how they covered them.
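As a rough illustration of the query-by-choke-point matrix idea just described, a coverage check could look like the sketch below. The query and choke-point names are made up; they are not the real LDBC choke points.

```python
# Illustrative only: check that every choke point is covered by at least one query
# and that no single choke point dominates the workload.
coverage = {
    "Q1": {"aggregation", "outer_join"},
    "Q2": {"semi_join", "traversal"},
    "Q3": {"cheapest_path", "aggregation"},
}
all_choke_points = {"aggregation", "outer_join", "semi_join", "traversal", "cheapest_path"}

hits = {cp: sum(cp in cps for cps in coverage.values()) for cp in all_choke_points}
uncovered = [cp for cp, n in hits.items() if n == 0]
overstressed = [cp for cp, n in hits.items() if n > len(coverage) // 2]
print(hits)
print("uncovered:", uncovered, "overstressed:", overstressed)
```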
Can you give us an example of one of the queries?
You must have a favorite query in the workload.
Yes, my particular favorite is the triangle query.
So this says that you should start from a country and you traverse to all the cities of that country.
And you try to find triangles of people who live in those cities and they know each other.
And we added one twist.
This is already difficult enough, but we added a twist that they must have become friends in a given time range. Maybe all of them went to the same concert and then they just added each other, or maybe
two of them added each other and then the day after they bumped into someone.
And just to make matters difficult enough, here we only select from the largest countries.
So when we do the parameter selection, it has to be either India or China, the two countries with the largest populations.
And we give quite wide ranges for the dates.
So we have a large pool of candidate people who can be part of the triangles. This query really ties into modern database research, because in order to be efficient on this query, your system is recommended to employ something called worst-case optimal joins, and those worst-case optimal joins can run the triangle query a lot more efficiently than the traditional binary joins that are available in most relational database systems.
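For a feel of why worst-case optimal joins help here, below is a minimal Python sketch (not how any particular engine implements it) that enumerates triangles by intersecting adjacency lists, which is the core trick such joins exploit, instead of materializing the full two-hop join result first.

```python
from collections import defaultdict

def triangles(knows_edges):
    """Enumerate triangles (a, b, c) with a < b < c from an undirected knows edge list."""
    adj = defaultdict(set)
    for u, v in knows_edges:
        adj[u].add(v)
        adj[v].add(u)

    result = []
    for a in adj:
        for b in adj[a]:
            if b <= a:
                continue
            # The key step: intersect the neighbor sets of a and b to find every
            # closing vertex c at once, rather than enumerating all two-hop paths.
            for c in adj[a] & adj[b]:
                if c > b:
                    result.append((a, b, c))
    return result

print(triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))  # [(1, 2, 3)]
```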
Awesome stuff. So yeah, that's a nice little segue there. How do we actually go about selecting these parameters for these queries? On the face of it, this sounds simple, right? You just look at the graph, pick a node ID, plug it into the query, and away you go. But it's a little bit more complicated than that, right, for this data set and for this domain. So can you maybe explain why parameter curation is more challenging here, and then explain our approach to solving this problem in this paper?
Yeah, absolutely.
So obviously, the first try that anyone does when creating a benchmark is what you described: we
just randomly pick input parameters and run the queries. But the shock comes when you plot the results,
because you will have these very wild distributions
with very little actual substance whatsoever.
And we have such distributions in the paper,
so the listeners can look it up.
You get these weird point clouds of random data points.
The reason behind this is that in a graph workload,
you have the data which is very graphy, quote-unquote,
because it has nodes that are supernodes in the network
and it has nodes that are on the verge of the network.
This has many names.
It is called the Justin Bieber problem, for example,
when someone in the social network may have millions of followers, and you also have people
who just joined. And obviously, if you select someone who is really popular and do a two-hop
query along the nose edges, you will touch on millions of nodes, whereas if you select someone
who just joined, you maybe only touch on a dozen nodes.
And this will give a very uneven distribution. And that is going to make it very difficult to
reason about your query results, because it's very difficult to compare different distributions,
especially if you only have a couple of data points. So what we try to achieve instead of this
is we try to ensure that you have something
resembling a normal distribution.
So, for example, in the two-hop person case,
we selected people who have roughly similar numbers of friends
and friends of friends.
And also, because the query also filters on other properties like tags and birthdays,
we tried to use those to counterbalance the effect.
So if someone maybe has a bit more friends,
then we include that person with a narrower range
in terms of the date filtering.
And this tends to work reasonably well.
And we got rid of those weird bimodal,
trimodal and random distributions.
Our distributions are pretty close
to the ideal bell shape or bell curve.
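A very rough illustration of the parameter curation idea (hypothetical code, not the actual LDBC toolchain) is to estimate the amount of work a candidate parameter implies, here via its friends-of-friends count, and only keep candidates close to the median:

```python
from statistics import median

def curate_parameters(two_hop_counts, how_many, tolerance=0.1):
    """two_hop_counts: dict person_id -> number of friends-of-friends."""
    target = median(two_hop_counts.values())
    lo, hi = target * (1 - tolerance), target * (1 + tolerance)
    # Keep only "typical" persons so query runtimes cluster instead of forming
    # bimodal or heavy-tailed distributions.
    candidates = [p for p, c in two_hop_counts.items() if lo <= c <= hi]
    candidates.sort(key=lambda p: abs(two_hop_counts[p] - target))
    return candidates[:how_many]

counts = {"alice": 120, "bob": 4, "justin": 2_000_000, "carol": 110, "dave": 95}
print(curate_parameters(counts, how_many=2))  # ['carol', 'alice']
```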
Fantastic.
Yeah, it's funny you say about
the Justin Bieber problem.
I think over the course of my PhD,
it was the Lady Gaga problem for a bit.
Now the Taylor Swift problem is probably the most appropriate.
She's probably the most famous musician in the world at the minute, maybe.
So yeah, the supernode name changes with whoever's the most popular at the time, right?
Yeah, that's true.
Cool. Anyway, so we've been through all the major components of the benchmark now. So maybe you can talk us through the end-to-end flow. If I wake up tomorrow and I want to benchmark my new fancy graph system, what does that look like? How do we go from end to end with it?
Right. So the benchmark workflow itself has to start with loading the data in the system. So
you have a bunch of files. It's most often a bunch of CSV files, but you can also
convert them to Parquet or have RDF data. And then you have to load it into the system. This is
something that in the BI workload, we care about a bit, but we also don't care about it too much.
So this will factor in into your final score, but it will have a small effect on the final score. In any case,
you can load it. It may take a couple of hours and then you have to start querying.
And the querying has these blocks which measure query performances. And the blocks always start
with running updates. So you apply a day's worth of inserts and a day's worth of deletes. Basically, you can imagine checking the social network at
2 a.m., you see what happened the last day, and you apply
all those changes in your copy of the social network, and then
you run your queries on the network. So this is the workflow, and you keep
repeating these batches of updates
and reads.
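The batch structure can be pictured with a short sketch (assumed interfaces, not the official LDBC BI driver): load once, then alternate between applying a day's worth of writes and running the read queries, timing each block.

```python
import time

def run_bi_workload(system, daily_write_batches, read_queries, parameters):
    # Bulk-load the initial snapshot (e.g. CSV or Parquet files).
    system.load_initial_snapshot()
    timings = []
    for day, writes in enumerate(daily_write_batches):
        system.apply_writes(writes)               # one day's inserts and deletes
        for query in read_queries:                # the 20 analytical read queries
            params = parameters.next(query, day)  # curated parameters for this run
            start = time.perf_counter()
            system.run(query, params)
            timings.append((day, query, time.perf_counter() - start))
    return timings
```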
And of course, you asked, how do you go about implementing this?
Well, you first implement your loader, possibly with a schema and then some automated way to load the CSV files.
Then you implement the queries.
We have 20 queries that you have to implement.
And we have eight insert and eight delete operations and then you just have
to stage this together and you will have your end-to-end workload running of course then you
have to validate it against some existing system and you have to ensure that your system gives the
correct results awesome so what's the end what's the end score at the end and what am i competing
and what's the final value i'm'm like, my system's performed this
compared to other systems.
I've also run the benchmark.
We have two scores inspired by TPC benchmarks.
We have the power score,
which is the single query performance.
And the rationale for this is that if you're an analyst
and you're sitting in front of your computer,
then you want timely answers to your queries.
So we basically want to have a number that's connected to the sheer speed of the system.
If you have a complex query and you hit enter and you have to wait X seconds.
But the other use case is that many of these systems run somewhere on a server.
Maybe they produce some reports.
Maybe they look for fraud or
pre-compute recommendations. And in this case, the individual query runtimes are not that relevant.
What's relevant is the throughput of the system. So we have a throughput score where you can run
queries concurrently, and that will be characterized by the final throughput score.
And we actually have two more scores, which are the
aforementioned power score and throughput score, but they are adjusted with the system price, because
we take the system price into account. So we have dollar-adjusted throughput and power scores.
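The exact scoring formulas live in the specification rather than in this conversation, but the intuition can be sketched as follows. These are hypothetical formulas that only mirror what is described above: a power score dominated by slow queries, a throughput score based on work per hour, and price-adjusted variants.

```python
from math import prod

def power_score(query_runtimes_seconds):
    """Higher is better; a geometric mean penalizes any single slow query."""
    geo_mean = prod(query_runtimes_seconds) ** (1.0 / len(query_runtimes_seconds))
    return 1.0 / geo_mean

def throughput_score(queries_completed, wall_clock_seconds):
    """Higher is better; how much concurrent work finishes per hour."""
    return queries_completed / (wall_clock_seconds / 3600.0)

def price_adjusted(score, system_price_dollars):
    return score / system_price_dollars

print(power_score([1.5, 2.0, 30.0]))   # dominated by the 30-second query
print(throughput_score(2000, 7200))    # 1000 queries per hour
```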
Awesome. So yeah, in the paper you took two systems for a spin; you went through this workflow with them. So can you maybe tell us about that experience? And I guess before you do that, how long do you think it should take someone on average? I know this is probably a how-long-is-a-piece-of-string sort of question, but how long to implement this and get a first iteration of the whole benchmark set up, running, and generating scores?
Okay, so it's a good question. The way I
explained it before made it sound very simple because you have to load some data in your system
that you likely already know very well. And then you have to implement 20 queries and a bunch of
updates. But it turns out that implementing those 20 queries alone is a huge undertaking because there are no standard languages in the graph space
as of today, barring the SQL PGQ extension. So it is quite a lot of work just to formulate your
queries in a different language. It can be Cypher, GSQL, GQL, SPARQL, et cetera.
This will take probably a few weeks for you to implement the queries correctly.
The updates are usually simpler to implement, but the challenge then is to ensure that everything is correct.
And that's challenging for both the reads and the updates.
So you will likely spend a few more weeks debugging and optimizing your query performance.
And this is not to mention that usually when people do this, they discover some bottlenecks
in the system and that will necessitate some work from the engineering department and likely
a new release of the system.
So when you put all this together, you can get an initial implementation in about a month,
but to attempt a serious implementation is probably going to be a multi-month effort
Fantastic. Yeah, so when you actually start thinking about it, it ends up taking a lot longer than you'd initially think. It's like, I'll do it over a long weekend, and then it becomes a multi-month project. Cool. So in the paper there are two systems that you took the benchmark for a spin with.
Can you tell us about the systems and what the experience was like kind of getting it to work on those systems and what your analysis of the results were?
Of course. So we picked two systems and these couldn't be much more different.
So the first system that we used was Umbra. This is developed at TU Munich,
so it is an academic system. And it's a single node, so it's a non-distributed system,
and relational database management system. They support most of the syntax of PostgreSQL,
and they go to great lengths to ensure that they do larger than memory processing,
so they can spill to disk, specifically with NVMe SSDs in mind. And they have a state-of-the-art
JIT compiled query engine, which actually takes the SQL query and emits x86 machine code from it.
So it compiles not to LLVM, but to an even lower level, which makes it super fast
for compilation. And it produces very good query plans and very good code to run as a consequence.
So this was one system that we used. And the other system was TigerGraph. TigerGraph
is an American company which creates a distributed database system. This is a native graph database,
so they don't have a relational storage engine, but they treat all your data as a graph.
And they support a language called GSQL, which is specifically created for graph traversal
and graph algorithms. Cool. So that's fantastic. So are there any limitations with the benchmark at the
moment then? And are there any gaps in it at the moment that you think maybe could be improved in
the future? Definitely. There is always more work to be done in benchmarks. I think one of the most
glaring omissions that we made in the BI workload is that it doesn't really have graph algorithms
in the sense that we never do something like a connected components computation
or a community detection or something like that on the graph. This is something that I could imagine
in an actual workload running on a similar data set. That would probably be interesting. And also GNNs, graph neural networks as a whole,
are graph algorithms that many people run on different graph-shaped data, and this would make
a great addition to the benchmark. It would, of course, come at the expense of more complexity.
Yeah, of course, it'd be a really interesting benchmark, right? And I guess it would be a closer approximation of what people might want to do in practice. Maybe the BI 2.0, I don't know. Cool. And I guess on that, we can maybe touch on what's next for LDBC and the BI workload. Is there going to be a 2.0, or what's it looking like at the moment?
So LDBC has some guidelines on renewal,
because obviously these benchmarks don't stay current and fresh for an infinite amount of time
so we try to do renewals in five to ten years. This is still quite some distance out, so the BI
workload is safe. Our goal with the BI workload is to have
more audited systems. And I think this is a good opportunity to talk about audits.
LDBC has a process where we put out our benchmarks in the public with really detailed specifications
on how they should be used. And companies can then hire an independent third-party auditor
who is actually running the benchmark for them.
So it's a complete reproduction of the benchmark, followed by a very detailed report called the full disclosure report.
And this is reviewed by the benchmark task force.
So all of this is to guarantee that the benchmark was done correctly.
It's a faithful implementation, and the execution didn't have any issues in it.
So for the BI workload, we are currently in this phase where we are facilitating audits.
We already had a couple of audits that were conducted successfully and we have more on the way.
So this is the current status for us.
Fantastic. How many have been completed so far?
So there are four successful ones.
All of them were done by the TigerGraph system on different setups. One was an on-premises machine and the others were in the cloud.
And there is one more that's currently underway and more planned.
Fantastic stuff.
So how does this work?
What then do you think?
Let's position it against other benchmarks in the space,
and how do you think it compares with some of the things from TPC
and other LDBC workloads?
Because this isn't the only one, right?
We've got other ones as well, so maybe you can talk about those as well.
Sure. So LDBC has a couple of benchmarks.
It has the original social network benchmark,
which is called the interactive workload.
So the interactive workload
simulates transactional processing. So it has short writes and short
reads. Compared to this, this is basically the other end of the scale. So this is OLTP versus
OLAP. Then we have a RDF-focused benchmark called the semantic publishing benchmark.
That's more focused on inferencing and being able to handle RDF-shaped data with SPARQL queries.
And we have a new, very interesting benchmark called the FinBench, which targets distributed systems.
And it does graph processing on huge amounts of data with relaxed consistency constraints.
This is something that's just been released as version v0.1.
And this is something where you really have to ensure
that your system cannot only answer the benchmark queries,
but it can do so very quickly
under 100 milliseconds in most cases.
So this is, again,
probably an even more transactional benchmark
than the LDBC SNB Interactive workload.
Nice. Yeah, cool.
So I'd like to ask now, kind of around this time,
what impact do you think that your work can have?
Obviously, with all of the audits and that's some real sort of tangible real-world impact.
Do you think there's anything else, maybe from more of a software developer
or data engineer sort of perspective? What impact can it have on their day-to-day lives, or how could they actually leverage the
findings from the paper or from the benchmark?
Sure. So if you have been in research, and I
know you have, then you know how valuable a good data set is. So I think by far the biggest
contribution of this paper is having openly available realistic graph data sets that you can download for free and you can download them in many different formats and many different scale factors.
So I actually see some statistics on how many people download these and they can be quite popular and they can pop up in places that are not restricted to graph databases or graph data
management. Even if you have something like a relational query engine, you can use this data
to load it into the engine and test it. Yeah, so as I said, I expect a lot of impact from
these datasets. And I just looked up the statistics that we have on the surfsara.nl repository.
And we have more than 12,000 downloads amassed for the old interactive workload datasets that range to scale factor 1,000.
So obviously, we expect similar interest or even more for the larger and more interesting datasets.
Yeah, there's some real world impact for you right there. That's fantastic.
Cool. Yeah, so I guess kind of the next question I also kind of like to ask my guests on the show is
that when you've been working on this project, and it's obviously a multi-year project,
you've worked on this for a long time. What was the most interesting lesson that you
encountered or got from that whole experience of working on this paper and project, and LDBC as a whole, right?
Because it's not just this one paper; you've worked on it as part of the whole project.
Sure. So obviously it has been a huge learning experience. Because we went for large-scale
data, that taught me an awful lot about cloud computing and how to run software in the
cloud. And it had one very important lesson, which is we have these dataset sizes called
scale factors that we scale exponentially. So we have scale factors 1, 3, 10, 30, 100, 300,
and so on. So basically every dataset is 3x larger than the previous one. And this is already an
exponential increase. But in reality, if you start scaling up, then things start to break
and some problems start to show up. Scaling up the data set along this exponential scale is actually
a super exponential problem because you will have more frequent crashes and you will end up spending quite a lot of money on cloud compute and cloud storage costs and
running elastic map reduce in AWS. And so this is something that you have to plan really well.
And then you have to make sure that you have the funds to do that. And you also have the time
to generate, test, package, and just handle all these data sets.
Awesome. Yeah, I guess on the flip side of that then, what were the things on this journey that failed, things that you tried that didn't work? What are the war stories? Obviously, it's never plain sailing, right? It's never just one big upward trajectory. So yeah, what were some of the more challenging aspects of the journey?
Sure. So one interesting finding was that there are these graph analytical systems
which are geared towards taking a graph that's mostly static in nature
and then doing fast computations on top of it.
In the distributed systems area, you can think of Apache GraphX
or you can think of Apache Giraph or Flink Gelly.
And in the single-node systems, you have GraphBLAS and other systems.
And on the surface, these would be kind of ideal for a BI workload.
You can load your data into them, and then they are really efficient in doing the graph
computations.
But it turns out that most of these are really geared towards untyped,
unattributed graphs. So even though we could express most of the LDBC BI queries in these
systems, expressing them was a lot of work and the performance wasn't great. So we ended up dropping
these, and we restricted ourselves to database systems in the workload. So this is
something we tried that didn't really work out.
Yeah, I remember this. Was it RapidMatch or something, one of the systems we tried?
Yes. So there was an interesting tangent to this work. It's funny you should mention this, because
we realized after a while that this workload is very complex. And as we just discussed, if you start implementing this, probably it's going to take you months.
So we did a simplified version of this called LSQB, the Labelled Subgraph Query Benchmark,
which is in essence the data set of the BI workload stripped back to the typed graph.
So you still have things like person and tag and city, but you no longer have the
attributes. It's just a labeled graph. And we took the queries of the BI workload. We selected
six that had some interesting graph patterns, and we also stripped them back to the types. So we no
longer had attributes or any filtering conditions in them. And then we had this micro benchmark,
which you could implement really quickly, like in a day for most systems.
And we didn't mean that to be an official LDBC benchmark that you can audit.
But it was rather a tool that the researchers working on a system can use as a litmus test.
And they can ask questions like, can I beat Postgres? Can I beat Umbra? Can I beat existing state-of-the-art graph database systems?
And that's where we used the RapidMatch system, which was specifically created to compute
graph isomorphism and graph homomorphism problems.
So there we had some interesting findings.
And there, of course, with this approach, the graph analytical slash graph algorithm systems
had a better chance
because there were no properties in the graph.
Yeah, and I guess, well, this is like another thing. We spoke about real-world impact earlier on, but this also makes the benchmark more accessible to academics as well, right? Not needing all the trimmings and everything else that goes with an industry-grade benchmark, and still having some impact in that space, and having something more accessible, which I think was a really nice contribution to the space anyway.
Yeah, it was fun to work on, for sure. Cool. But yeah, we were talking about other research here, so I know
you've worked on so many things over the course of your career in academia, Gabor.
So can you tell us maybe about some of the other things you've worked on as well?
Sure. So I spent quite a long stint working on GraphBLAS, which I just mentioned in the category of graph analytical systems. Most people are probably familiar with BLAS, B-L-A-S,
which is there for creating a standard API
for dense matrix and vector operations.
This is something that has been around for 30 plus years.
What GraphBLAS does is that it takes this idea
to the realm of sparse matrices,
which are excellent for representing graphs,
and to the realm of semiring computations
on these matrices, which are a great tool for defining graph computations. And I always found it
extremely elegant that you can say, I have a matrix representing the graph, I have a vector
representing the positions in the graph, the nodes that I am currently visiting. And then you can do something called the min plus multiplication.
So you use matrix multiplication with a semi ring,
where the plus operator is actually the minimum operator.
And the multiplication operator is actually the plus, the addition operator.
And this allows you to find shortest paths in the graph.
So it is very concise and it is very machine friendly
because you can have a sparse matrix computation algorithm
that has no understanding of graphs.
It just knows how to multiply a sparse matrix with a sparse vector.
And then you can express all of these problems
like single source shortest path, BFS
and other graph traversal
algorithms really elegantly and really performantly.
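As a tiny illustration of the min-plus idea (plain Python rather than the GraphBLAS API), one relaxation step takes the minimum over dist[i] plus the edge weight; iterating it from a source vertex yields single-source shortest paths:

```python
INF = float("inf")

def min_plus_step(weights, dist):
    """One min-plus matrix-vector step: new[j] = min(dist[j], min_i(dist[i] + weights[i][j]))."""
    n = len(weights)
    return [
        min([dist[j]] + [dist[i] + weights[i][j] for i in range(n)])
        for j in range(n)
    ]

# Adjacency matrix of a tiny weighted graph (INF means no edge).
W = [
    [INF, 2.0, 5.0],
    [INF, INF, 1.0],
    [INF, INF, INF],
]
dist = [0.0, INF, INF]           # distances from vertex 0
for _ in range(len(W) - 1):      # at most n - 1 relaxation rounds
    dist = min_plus_step(W, dist)
print(dist)                      # [0.0, 2.0, 3.0]
```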
I'm kind of thinking there, has there been any sort of crossover between the two loves of your life there? You've got your GraphBLAS and LDBC. Have you ever tried it on any of the benchmarks, any of the workloads? Because we have a graph algorithms benchmark, right? So have you tried it on that, and how did it perform?
So we have a graph algorithms benchmark, as you just mentioned. It's called Graphalytics, which is a portmanteau of graph and analytics.
And we have a GraphBLAS implementation for it.
And we even have a Graphalytics competition running this year,
but we didn't announce the results yet,
so maybe your listeners can check back on our websites. But GraphBLAS did participate, and there were other
systems, so I'm personally also really curious to see what the results are and how it fares against
other systems which may use a completely different computation model.
Yeah, sure. Do we know when those results are going to be made public?
Around the end of the year, so in the coming weeks.
Okay, just in time for Christmas.
Cool, great. Now we've just got two more questions, Gabor. The penultimate one is my
favorite, and it's about the creative process. So how do you go about generating ideas, and then,
once you have generated them, how do you select which ones to work on?
Well, as you may have guessed, I like to select projects which have a chance of
creating a real-world impact. So I actually spent a lot of time just reading forums, reading Stack
Overflow, even answering questions in the graph space, and then joining Slack and talking to people on these Slack channels and gathering an impression of what they find difficult
and what they deem interesting.
And of course, I also do this at conferences and any opportunity that I get to talk to
fellow researchers.
And I try to compact it down to core ideas,
like what is the main problem?
Is it the data loading?
Is it that the join is slow?
Is it aggregation performance?
Is it the visualization?
And then I tried to tackle that in my research.
Because I did benchmark design for a long time,
this was really natural
because if you hear something that's bugging people
in terms of performance
then you just create a new choke point in the LDBC benchmark, and then you design new queries
and you've got that covered, and hopefully a couple of years later someone will implement an optimization
that will tackle just this same issue.
Fantastic. It's another good answer to that question; it's great to look inside your mind and see how it all works. Great stuff. Cool. So it's the last question now, the last word. What's
the one thing you want the listener to take away from your work and from this podcast
today?
So I've worked on this benchmark between early 2017 and the end of 2022, so that spans a six-year time span.
And this implies that it was difficult.
So it took a long time.
And I think benchmarks in general are surprisingly hard.
If you want to have something that's really realistic, objective,
and representative of a real-world use case, and that also
scales, then you will have to invest a lot of time into actually making this a reality.
So this has a big implication on systems builders, because if they build a system and they want to
benchmark it in their papers,
they have to be really careful about selecting a good benchmark and then abiding by the principles
of those benchmarks. And I would slightly discourage people from trying to invent their
own benchmarks, because we know that in most cases, this will end with the authors of the paper winning in almost every single aspect of
the benchmark, which is not necessarily helpful for evaluating the merit of an idea. So I would
take an off-the-shelf benchmark, probably one that's widely used. TPC benchmarks are great,
and I can also recommend LDBC benchmarks, and try to use that.
You can simplify them.
You can definitely implement them partially.
It's not expected that all papers implement the full workload.
But if you do this, you will have a head start in terms of using a more realistic data set,
using more interesting queries, more updates, and so on. So ultimately, I think it will benefit your paper.
Fantastic. And let's end it there then. Thank you so much, Gabor, for coming on the show. We'll put links to everything in the show notes so the listener can go and find out and research all of the cool things we've spoken about today. And where can we find you on the socials, Twitter or LinkedIn? What's the handle?
My handle is my last name, Szarnyas, and my first name's initial, G.
So szarnyasg on Twitter and GitHub and most other platforms.
And also, LDBC has a Twitter account.
It is twitter.com slash LDBCouncil.
Perfect.
We'll put those links in as well.
And a reminder that if you do enjoy the show
please do consider supporting us through buy me a coffee it really helps us to keep
making the show, and we'll see you all next time for some more awesome computer science research. Thank you.