Disseminate: The Computer Science Research Podcast - Gábor Szárnyas | The LDBC Social Network Benchmark: Business Intelligence Workload | #44

Episode Date: December 4, 2023

Summary: In this episode, Gábor Szárnyas takes us on a journey through the LDBC Social Network Benchmark's Business Intelligence workload (SNB BI). Developed through collaboration between academia and industry, the SNB BI is a comprehensive graph OLAP benchmark. It pushes the boundaries of synthetic and scalable analytical database benchmarks, featuring a sophisticated data generator and a temporal graph with small-world phenomena. The benchmark's query workload, rooted in LDBC's innovative design methodology, aims to drive future technical advancements in graph database systems. Gábor highlights SNB BI's unique features, including the adoption of "parameter curation" for stable query runtimes across diverse parameters. Join us for a succinct yet insightful exploration of SNB BI, where Gábor Szárnyas unveils the intricacies shaping the forefront of analytical data systems and graph workloads.

Links: VLDB'23 Paper | Gábor's Homepage | LDBC Homepage | LDBC GitHub

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. A reminder, if you enjoy the show, please do consider supporting us through Buy Me a Coffee. It really helps us to keep making the show. Today, it's my great pleasure to say I'm joined by one of my good friends, Gábor Szárnyas, who will be telling us everything we need to know about the LDBC Social Network Benchmark's Business Intelligence workload. Gábor is a dev rel at DuckDB. And yeah, that's a little brief intro for you there, Gábor. So can you tell us more about yourself and how you became interested in database management research? Sure. So thanks for having me. I had a lengthy stint in academia, and I became fascinated with databases as an undergraduate student.
Starting point is 00:01:06 Back in 2010, I had a really nice undergraduate database course, and it was quite theoretical. As usual at the time, it was about normal forms and calculus and making theorems on transactions and proving them, but I really enjoyed it. So I chose my undergraduate topic on this, and I did a thesis on bitmap indexing, which is more sort of a systems work. And then I joined a group which worked on a distributed graph query engine. This was challenging because we wanted to make it incremental. So my master's project was prototyping this. And of course, part of my master's project was also discovering that this is not only
Starting point is 00:01:45 a master's project. This is something that you can do for a more extended period of time. So I joined the same research group as a PhD student, and I kept building this system. And as I was building it, I realized that we needed a really good benchmark. There was already a benchmark that other researchers in the group had started; they had been struggling with getting it published for a couple of years. So I joined that project and got it to publication. And this was my first serious attempt at benchmark framework building. And it was very interesting. We did incremental graph queries.
Starting point is 00:02:23 We had dozens of systems from various spaces like RDF and graph databases and SQL. So I already played with Neo4j, MySQL, SQLite, and many other systems. So this was a very good introduction. And then I finished my PhD. Around the end of my PhD, I started to work on the LDBC benchmark, which is the topic of today's conversation. So I did my work on that as a postdoc, and now I have moved to industry. I'm at DuckDB Labs, which is making the DuckDB embeddable database product. Fantastic. I'm learning more about you every day, Gábor. I didn't know all that stuff. So yeah, the benchmark you
Starting point is 00:03:01 referenced there, was that the Train Benchmark? It was the Train Benchmark. So this is something that's set in the railway domain. It captures a safety-critical domain, where you build a railway model with some sort of dedicated design software. And as you build it, ideally you would want to have timely feedback on any error or any issue that you may insert into the system. And it turns out that if you translate this to the language of databases, what you basically want is an incremental query engine that also understands graphs,
Starting point is 00:03:33 so it can do reachability and similar queries. So that was the benchmark. And, of course, it was very difficult to tackle for systems, because it has big global queries on a complex graph that you can think about as a property graph, or you can think about as an object graph. Nice, nice. Well, let's talk about another benchmark then. Before we do that, I should probably start with some background for the listeners. So maybe you can tell us what LDBC is and what it stands for. And I should probably say at this point as well,
Starting point is 00:04:02 as a bit of a disclaimer, I'm also a co-author on this paper, but I'm going to try and remain impartial all the way through. So yeah, the floor is yours, Gábor. Tell us about LDBC. All right. So LDBC originally started as a European Union research project. It was two and a half years long and started in late 2012, with the initial goal of creating a consortium where industry and academia can collaborate. They can define standards in this space, which were much needed. These standards included benchmarks, where they agree that a given workload is important and worth competing on. And it also included things like graph schema and graph query languages.
Starting point is 00:04:47 And of course, two and a half years is nowhere near long enough to complete all this. So once the EU project concluded successfully, they created a nonprofit company. And this nonprofit company had many of the same members that the European Union project had, as well as new members. And they kept working on all of these research members that the European Union project had, as well as new members, and they kept working on all of these research objectives that I just said.
Starting point is 00:05:10 So this was from 2015, and it is alive to this day. So LDBC is now roughly 10, 11 years old, and it has more than 20 members who collaborate on various issues in the form of benchmark task forces and working groups on query language and schema questions. Fantastic. And this spans industry and academia, right? There's people from all over the shop involved in this collaboration, which is really good fun. So we're going to be talking specifically today about the Business Intelligence workload. And obviously, you mentioned earlier on that a benchmark comes with a framework and a suite, so maybe you can start by giving the listeners a high-level overview of the various components that make up the Business Intelligence workload. All right. So as the name suggests, the Business Intelligence workload is
Starting point is 00:05:58 an analytical benchmark in nature. The main goal of it is to simulate a scenario where you have lots of data and you would like to extract some sort of value from the data, find some observations that may help influence your business decisions. It can be thought of as a decision support benchmark in OLAP terms. The workload basically consists of a couple of components, the first being the dataset. The name says Social Network Benchmark, and this is indeed modeled after a social network. When we did this benchmark, we dug into the literature on social networks. This is actually something you and I did together, Jack. So we read a bunch of papers on how social networks work in practice, how people are added to the network, how they leave the network. And this forms the data set that we use in our benchmark.
Starting point is 00:06:53 And there is an important disclaimer here. We don't use a social network because we believe this is the only use case for graph databases or even the most important use case. We use the social network because it's easy to understand. We had accessible real-life data, or at least information on real-life data. And this is really easy to explain to users. So we have this social network. And in this imaginary scenario that's captured by the benchmark, we run different queries on this workload. And the queries are quite complex. I think we will touch more on this workload. And the queries are quite complex.
Starting point is 00:07:25 I think we will touch more on that later, but suffice to say, these are not microsecond queries. Probably a query will take 10 seconds or half a minute. These are really heavy-hitting queries, like go through all the people in India and do some aggregation or find all the triangles of people living in China. Obviously, these are quite difficult to compute and also incur a lot of IO because they need to access a lot of data. So we have the data set, we have the queries. The data set is not static. So we
Starting point is 00:07:58 have a stream of updates, which is a mix of insert operations with new content and new people added to the network. And it's also a bunch of deletes with new content and new people added to the network. And it's also a bunch of deletes with people deleting themselves or retracting a like or retracting a message that they made, which makes the workload quite challenging for systems. And we have a set of parameters that we use to run the queries with. And finally, we have something called scheduling. So obviously there are many ways to run all these queries and all these updates, but we have scheduling rules
Starting point is 00:08:33 on how the queries must follow each other, when the updates may or may not happen, how many threads must be used, and so on. Nice. That's a really nice overview there of everything. Yeah, I know when we were actually designing it, trying to model it on the real world, we had some fun. I did like making a Facebook account and seeing what happens if I join a group, leave a group, and if I delete this and delete that. It was good fun. But anyway, I digress. So let's talk about how we actually go about generating this data then. You've spoken about the data set there, and before we did this paper there was already some pre-existing
Starting point is 00:09:09 software called the data generator, Datagen for short, that actually produced this data set. So maybe you can tell us more about Datagen, its background, and the adjustments we had to make to get it to generate the data set for this benchmark? Sure. So we were lucky enough to inherit a functioning Datagen. And it turns out that these things are not easy to make. The official Datagen, back when I joined the project, was written on top of Apache Hadoop, and it ran MapReduce jobs for scalability. And the idea of it was that it tried to create a network that's just realistic enough to actually stress important aspects of query processing in the systems and actually have
Starting point is 00:09:55 interesting distributions. And the way they achieved this is that, of course, they didn't want to sacrifice the scalability. So they had a couple of tricks. One trick was that they generated the core graph, which is the person-knows-person graph, in a way where a knows edge may be created along three dimensions. The first dimension is that the two people went to the same university.
Starting point is 00:10:22 This increases the probability of becoming friends, as it does in real life. Another dimension is interest: if they are interested in the same topic, they are more likely to become friends. And the third dimension is just random, because people sometimes just bump into other people from all walks of life. And these three dimensions are layered upon each other. And this is how we get a person-knows-person graph which is actually quite realistic. So this is how the core of the graph is created, and this is what gives the graph the computational complexity that the queries will exploit.
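To illustrate the layering idea only, here is a minimal Python sketch; the probabilities and field names below are invented for illustration and are not Datagen's actual parameters or algorithm:

```python
import random

# Hypothetical sketch of layering the three dimensions (university, interest,
# random) when deciding whether to create a person-knows-person edge.
# The real LDBC Datagen is far more elaborate; these probabilities are made up.
def maybe_create_knows(a, b, rng=random.random):
    p = 0.0005                          # random dimension: everyone gets a small chance
    if a["university"] == b["university"]:
        p += 0.05                       # same university boosts the probability
    if a["interests"] & b["interests"]:
        p += 0.02                       # shared interests boost it further
    return rng() < p

alice = {"university": "TU Delft", "interests": {"jazz", "chess"}}
bob = {"university": "TU Delft", "interests": {"chess"}}
print(maybe_create_knows(alice, bob))   # True with probability ~0.07 in this sketch
```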
Starting point is 00:11:04 But what gives the weight of the graph, meaning the heavy part of the graph, is actually the content that the people produce in the social network. If you think about it, Facebook only has maybe a few billion people, but it has a lot more posts and a lot more comments and a lot more likes edges. And this is reflected in our data generator. So about 80% of our data is forums and comments. And these were generated based on dictionaries and DBpedia data. So what the data generator originally did is that it just stitched together random words and random topics and slapped a number of tags on them. You can think of the tags as hashtags on Twitter and other social networks. So this was great. This was something that we had, but it had a couple of drawbacks. The first drawback was that
Starting point is 00:11:51 it did not scale beyond scale factor 1,000. The scale factor here means the size of the data set in CSV format, in an uncompressed textual representation. And 1,000 means roughly one terabyte of CSV, which is actually not that big of a deal for modern systems. And especially in the territory of business intelligence, where we wanted to position this benchmark, it's not really big data anymore. So we wanted to crank this up to 10,000 or possibly even larger. The other problem was that the original data generator did not have any delete operations. The network just kept growing. And this allows for some shortcuts in the systems. For example, you can use append-only data structures. You can use algorithms which compute shortest paths but do not take deletes into account upon updates. So you can
Starting point is 00:12:46 cut corners, and we don't want that. So we ended up reworking this data generator quite thoroughly. It was a gradual process. We migrated away from Hadoop to Spark, which was the more modern stack at the time of the migration. And we also added the deletes. This is actually quite interesting, because I'm from Hungary, and there was a Hungarian social network, iWiW, the International Who Is Who. And luckily for us, once they went defunct, they published their dataset, anonymized, of course, but the dataset had something called the time of the last login. And this was picked up by researchers at CEU, the Central European University, and ELTE, another Hungarian university.
Starting point is 00:13:30 And they analyzed how the network basically got disintegrated: who leaves the network first, and they tried to come up with theories on why they leave first. This was incorporated into the updated data generator, so now we have something that is more realistic in terms of how the influx of people come in and how they then leave the network. Yeah, that's a fantastic answer to that question. And all of this makes for a really
Starting point is 00:14:05 really really realistic um data set right to to be the sort of the bedrock of the the benchmark and really allows us to sort of express some really interesting queries on top of that and kind of with that it's a nice little segue into how we go about actually designing the workload and the queries and what our methodology was for for doing that and and that is called the chokepoint methodology right so can you kind of explain to the listener what the choke point methodology is and how that was used to design the read queries in the business intelligence workload? Absolutely. So you can, of course, design a benchmark by just coming up with ideas that you would want to ask on a social network. And that's a good basis, but it is not very principled and it will likely
Starting point is 00:14:47 emphasize some aspects a lot more than others. What the choke-point-based methodology does is that it tries to extract the core difficulty of the queries. And by core difficulty, you can think about aggregation performance, or certain types of joins, like outer joins or semi-joins. You can think of graph workloads like traversal and cheapest path computation. And then we have this matrix of choke points on one dimension and queries on the other dimension. And what we try to achieve is to have sufficient coverage for all of the choke points by the queries, but we try not to overstress any of the choke points. So we are looking for a balanced coverage of all the choke points.
Starting point is 00:15:32 And that is, of course, a very iterative process. We keep refining the queries, and we bump into new choke points as we do so. But ultimately, we got a matrix of 20 queries and roughly 30 choke points whose coverage we deemed sufficient, and we were happy with how the queries covered them. Can you give us an example of one of the queries? You must have a favorite query in the workload. Yes, my particular favorite is the triangle query.
Starting point is 00:15:59 So this says that you should start from a country and traverse to all the cities of that country. And you try to find triangles of people who live in those cities and know each other. And we added one twist. This is already difficult enough, but we added a twist that they must have become friends in a given time range. Maybe all of them went to the same concert and then they just added each other, or maybe two of them added each other and then the day after they bumped into someone. And just to make matters difficult enough, here we only select from the largest countries. So when we do the parameter selection, it has to be either India or China, the two countries with the largest populations. And we give quite wide ranges for the dates.
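To pin down the semantics of this query, here is a deliberately naive Python sketch with made-up in-memory data structures; the benchmark itself states this declaratively (SQL, Cypher, GSQL, and so on), and the brute-force nested loops below are exactly what a smarter join algorithm avoids:

```python
from datetime import date

# Made-up toy graph: person id -> set of (friend id, friendship creation date).
knows = {
    1: {(2, date(2012, 6, 1)), (3, date(2012, 6, 2))},
    2: {(1, date(2012, 6, 1)), (3, date(2012, 6, 3))},
    3: {(1, date(2012, 6, 2)), (2, date(2012, 6, 3))},
}
city_of = {1: "Delhi", 2: "Mumbai", 3: "Delhi"}
cities_of_country = {"Delhi", "Mumbai"}  # cities of the chosen country, e.g. India

def triangles(start, end):
    """Yield triangles (a, b, c) of people living in the country's cities,
    where all three friendships were created within [start, end]."""
    def befriended(u, v):
        return any(f == v and start <= d <= end for f, d in knows[u])
    people = sorted(p for p in city_of if city_of[p] in cities_of_country)
    for i, a in enumerate(people):
        for b in people[i + 1:]:
            if befriended(a, b):
                for c in people:
                    if c > b and befriended(a, c) and befriended(b, c):
                        yield (a, b, c)

print(list(triangles(date(2012, 1, 1), date(2012, 12, 31))))  # [(1, 2, 3)]
```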
Starting point is 00:16:51 So we have a large pool of candidate people who can be part of the triangles. And this query really ties into modern database research, because in order to be efficient on this query, your system is recommended to employ something called worst-case optimal joins, and those worst-case optimal joins can run the triangle query a lot more efficiently than the traditional binary joins that are available in most relational database systems. Awesome stuff. So yeah, that's a nice little segue there: how do we actually go about selecting these parameters for these queries? On the face of it, this sounds simple, right? You just look at the graph, pick a node ID, plug it into the query, and away you go. But it's a little bit more complicated than that, right, for this data set
Starting point is 00:17:39 and for this domain. So can you maybe explain why parameter curation is more challenging here, and then explain our approach to solving this problem in the paper? Yeah, absolutely. So obviously the first thing anyone tries when creating a benchmark is what you described: we just randomly pick input parameters and run the queries. But the shock comes when you plot the results, because you will have these very wild distributions with very little actual substance whatsoever. And we have such distributions in the paper, so the listeners can look it up. You get these weird point clouds of random data points.
Starting point is 00:18:22 The reason behind this is that in a graph workload, you have data which is very "graphy", quote-unquote, because it has nodes that are supernodes in the network, and it has nodes that are on the fringe of the network. This has many names. It is called the Justin Bieber problem, for example: someone in the social network may have millions of followers, while you also have people who just joined. And obviously, if you select someone who is really popular and do a two-hop
Starting point is 00:18:56 query along the knows edges, you will touch millions of nodes, whereas if you select someone who just joined, you maybe only touch a dozen nodes. And this will give a very uneven distribution. And that is going to make it very difficult to reason about your query results, because it's very difficult to compare different distributions, especially if you only have a couple of data points. So what we try to achieve instead is to ensure that you have something resembling a normal distribution. So we selected, for example, in the person two-hop case,
Starting point is 00:19:34 we selected people who have a roughly similar number of friends and friends of friends. And also, because the query also filters on other properties like tags and birthdays, we tried to use those to counterbalance the effect. So if someone has a bit more friends, then we include that person with a narrower range in terms of the date filtering. And this tends to work reasonably well: we got rid of those weird bimodal, trimodal, and random distributions, and our distributions are pretty close to the ideal bell curve.
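A minimal sketch of the idea, with made-up data (the published parameter curation pipeline is considerably more sophisticated): bucket candidate persons by the size of their two-hop neighborhood and keep the most typical ones, so runtimes cluster instead of scattering.

```python
import random
import statistics

# Illustrative parameter curation sketch: prefer starting persons whose
# two-hop neighborhood size is close to the median, so query runtimes
# stay comparable across the selected parameters.
def curate_person_parameters(two_hop_size, n):
    """two_hop_size: person id -> |friends| + |friends of friends|."""
    median = statistics.median(two_hop_size.values())
    # Rank all candidates by how "typical" their neighborhood size is
    # and keep the n closest to the median.
    ranked = sorted(two_hop_size, key=lambda p: abs(two_hop_size[p] - median))
    return ranked[:n]

sizes = {p: random.randint(1, 10_000) for p in range(1_000)}  # fake statistics
print(curate_person_parameters(sizes, 10))
```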
Starting point is 00:20:00 Fantastic. Yeah, it's funny you say about the Justin Bieber problem.
Starting point is 00:20:17 I think over the course of my PhD, it was the Lady Gaga problem for a bit. The Taylor Swift problem is probably the most appropriate now; she's probably the most famous musician in the world at the minute. So yeah, I don't know,
Starting point is 00:20:44 the supernode name changes with whoever's the most popular at the time, right? Yeah, that's true. Cool. Anyway, we've been through all the major components of the benchmark now, so maybe you can talk us through the end-to-end flow. If I wake up tomorrow and I want to benchmark my new fancy graph system, what does that look like? How do we go from end to end with it? Right. So the benchmark workflow itself has to start with loading the data into the system. You have a bunch of files, most often CSV files, but you can also convert them to Parquet or have RDF data, and you have to load them into the system. This is something that in the BI workload we care about a bit, but we also don't care about too much: it will factor into your final score, but it has a small effect on it. In any case,
Starting point is 00:21:26 you can load it. It may take a couple of hours, and then you have to start querying. And the querying has these blocks which measure query performance. The blocks always start with running updates. So you apply a day's worth of inserts and a day's worth of deletes. Basically, you can imagine checking the social network at 2 a.m.: you see what happened during the last day, you apply all those changes in your copy of the social network, and then you run your queries on the network. So this is the workflow, and you keep repeating these batches of updates and reads.
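As a rough illustration, the batch loop looks something like this minimal Python sketch; the class and method names here are made up and are not the official LDBC driver's API:

```python
import time

def run_bi_workload(system, days, read_queries, inserts_by_day, deletes_by_day):
    """Hypothetical driver loop: one batch per simulated day.
    'system' is any object exposing apply() and execute(); illustrative only."""
    timings = []
    for day in range(days):
        # First apply the day's worth of updates (inserts, then deletes)...
        system.apply(inserts_by_day[day])
        system.apply(deletes_by_day[day])
        # ...then run the analytical read queries with curated parameters.
        for name, query in read_queries:
            start = time.perf_counter()
            system.execute(query)
            timings.append((day, name, time.perf_counter() - start))
    return timings
```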
Starting point is 00:22:05 And of course, you asked how you go about implementing this. Well, you first implement your loader, possibly with a schema, and then some automated way to load the CSV files. Then you implement the queries. We have 20 queries that you have to implement, and we have eight insert and eight delete operations. Then you just have to stage this together and you will have your end-to-end workload running. Of course, then you have to validate it against some existing system and ensure that your system gives the correct results. Awesome. So what's the score at the end? What am I competing on,
Starting point is 00:22:42 and what's the final value, so I can say my system performed like this compared to other systems that have also run the benchmark? We have two scores, inspired by the TPC benchmarks. We have the power score, which captures single-query performance. And the rationale for this is that if you're an analyst and you're sitting in front of your computer,
Starting point is 00:23:03 then you want timely answers to your queries. So we basically want to have a number that's connected to the sheer speed of the system when you have a complex query, you hit enter, and you have to wait X seconds. But the other use case is that many of these systems run somewhere on a server. Maybe they produce some reports. Maybe they look for fraud or pre-compute recommendations. And in this case, the individual query runtimes are not that relevant; what's relevant is the throughput of the system. So we have a throughput score, where you can run queries concurrently, and that will be characterized by the final throughput score. And we actually have two more scores, which are the aforementioned power score and throughput score, but adjusted with the system price, because we take the system price into account. So we have dollar-adjusted throughput and power scores.
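As a very rough illustration only: the authoritative score definitions live in the SNB BI specification, and the geometric-mean "power" shape below is an assumption modeled on TPC-style metrics, not the exact formula.

```python
import math

# Rough illustration of the two kinds of scores; not the official formulas.
def power_score(query_times, scale_factor):
    # Geometric mean rewards systems that are fast on *every* query,
    # not just fast on average.
    geomean = math.exp(sum(math.log(t) for t in query_times) / len(query_times))
    return 3600.0 * scale_factor / geomean

def price_adjusted(score, system_price_usd):
    # Both power and throughput also come in price-adjusted variants.
    return score / system_price_usd

times = [1.2, 0.8, 30.0, 5.5]  # per-query runtimes in seconds, hypothetical
print(power_score(times, scale_factor=100))
print(price_adjusted(power_score(times, scale_factor=100), system_price_usd=50_000))
```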
Starting point is 00:23:37 Awesome. So in the paper, you took two systems for a spin; you went through this workflow with them. Can you maybe tell us about that experience? And I guess before you do that: how long do you think it should take someone on average (I know this is probably a how-long-is-a-piece-of-string sort of question) to implement this and get
Starting point is 00:24:17 a first iteration of the whole workload benchmark set up, running, and generating scores? Okay, so it's a good question. The way I explained it before made it sound very simple, because you have to load some data in your system, which you likely already know very well, and then you have to implement 20 queries and a bunch of updates. But it turns out that implementing those 20 queries alone is a huge undertaking, because there are no standard languages in the graph space as of today, barring the SQL/PGQ extension. So it is quite a lot of work just to formulate your queries in a different language. It can be Cypher, GSQL, GQL, SPARQL, et cetera. This will take probably a few weeks for you to implement the queries correctly. The updates are usually simpler to implement, but the challenge then is to ensure that everything is correct.
Starting point is 00:25:18 And that's challenging for both the reads and the updates. So you will likely spend a few more weeks debugging and optimizing your query performance. And this is not to mention that usually when people do this, they discover some bottlenecks in the system, and that will necessitate some work from the engineering department and likely a new release of the system. So when you put all this together, you can get an initial implementation in about a month, but a serious implementation attempt is probably going to be a multi-month effort. Fantastic. Yeah, so when you actually start thinking about it, it ends up taking a lot longer
Starting point is 00:25:54 than you'd initially think. You think "I'll do it in a long weekend" and then it becomes a multi-month project. Cool. So in the paper there are two systems that you took the benchmark for a spin with. Can you tell us about the systems, what the experience was like getting it to work on them, and what your analysis of the results was? Of course. So we picked two systems, and they couldn't be much more different. The first system that we used was Umbra. This is developed at TU Munich, so it is an academic system. It's a single-node, non-distributed, relational database management system. They support most of the syntax of PostgreSQL, and they go to great lengths to ensure that they do larger-than-memory processing,
Starting point is 00:26:46 so they can spill to disk, specifically with NVMe SSDs in mind. And they have a state-of-the-art JIT-compiled query engine, which actually takes the SQL query and emits x86 machine code from it. So it compiles not to LLVM, but to an even lower level, which makes compilation super fast. And it produces very good query plans and very good code to run as a consequence. So this was one system that we used. And the other system was TigerGraph. TigerGraph is an American company which creates a distributed database system. This is a native graph database, so they don't have a relational storage engine; they treat all your data as a graph. And they support a language called GSQL, which is specifically created for graph traversal
Starting point is 00:27:39 and graph algorithms. Cool, that's fantastic. So are there any limitations with the benchmark at the moment? And are there any gaps in it that you think could be improved in the future? Definitely. There is always more work to be done in benchmarks. I think one of the most glaring omissions that we made in the BI workload is that it doesn't really have graph algorithms, in the sense that we never do something like a connected components computation or community detection on the graph. This is something that I could imagine in an actual workload running on a similar data set, and it would probably be interesting. GNNs as well: graph neural networks are graph algorithms that many people run on graph-shaped data, and this would make
Starting point is 00:28:33 a great addition to the benchmark. It will of course come at the expense of more complexity. Yeah, of course, it'd be a really interesting benchmark, and I guess a closer approximation of what people might want to do in practice. Maybe the BI 2.0, I don't know. Cool. And I guess on that, maybe we can touch on what's next for LDBC and the BI workload. Is there going to be a 2.0, or what's it looking like at the moment? So LDBC has some guidelines on renewal, because obviously these benchmarks don't stay current and fresh for an infinite amount of time, so we try to do renewals every five to ten years. This is still quite some distance out, so the BI
Starting point is 00:29:22 workload is safe. Our goal with the BI workload is to have more audited systems. And I think this is a good opportunity to talk about audits. LDBC has a process where we put out our benchmarks in public with really detailed specifications on how they should be used. Companies can then hire an independent third-party auditor who actually runs the benchmark for them. So it's a complete reproduction of the benchmark, followed by a very detailed report called the full disclosure report. And this is reviewed by the benchmark task force. All of this is to guarantee that the benchmark was done correctly.
Starting point is 00:30:00 It's a faithful implementation, and the execution didn't have any issues in it. So for the BI workload, we are currently in this phase where we are facilitating audits. We have already had a couple of audits that were conducted successfully, and we have more on the way. So this is the current status for us. Fantastic. How many have been completed so far? So there are four successful ones. All of them were done with the TigerGraph system on different setups: one was an on-premises machine and the others were in the cloud. And there is one more that's currently underway, with more planned.
Starting point is 00:30:38 Fantastic stuff. So where does this sit then, do you think? Let's position it against other benchmarks in the space: how do you think it compares with some of the things from TPC and other LDBC workloads? Because this isn't the only one, right? We've got other ones as well, so maybe you can talk about those too.
Starting point is 00:30:56 Sure. So LDBC has a couple of benchmarks. It has the original Social Network Benchmark, which is called the Interactive workload. The Interactive workload simulates transactional processing, so it has short writes and short reads. Compared to this, the BI workload is basically the other end of the scale; this is OLTP versus OLAP. Then we have an RDF-focused benchmark called the Semantic Publishing Benchmark. That's more focused on inferencing and being able to handle RDF-shaped data with SPARQL queries.
Starting point is 00:31:30 And we have a new, very interesting benchmark called the FinBench, which targets distributed systems. It does graph processing on huge amounts of data with relaxed consistency constraints. This is something that's just been released as version 0.1. And this is something where you really have to ensure that your system can not only answer the benchmark queries, but can do so very quickly, under 100 milliseconds in most cases. So this is, again,
Starting point is 00:32:00 probably an even more transactional benchmark than the LDBC SNB Interactive workload. Nice, cool. So I'd like to ask now, what impact do you think your work can have? Obviously, with all of the audits, that's some real tangible real-world impact. Do you think there's anything else, maybe from more of a software developer or data engineer sort of perspective,
Starting point is 00:32:23 what impact it can have on their day-to-day lives, or how they could actually leverage the findings from the paper or from the benchmark? Sure. So if you have been in research, and I know you have, then you know how valuable a good data set is. I think by far the biggest contribution of this paper is having openly available, realistic graph data sets that you can download for free, in many different formats and at many different scale factors. I actually see some statistics on how many people download these: they can be quite popular, and they pop up in places that are not restricted to graph databases or graph data management. Even if you have something like a relational query engine, you can use this data to load into the engine and test it. Yeah, so as I said, I expect a lot of impact from
Starting point is 00:33:18 these datasets. And I just looked up the statistics that we have on the surfsara.nl repository: we have more than 12,000 downloads amassed for the old Interactive workload datasets, which go up to scale factor 1,000. So obviously, we expect similar interest, or even more, for the larger and more interesting datasets. Yeah, there's some real-world impact for you right there. That's fantastic. Cool. So I guess the next question I like to ask my guests on the show is: when you've been working on this project, and it's obviously a multi-year project, you've worked on this for a long time, what was the most interesting lesson that you got from the whole experience of working on this paper and project, and LDBC as a whole, right,
Starting point is 00:34:10 because it's not just this one paper you've worked on as part of the whole project. Sure. So obviously it has been a huge learning experience, but because we went for large-scale data, it taught me an awful lot about cloud computing and how to run software in the cloud. And it had one very important lesson. We have these dataset sizes called scale factors that we scale exponentially: scale factors 1, 3, 10, 30, 100, 300, and so on, so basically every dataset is roughly 3x larger than the previous one. And this is already an exponential increase. But in reality, if you start scaling up, things start to break and problems start to show up. Scaling up the data set along this exponential scale is actually
Starting point is 00:35:00 a super-exponential problem, because you will have more frequent crashes, and you will end up spending quite a lot of money on cloud compute and cloud storage costs and on running Elastic MapReduce in AWS. So this is something that you have to plan really well. You have to make sure that you have the funds to do it, and also the time to generate, test, package, and just handle all these data sets. Awesome. I guess on the flip side of that then, what were the things on this journey that failed, things that you tried that didn't work? What are the war stories? Obviously it's never plain sailing, right? It's never just one big upward trajectory. So what were some of the more challenging aspects of the journey?
Starting point is 00:35:45 Sure. So one interesting finding was that there are these graph analytical systems which are geared towards taking a graph that's mostly static in nature and then doing fast computations on top of it. In the distributed systems area, you can think of Apache GraphX, Apache Giraph, or Flink Gelly. And in the single-node systems, you have GraphBLAS and other systems. And on the surface, these would be kind of ideal for a BI workload. You can load your data into them, and then they are really efficient in doing the graph
Starting point is 00:36:20 computations. But it turns out that most of these are really geared towards untyped, unattributed graphs. So even though we could express most of the LDBC BI queries in these systems, expressing them was a lot of work and the performance wasn't great. So we ended up dropping these, and we restricted ourselves to database systems in the workload. So this is something we tried that didn't really work out. Yeah, I remember this. Was it RapidMatch, that system we tried? Yes, so there was an interesting tangent to this work. It's funny you should mention this, because
Starting point is 00:36:59 we realized after a while that this workload is very complex. And as we just discussed, if you start implementing this, it's probably going to take you months. So we did a simplified version of this called LSQB, the Labelled Subgraph Query Benchmark, which is in essence the data set of the BI workload stripped back to the typed graph. So you still have things like Person and Tag and City, but you no longer have the attributes; it's just a labeled graph. And we took the queries of the BI workload, selected six that had some interesting graph patterns, and stripped them back to the types as well, so they no longer had attributes or any filtering conditions in them. And then we had this micro-benchmark, which you could implement really quickly, like in a day for most systems.
Starting point is 00:37:47 And we didn't mean that to be an official LDBC benchmark that you can audit. It was rather a tool that researchers working on a system can use as a litmus test. They can ask questions like: can I beat Postgres? Can I beat Umbra? Can I beat existing state-of-the-art graph database systems? And that's where we used the RapidMatch system, which was specifically created to compute graph isomorphism and graph homomorphism problems. So there we had some interesting findings. And there, of course, with this approach, the graph analytical slash graph algorithm systems had a better chance
Starting point is 00:38:29 because there were no properties in the graph. Yeah, and I guess, well, we spoke about real-world impact earlier on, but this is also making the benchmark more accessible to academics, right? Not needing all the trimmings and everything else that goes with an industry-grade benchmark, and still
Starting point is 00:38:47 having some sort of impact in that space as well, and having something more accessible, which I think was a really nice contribution to the space. It was fun to work on, for sure. Cool. But while we're talking about other research: I know you've worked on so many things over the course of your career in academia, Gábor. So can you tell us about some of the other things you've worked on as well? Sure. So I spent quite a long stint working on GraphBLAS, which I just mentioned in the category of graph analytical systems. Most people are probably familiar with BLAS, B-L-A-S, which is there to create a standard API for dense matrix and vector operations.
Starting point is 00:39:31 This is something that has been around for 30-plus years. What GraphBLAS does is take this idea to the realm of sparse matrices, which are excellent for representing graphs, and to the realm of semiring computations on these matrices, which are a great tool for defining graph computations. And I always found it extremely elegant that you can say: I have a matrix representing the graph, and I have a vector representing the positions in the graph, the nodes that I am currently visiting. And then you can do something called min-plus multiplication.
Starting point is 00:40:08 So you use matrix multiplication with a semiring, where the plus operator is actually the minimum operator, and the multiplication operator is actually the plus, the addition operator. And this allows you to find shortest paths in the graph. So it is very concise, and it is very machine-friendly, because you can have a sparse matrix computation algorithm that has no understanding of graphs; it just knows how to multiply a sparse matrix with a sparse vector. And then you can express all of these problems, like single-source shortest path, BFS, and other graph traversal algorithms, really elegantly and really performantly.
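Here is a tiny self-contained Python illustration of that min-plus semiring idea, using dense lists for clarity; GraphBLAS does the same thing over sparse structures with highly optimized kernels:

```python
import math

# Min-plus (tropical) semiring sketch: the matrix is the graph's weighted
# adjacency (math.inf = no edge), the vector holds current best-known
# distances. One "multiplication" relaxes every edge once, as in Bellman-Ford.
INF = math.inf

def min_plus(matrix, vector):
    n = len(vector)
    # (A (*) v)[i] = min over j of (A[i][j] + v[j]):
    # the semiring's "plus" is min, its "times" is ordinary addition.
    return [min(matrix[i][j] + vector[j] for j in range(n)) for i in range(n)]

A = [[0, 5, INF],
     [5, 0, 1],
     [INF, 1, 0]]
dist = [0, INF, INF]          # single source: node 0
for _ in range(len(A) - 1):   # n-1 relaxation rounds suffice
    dist = min_plus(A, dist)
print(dist)                   # [0, 5, 6]: shortest path 0 -> 2 goes via node 1
```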
Starting point is 00:40:38 I'm thinking there: has there been any crossover between the two loves of your life, GraphBLAS and LDBC? Have you ever tried it on any of the benchmarks, any of the workloads? Because we have a graph algorithms benchmark, right? So have you tried it on that, and how did it perform? So we have a graph algorithms benchmark, as you just mentioned.
Starting point is 00:41:09 It's called Graphalytics, which is a portmanteau of graph and analytics. And we have a GraphBLAS workload on it. We even have a GraphBLAS competition running this year, but we haven't announced the results yet, so maybe your listeners can check back on our websites. But GraphBLAS did participate, and there were other systems, so I'm personally also really curious to see what the results are and how it fares against other systems, which may use a completely different computation model. Yeah, sure. Do we know when those
Starting point is 00:41:43 results are going to be made public? Around the end of the year, so in the coming weeks. Okay, just in time for Christmas. Cool, great. Now we've just got two more questions, Gábor. The penultimate one is my favorite, and it's about the creative process. So how do you go about generating ideas, and then, once you have generated them, how do you select which ones to work on? Well, as you may have guessed, I like to select projects which have a chance of creating a real-world impact. So I actually spend a lot of time just reading forums, reading Stack Overflow, even answering questions in the graph space, and then joining Slack and talking to people on these Slack channels and gathering an impression of what they find difficult and what they deem interesting.
Starting point is 00:42:33 And of course, I also do this at conferences and any opportunity that I get to talk to fellow researchers. And I try to compact it down to core ideas, like what is the main problem? Is it the data loading? Is it that the join is slow? Is it aggregation performance? Is it the visualization?
Starting point is 00:42:54 And then I try to tackle that in my research. Because I did benchmark design for a long time, this was really natural: if you hear something that's bugging people in terms of performance, then you just create a new choke point in the LDBC benchmark, and then you design new queries and you've got that covered, and hopefully a couple of years later someone will implement an optimization that tackles just this issue. Fantastic. That's another good answer to that question. Got to
Starting point is 00:43:24 look inside your mind and see how it all works. Great stuff. Cool, so it's the last question now, the last word: what's the one thing you want the listener to take away from your work and from this podcast today? So, I worked on this benchmark between early 2017 and the end of 2022, so that spans a six-year time span. And this implies that it was difficult; it took a long time. I think benchmarks in general are surprisingly hard. If you want to have something that's really realistic, objective,
Starting point is 00:44:03 and representative of a real-world use case, and it also scales, then you will have to invest a lot of time into actually making this a reality. So this has a big implication for systems builders, because if they build a system and they want to benchmark it in their papers, they have to be really careful about selecting a good benchmark and then abiding by the principles of that benchmark. And I would slightly discourage people from trying to invent their own benchmarks, because we know that in most cases this will end with the authors of the paper winning in almost every single aspect of the benchmark, which is not necessarily helpful for evaluating the merit of an idea. So I would
Starting point is 00:44:54 take an off-the-shelf benchmark, probably one that's widely used. TPC benchmarks are great, and I can also recommend LDBC benchmarks; try to use those. You can simplify them, and you can definitely implement them partially; it's not expected that all papers implement the full workload. But if you do this, you will have a head start in terms of using a more realistic data set, more interesting queries, more updates, and so on. So ultimately, I think it will benefit your paper. Fantastic. Let's end it there then. Thank you so
Starting point is 00:45:30 much, Gábor, for coming on the show. We'll put links to everything in the show notes, so the listener can go and find out and research all of the cool things we've spoken about today. And where can we find you on the socials, Twitter or LinkedIn? What's the handle? My handle is my last name, Szárnyas, and the first letter of my first name, G. So szarnyasg on Twitter and GitHub and most other platforms. And also, LDBC has a Twitter account: it is twitter.com/LDBCouncil. Perfect.
Starting point is 00:46:02 We'll put those links in as well. And a reminder that if you do enjoy the show, please do consider supporting us through Buy Me a Coffee. It really helps us to keep making the show. And we'll see you all next time for some more awesome computer science research. Thank you.
