Disseminate: The Computer Science Research Podcast - Per Fuchs | Sortledton: a Universal, Transactional Graph Data Structure | #13

Episode Date: November 28, 2022

Summary (VLDB abstract): Despite the wide adoption of graph processing across many different application domains, there is no underlying data structure that can serve a variety of graph workloads (analytics, traversals, and pattern matching) on dynamic graphs with transactional updates. In this episode, Per talks about Sortledton, a universal graph data structure that addresses this open problem by carefully optimizing for the most relevant data access patterns used by graph computation kernels. It can support millions of transactional updates per second, while providing competitive performance (1.22x on average) for the most common graph workloads compared to the best-known baseline for static graphs, CSR. With this, we improve the ingestion throughput over state-of-the-art dynamic graph data structures, while supporting a wider range of graph computations under transactional guarantees, with a much simpler design and a significantly smaller memory footprint (2.1x that of CSR).

Links: Paper, Per's LinkedIn, Graph Framework Evaluation, Implementation

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, a computer science research podcast. I'm your host, Jack Waudby. I'm delighted to say I'm joined today by Per Fuchs, who we'll be talking to about his VLDB'22 paper, Sortledton: a Universal, Transactional Graph Data Structure. And I believe this was, you can correct me if I'm wrong here, Per, but this was nominated for best paper, right? So it's a runner-up. Yeah, fantastic. So Per is, or was, sorry, should I say, a PhD student at the Technical University of Munich, and he's now working at Tableau, and his research interests are graph database systems and dynamic graphs. So welcome to the show, Per. Yeah, thanks a lot for the introduction, I'm happy to be here. Fantastic. So can we start off by explaining to the listener what a graph database system is, what graph processing is, and just sort of give us some background to the paper?
Starting point is 00:01:14 Yes, so basically for me, graph processing or graph database systems are really around handling at least three different graph workloads. These are graph pattern matching, analytics, that would be PageRank, and traversals like single-source shortest path. And all of these have really different requirements towards the system, different access patterns towards the underlying data structure, which is where this comes into connection with my work. For me, it's really important to say it's universal. You need to have all three of these workloads. Otherwise, nobody wants one system for one workload, and then another system for their company to support the second workload, and then maybe a third one for the third workload.
Starting point is 00:01:58 And as well, you want updates on these systems. You don't want to reload the graph every time. It takes longer to reload a graph than to run PageRank. So a lot of the static graph processing systems out there are a bit... it's difficult to put them into industry, I would say. So there's a connection to what can we really use in industry here. Fantastic. So just kind of, I guess, going back a little bit, how did you get interested in graphs in the first place? What was the attraction here? I started really early. My bachelor thesis was already connected to graph database systems. My supervisor let me use graphs to detect denial-of-service attacks
Starting point is 00:02:34 in programs, using graphs which represent those programs. And I always found them fascinating because they concentrate more on the connection between things. So relational database systems have a lot of entities in them, and then you join them to get results. But graph database systems put this idea, that there's a very important connection between many things, in as a first-class citizen. And it makes it really easy to express these ideas naturally in an interface,
Starting point is 00:03:06 so that's another big challenge here, which I didn't touch on in my paper, but which I would have liked to in my PhD. Awesome. So I know in your paper you say that it's an open problem to build a system that can efficiently process this whole range of graph algorithms over a dynamically changing graph dataset. So why is this still an open problem? Can you maybe elaborate on that a little bit more? Yeah, the challenge is really that you have different access patterns for each workload. So there are intersections, or generally set operations, for graph pattern matching, which require you to have a sorted data structure or a hash set.
Starting point is 00:03:47 You want very, very fast edge inserts, up to 2 million a second over the whole graph. That's the known industry need of Twitter and Alibaba. And finally, you want this to be really easy to scan, because this is what graph analytics and traversals mostly depend on. And you want to have random access towards each neighborhood, which is super important for traversals.
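A minimal sketch, not Sortledton's actual code, of why sortedness matters for pattern matching: the core operation of kernels like triangle counting is intersecting two neighborhoods, which is cheap only if both are kept sorted (or hashed).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count the common neighbours of two vertices by merging their sorted
// neighbourhoods. This is the core set operation behind graph pattern
// matching (e.g. triangle counting) and runs in O(|a| + |b|) only because
// both neighbourhoods are kept sorted.
std::size_t intersect_sorted(const std::vector<std::uint64_t>& a,
                             const std::vector<std::uint64_t>& b) {
    std::size_t i = 0, j = 0, common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++common; ++i; ++j;
        }
    }
    return common;
}
```

With unsorted neighborhoods, the same operation degenerates into a nested scan or requires building a hash set on the fly.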
Starting point is 00:04:11 And each of these things is solved on its own. So there are data structures out there which we can just put together to make a graph data structure. And this was kind of an observation I had really early on. So my feeling was there is a nice and easy solution reusing existing things out there to solve all these challenges in one. And then the next big challenge is to do that with utmost efficiency,
Starting point is 00:04:39 because a single edge is a super small entity. It's 32 or 64 bits. So that's just super small, and every overhead matters. Cool. So I guess that segues nicely into what is Sortledton. And I struggle to pronounce this,
Starting point is 00:05:00 so I'm probably not pronouncing it correctly, because when I first read it I was like, is it Sortled-on? Yeah, it's confusing. So how do I pronounce it, I guess, and what is it? Yeah, Sortledton. It comes from being sorted,
Starting point is 00:05:13 and it's basically a simpleton idea, you know, it's very simple. And that's the combination, Sortledton, in the end. Maybe a bad joke on my side. I think nobody ever got it.
Starting point is 00:05:31 No, I didn't realize that was it, but no. Cool, yeah, so I guess what is it? Yeah, so can you give us a high-level overview of what it is, and then we can dig into the details. Yeah, so Sortledton is the data structure which this paper is about, and it's very simple indeed. It's basically a very well implemented adjacency list.
Starting point is 00:05:52 So everybody knows the adjacency list, and then you dig into all the requirements I said before, and you find out that you can reuse existing data structures to build something. And that is Sortledton, a very highly efficient adjacency list.
Starting point is 00:06:12 But I think the paper has much more to offer. So the data structure is what it's all arranged around. The discussion is all around this data structure. But we touch on each design decision we took for coming up with this data structure in the end. So you can see the trade-offs we made, and you can see which trade-offs are there, and the process of designing a well-working graph data structure. So my hope is that with this paper, after you've read it, you have a much deeper understanding of what it takes to build a graph data structure.
Starting point is 00:06:40 And then you can take Sortledton, which is so simple that it's easy to understand and explain, to build your own data structure on top of that. Because you understood it so well, you will know which parts you would need to change to make it work really well for your specific workload. And it's also, due to its universality, a good base to start with for a graph database system. You touched on it there, but can we talk a little bit more about what the design goals were when you were developing Sortledton? The design goals, I think, were on two sides.
Starting point is 00:07:15 One, it should be very dynamic. Talking, as I said, 2 million updates per second. We actually made 4 million in the end. And this is fully transactional, serializable updates of single edges. And the second design goal was to be as close as possible to the static gold standard data structure for graphs, namely the compressed sparse row (CSR) representation. And we wanted to show how close you could get while being dynamic and transactional. With that, there was a third design
Starting point is 00:07:46 goal, which I think should be in every data structure you design: space efficiency. It should be very space efficient. These, I think, were the three design goals we gave ourselves, and we wanted to press them to the bare minimum of what's possible and explain
Starting point is 00:08:02 why we believe this to be the minimum. I guess my next kind of question is, and I guess now I know what the idea behind the name is, what were the key ideas underpinning Sortledton? Well, first of all, there are a few principles underpinning it. It should be simple and it should be explainable. It should reuse existing things.
Starting point is 00:08:30 And then it should be a very, very performant solution in the end. So those were the principles we were aiming for. And the key idea to get there is to analyze the access patterns of each workload, clearly write them down, and then say, we need this, where do we need it? And we found out that there are five access patterns in the end. And out of these five, one turns out to be much less important than all the others, which makes it easy to design something in the end. So we kicked out the one property which a lot of people believed in the beginning to be very important for a strong graph data structure. And then it turned out,
Starting point is 00:09:11 experimentally, it's not that important. You can't get that much from it. And then by removing that, you can show that an adjacency list is basically as strong as a CSR. And that makes it so much simpler. Cool. So can you maybe go through these five properties, because I know you did some initial experiments, micro-benchmarks, to explore the design space here. Can you run through those and talk through your initial methodology, and how you arrived at the point that this one property everyone thinks is really important is actually not that important?
Starting point is 00:09:45 Yeah, sure. I can do that. So basically a CSR has five access patterns covered really well. The first one is, if you have a vertex and you want to find out where its neighborhood is, that's a single indirection lookup.
Starting point is 00:09:59 That's the first access pattern or property. I think access pattern is probably the better word here. The second thing is, once you have a neighborhood, it's all sequential memory, so you can scan it as efficiently as modern hardware allows. The third thing is that it's sorted. Each neighborhood is sorted in itself.
Starting point is 00:10:19 The fourth is that all vertices are sequential, which is important for some algorithms as well; they want to access all vertices in the same order. And the final, fifth access pattern is if you want to access all neighborhoods in the order they are stored. And this happens a lot with analytics. And this is the only difference, if you think about it, between a CSR and an adjacency list. An adjacency list can cover all the other four with ease, but this one, having all the neighborhoods sequentially in memory,
Starting point is 00:10:50 that makes the CSR really different from an adjacency list. And it's very hard to get right if you want to have it dynamic, because you have this gigantic data structure holding all edges of the graph. And that is just a very hard data structure to get right. There are existing approaches to do that out there, but they get quite complex.
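For readers who want the comparison made concrete, here is a toy sketch of the two layouts being contrasted; the names are illustrative, not taken from the paper's code.

```cpp
#include <cstdint>
#include <vector>

// Static CSR: one offset per vertex into a single edge array, so all
// neighbourhoods sit back to back in memory (the fifth access pattern).
struct CSR {
    std::vector<std::uint64_t> offsets;  // size |V| + 1
    std::vector<std::uint64_t> targets;  // size |E|, grouped by source vertex
};

// Simplest dynamic adjacency list: one separately allocated, sorted vector
// per vertex. It keeps the other four access patterns (single indirection,
// sequential scan per neighbourhood, sortedness, sequential vertex set) but
// gives up on all neighbourhoods being contiguous.
using AdjacencyList = std::vector<std::vector<std::uint64_t>>;

// The access pattern both support equally well: finding one neighbourhood.
std::uint64_t degree_csr(const CSR& g, std::uint64_t v) {
    return g.offsets[v + 1] - g.offsets[v];
}
std::uint64_t degree_adj(const AdjacencyList& g, std::uint64_t v) {
    return g[v].size();
}
```

The fifth access pattern, scanning all neighborhoods in one pass over contiguous memory, is exactly what the single `targets` array gives you and what the per-vertex vectors give up.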
Starting point is 00:11:13 And then you can run a simple micro-experiment of building the simplest adjacency list you can think of, out of standard vectors, and a CSR, and then you compare them across a wide range of workloads. So we used the Graphalytics benchmark, and it shows that there's very, very little difference between a very simple adjacency list and a CSR. The difference is 1.1 or 1.2x in performance, meaning latency over all the workloads. And yeah, by this experiment, you can just say, okay, we leave this one property out and then we build an adjacency list which is a bit more complex than the one we just used,
Starting point is 00:11:51 and we can expect very good results. To give a little more context here, we also ran micro-experiments for each of the other access patterns, and dropping the next most important access pattern would mean a slowdown of 3x over all the workloads. So one can clearly see that there's a huge difference in the importance of these access patterns. And the most important access pattern in this benchmark and the micro-benchmarks would actually be being sorted, because you simply can't run graph pattern matching without being at least sorted. Awesome. So you touched on it a little bit there, that there are other existing data structures out there. How does Sortledton compare with these?
Starting point is 00:12:34 Maybe can you give a brief overview of what's out there and then kind of how you contrast with those? Yeah, let me run through that. Basically, the other graph data structures we looked at were graph data structures which supported single-edge updates. We believe single-edge updates to be really important for graph database systems. However, this is something we kind of reconsidered after writing this paper, so maybe we can touch on that later again. So Stinger, LLAMA, and GraphOne, they can support analytics and traversals quite well. There are really recent papers on that, and Stinger is one of the old ones, and LLAMA as well, which everybody refers back to again and again. Then we have LiveGraph, which added transactional support to the picture.
Starting point is 00:13:21 So again, really important for graph database systems. And finally, we have Teseo. And Teseo is the only one which has sorted neighborhoods. So it's the only one supporting graph pattern matching so far. Teseo was only out the year before us at the same conference.
Starting point is 00:13:37 And basically, they were finishing up while we were writing our paper. And so the main difference to Teseo is we are much simpler. Teseo has this last access pattern of the CSR. They have it, but this makes them so complex in design that, yeah, it was nice to show that you could also do the same
Starting point is 00:13:59 or maybe have better performance, with a simpler system. So that's the difference here. There's one system we didn't compare to, but I think it's really notable. It's Aspen. And this is something one should have an eye on. We didn't compare to it because it only supports batched updates. But again, as I said, we reconsidered this towards the end of our work, I think. And they do really strong work as well.
Starting point is 00:14:26 How did you go about implementing Sortledton? How was it implemented? How difficult was it to implement? I'm guessing, with the fact that it was designed to be simple, it's much simpler to implement than something like Teseo would be. Yeah, I would say so. So the implementation is in C++ and from scratch. It's basically an implementation of a vector for the index,
Starting point is 00:14:55 which we actually used from the building blocks. And then we have the implementation of a blocked skip list. So a blocked skip list is a skip list where each node holds multiple elements. Yeah, that's mainly it. And on top of that, we built transactional support by, well, we took the transactional support designed for relational systems, the paper from our chair, and again, made this simpler, realizing two things about graph workloads. The first thing is that mostly it's very, very small updates, like a single edge or two or three edges.
Starting point is 00:15:33 So updates tend to be very, very small. And second, you normally know what you might touch before you do the update. You know you want to insert, say, five edges, and that means I need to lock the vertices on each side, so at most 10 vertices will be touched. So you know the write set beforehand, and by that you can save on things; for example, you don't need an undo list anymore.
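A rough sketch of the blocked skip list mentioned above, assuming a layout rather than reproducing Sortledton's actual implementation: each skip-list node stores a small sorted block of edges, so neighborhood scans stay mostly sequential while lookups and inserts can skip over whole blocks.

```cpp
#include <cstdint>
#include <vector>

// One node of a blocked skip list: instead of a single key per node, each
// node holds a small sorted block of edges. Scanning a neighbourhood is then
// mostly sequential memory access, while lookups and inserts can still skip
// over whole blocks via the tower pointers.
struct EdgeBlock {
    std::vector<std::uint64_t> edges;    // sorted destination vertex ids
    std::vector<EdgeBlock*>    forward;  // skip-list tower, one pointer per level
    std::uint64_t max_edge() const { return edges.back(); }  // only called on non-empty blocks
};

// Find the first block whose largest edge is >= dst, i.e. the only block that
// can contain dst. `head` is assumed to be an empty sentinel node holding the
// full tower height.
EdgeBlock* find_block(EdgeBlock* head, std::uint64_t dst) {
    EdgeBlock* node = head;
    for (int lvl = static_cast<int>(head->forward.size()) - 1; lvl >= 0; --lvl) {
        while (node->forward[lvl] != nullptr && node->forward[lvl]->max_edge() < dst) {
            node = node->forward[lvl];
        }
    }
    return node->forward.empty() ? nullptr : node->forward[0];
}
```

Inserting an edge would then mean locating the right block, inserting in sorted order, and splitting the block when it overflows, which is where the per-vertex locking and versioning described above would hook in.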
Starting point is 00:16:01 Cool. So how did you arrive at that assumption that there are small updates and that you know the size of them a priori? Yeah, I mean, it's one of the biggest challenges to actually get really good graph workloads. We found this very hard. So what we did is we just looked at all the benchmarks which are out there, mainly the LDBC SNB benchmark. And we find that the SNB benchmark focuses on these really small updates, plus one thing: it has deletions in there which can be cascading.
Starting point is 00:16:33 So there is actually one workload out there which we are aware of which is not falling into this area of really small updates, cascading deletes. But we also came to the conclusion that cascading deletes is something totally new, which so far no transactional system we are aware of could do well.
Starting point is 00:16:56 Any transactional protocol we know of wouldn't work so well with something that can basically touch a whole graph, where you have no idea how much of the graph it will touch beforehand. If you want to run this concurrently with really heavy workloads like PageRank, that would be very hard to get right. So we excluded this workload. So this workload is something we couldn't do. And I believe it would be a paper of its own
Starting point is 00:17:25 to actually come up with good transactional support for cascading deletes, or other things that can start at one vertex and go through the whole graph and then do reads and writes. How long was the implementation effort? Are we talking the magnitude of months, weeks, years? So the implementation effort was maybe around, I think, two months.
Starting point is 00:17:51 It kind of went from January to March, so roughly around that. And then there was a prolonged phase of actually getting this evaluated. We were really happy: Dean De Leo from CWI had just published Teseo, kind of our main competitor there, and he published a test harness with that. So we could just reuse that. So that was just amazing for us,
Starting point is 00:18:19 made it so much easier, and we had great support from him. However, we had to iron out a few bugs or, let's say, consistency issues between our implementation and how he saw things. We had to get the numbers right, such that when we measured on our system we got the same measurements with his test harness, and this took quite a long while. I think it was maybe another month or two. And then we finally got the numbers we wanted. Fantastic.
Starting point is 00:18:50 So let's talk about the numbers then. So how did you go about evaluating Sortledton? And I guess what were the questions you were trying to answer? And I guess what did you compare against as well, right? So there's a few questions here, you can take them one at a time. Yeah. So throughout the paper we had a lot of micro-benchmarks to just give the meat behind our decisions, right, to be like, this is why we do that; here you can clearly see why we did this in the micro-benchmarks. For these, I think, just read the paper. And then the second one is the evaluation of our paper, of course. And here we touched first on how many updates we can do, throughput per second. And we
Starting point is 00:19:30 always compared against all the competitors I named when discussing related work. So we did throughput per second. We compared how much slower we and any of the others are compared to the static gold standard CSR. We compared how well we scale up for inserts, and then we looked at HTAP: what happens if we run the Graphalytics benchmark, which is extremely read-heavy, while inserting as many edges as we still could at the same time. So it's basically combining the first two experiments and then showing how much of a slowdown there was on either side. And yeah, those are our experiments. So those are the questions we tried to answer.
Starting point is 00:20:19 Cool. So space, of course. We look at space efficiency as well. Yeah. Yes. So what was the experimental setup in terms of the hardware and the variety of benchmarks you used? So we basically used one benchmark, which is known,
Starting point is 00:20:40 the Graphalytics benchmark. And for all systems, we used the same implementation of the algorithms. Mostly they were based on the benchmark suite from SCOPE. So the implementation of the algorithm is the same for each data structure. You only see the difference in the data structures there. And the other benchmark is, I mean, it's not a proper benchmark, it's simply that you take a lot of graphs and you try to insert all the edges as fast as possible.
Starting point is 00:21:12 And in our case, I think we inserted 80% of the edges first and then the last 20 and measured the last 20 because it's not that interesting inserting it into an empty graph. So that's that. We have a slight variation of this experiment
Starting point is 00:21:32 of inserting everything as fast as possible. And the HTAP experiment is really a combination of the two before; we ran them at the same time on the same machine. You asked about the machine: we evaluated on an Intel CPU, which I think had 48 physical cores. So I guess now, what were the key results? What were the headlines? What were the interesting findings from all of these experiments?
Starting point is 00:21:59 Yeah, so how many updates can we support? It's around 4 million, and this is averaged over a wide range of graphs. So the smallest would be Dota League with 50 million edges, super small, because you also need to be fast on small graphs. We found out that at least one competitor wasn't that fast on small graphs. And then on the other hand,
Starting point is 00:22:22 you need to be, of course, good on big graphs. And we took the Friendster graph there, with slightly over 2 billion edges, as our biggest graph. On average, and this is relatively stable over all graph sizes, we can do 4 million updates per second. And the industry needs 2 million, as far as we are aware. And then the second thing is how fast we are running read-heavy workloads, and we are at roughly 85% of CSR. So we are very close to the static data structure here. And finally, we are two times as big as the CSR, which is more or less the theoretical minimum, because you need to leave some space free for edges to
Starting point is 00:23:13 insert, and then you need some space for versions. Cool. So that's like 85% of the CSR; do you think it is possible to get it closer with some more work, or is there theoretically a bound there that you're going to bump up against, no matter how good the implementation is, so you're never going to quite get there? That's a very interesting question. To be honest, I have no clear answer to that. I think we left very many optimizations for our data structure on the
Starting point is 00:23:45 way, because towards the end of this paper, we really needed to get it published, because Teseo came out and they were really close to what we wanted to have. And then, yeah, we really felt the pressure to put it out there before somebody got in front of us. Around the same time we got ours into submission, Terrace came out. This is in SIGMOD, covering quite similar ideas, and also the idea that you can use an adjacency list to have different set implementations
Starting point is 00:24:17 depending on the characteristics of the neighborhoods, which was an idea we were probably aiming for as well. So things came out and then we were pressured to finish up. I forgot the question, where were we going? Whether it's possible to close the gap on the CSR. So yeah, we left many optimizations out, so there might be 5% to gain
Starting point is 00:24:46 even in our implementation still. But no, there's no theoretical limit, but there's a limit in that something needs
Starting point is 00:24:56 to cost, right? There's no free lunch. If you want updates, you won't get to 100%, and the last few percent will be harder
Starting point is 00:25:04 and harder, and then there are also diminishing returns there for sure. So are there any situations in which Sortledton's performance is suboptimal? And I guess this is just kind of getting at, what are the general limitations of this data structure? There are two general limitations I'm aware of.
Starting point is 00:25:24 The first one is fixable. So we aimed for single-edge updates. As I said, we believed that to be very important for graph database systems at the beginning of our work. And then later on, when I look back at it, having finished writing it, I was like, okay, in the end, I didn't find a workload which really, really needed single-edge updates. Many workloads could be done with batched updates.
Starting point is 00:25:46 It's basically just a question of freshness in between. And there are papers out there that show you can get really close to the freshness with batched updates that you could get with single-edge updates. So looking back at it, what I would have done is batched updates. You can get
Starting point is 00:26:01 much higher throughput on updates, and make the system design of an HTAP system much simpler. So that's one thing which I think is a limitation, but it's not so much in the design.
Starting point is 00:26:14 We could change the transactional system behind it, and then we could support batch updates on the same data. And the second thing is that
Starting point is 00:26:24 there's a current trend of going away from in-memory systems towards systems which use SSDs really well. I think Viktor Leis is really leading there, and a lot of his PhD students have great work towards that. And our data structure is designed for in-memory, and if you want to go out of memory with that, you would have a harder path than if you choose Teseo, which is the other system I talked about multiple times before, from CWI; Dean De Leo and Peter Boncz are the authors, great system, great work. And I think if you wanted to go SSD with it, they have a simpler path towards that, but therefore a much more complex in-memory implementation.
Starting point is 00:27:14 And there might be a combination of these two where you can get the best of both worlds. Interesting. That's really interesting. When you were working on this, did you ever think about doing some form of distributed data structure, and how that might look with Sortledton? Distributed, yeah. So if you wanted a distributed in-memory system,
Starting point is 00:27:34 again, I think Sortledton would do really well, because it can so easily be pulled apart. It's an index, and then it points to places where the neighborhood is. So in this regard, I don't see any issue with Sortledton being pulled apart across multiple machines. But the real challenge in distributed graph processing is, of course, somewhere else. It's about getting the partitioning right.
Starting point is 00:28:01 And this is a whole different profession, a whole different line of research, and I think it's a very, very hard line of research, where I didn't see any results where I read the paper and was like, that's it. The only result I saw which was
Starting point is 00:28:18 really convincing is from Alibaba which basically showed that depending on your workload, different partitionings will do you well. And this just makes it harder because they clearly show there's a very clear connection between partitioning and workload. And this makes it very, very hard.
Starting point is 00:28:38 Complete moving target. Yeah. Interesting. That's cool. So Sortledton is publicly available, right? There's a repo there.
Starting point is 00:28:48 You can go and pull it down and have a play with it if the listener wishes to. Yeah, it's publicly available. It's open source. I think, though, the implementation is probably more of a guide or proof of concept. So if I wanted to use it in an industry system, I would probably re-implement it. And this is very easily done because it's clearly stated which data structures are used in each part,
Starting point is 00:29:13 a skip list or blocked skip list, and lock-free vectors; they are well known. It's not hard to implement them yourself. You'd maybe get a much cleaner implementation than a PhD student trying to wrap up a paper and doing his first C++ project in a few years. So, yeah. Yeah.
Starting point is 00:29:36 Cool. So what's the API of the data structure like? As the end user of it, how do I interact with it? I think exactly as you would expect. It has insert edge, delete edge for the updates. And then you can ask it to scan neighborhoods for you. And it will basically return you an iterator over the
Starting point is 00:30:06 neighborhood, but the iterator is batched, so you don't have to call an operation for each new edge, but an operation for each batch of edges. And this is basically one of two ways of doing it; the other way would be to offer you a scan function and you pass a lambda in, which is executed on each edge. But the important thing you need to get right here is: don't call a function on every edge. These two things work, and I think they have really similar performance. But if you call a function, say an iterator, for every edge,
Starting point is 00:30:43 you will have a huge performance penalty. And this is also something at least one competitor got wrong. It's an implementation mistake you shouldn't make. Really think about how you access the edges in the end. Okay, cool.
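To illustrate the kind of interface being described, here is a hedged sketch in C++; the method names and the non-transactional internals are assumptions for illustration, not Sortledton's exact API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative interface only: signatures are assumptions, not the exact
// Sortledton API. Internals are a plain sorted adjacency list, without
// transactions, to keep the sketch short.
class Graph {
public:
    explicit Graph(std::size_t num_vertices) : adj_(num_vertices) {}

    // Single-edge updates, keeping each neighbourhood sorted and duplicate-free.
    void insert_edge(std::uint64_t src, std::uint64_t dst) {
        auto& n = adj_[src];
        auto it = std::lower_bound(n.begin(), n.end(), dst);
        if (it == n.end() || *it != dst) n.insert(it, dst);
    }
    void delete_edge(std::uint64_t src, std::uint64_t dst) {
        auto& n = adj_[src];
        auto it = std::lower_bound(n.begin(), n.end(), dst);
        if (it != n.end() && *it == dst) n.erase(it);
    }

    // Neighbourhood scan: the per-edge work comes in as a callable so the
    // compiler can inline it; there is no opaque function call per edge.
    template <typename F>
    void scan_neighbourhood(std::uint64_t v, F&& per_edge) const {
        for (std::uint64_t dst : adj_[v]) per_edge(dst);
    }

private:
    std::vector<std::vector<std::uint64_t>> adj_;
};
```

Usage would then look like `g.scan_neighbourhood(v, [&](std::uint64_t dst) { sum += rank[dst]; });`, so the per-edge body gets inlined instead of paying a call per edge.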
Starting point is 00:30:59 Does it support properties? What's the graph model here? Is it just edges, an undirected graph, or can I have labels on edges? What I'm getting at is, can it support a labeled property graph model? It's meant to be a topology data structure, so it's mostly vertices and edges. It supports edge properties.
Starting point is 00:31:20 You can have as many as you want, and you could make them as big as you want, but if you make them very big, it won't work anymore. So you can have a few small edge properties that will work really well. Vertex properties, we didn't implement them, but basically there would be a vector next to the vertex vector, right? It's kind of a mirroring data structure. And then the last question
Starting point is 00:31:48 was labels. Labels are a really interesting thing, right? It really depends how many you have. And if you want to partition a graph totally on your labels, then you could just use Sortledton multiple times. That's one extreme.
Starting point is 00:32:04 And the other idea is to have labels inline with the edges. That's kind of the other extreme: then you don't have much filtering power over them, you still need to scan all edges and filter them out on the labels. But then you can use as many labels as you want; kind of, properties become labels. So these are the two extremes of the design space you can look at. And both are kind of supportable in Sortledton, but what you maybe want to do is something in between, and this is something you would need to do yourself. And I think the best paper here is the A+ indexes paper,
Starting point is 00:32:40 with Semih Salihoglu as one of its main authors. I think they have the deepest treatment of labels and properties within graph database systems out there. A very great paper, but it's a static graph data structure. Cool, cool. Yeah, so when you were working on Sortledton, what was the most interesting thing that you learned, or maybe also the most unexpected lesson that came out of working on this? Under, I guess, such a high-pressure environment as well, when you're racing to publish because it feels like everyone else is publishing the same thing at the same time.
Starting point is 00:33:18 Yeah. I think the most unexpected lesson wasn't in the technicalities. It was in paper writing. And I think this is a lesson for every PhD student. I had the idea for this paper from the very beginning, and it didn't change much towards the end. And I told this idea to a very well-known professor in the area, and he said, well, there is no paper left in this area.
Starting point is 00:33:43 No way you can publish in this area. It doesn't work anymore. It's full. And, yeah, it turned out to be a best paper runner-up, so apparently it wasn't full. So I think this is a very interesting lesson for every PhD student. If you have a great
Starting point is 00:34:00 idea and you're convinced that you can convince a few other people of it, it might very well be publishable, although some people say it's unpublishable. Fantastic, that's a great lesson to take away. Awesome. So my next question is: progress in research is very bumpy, right, there's ups and downs. So from the conception of the initial idea for Sortledton to that final publication, what were the things that you tried along the way that failed, that maybe the listener would benefit from knowing about?
Starting point is 00:34:32 So they don't make the same mistakes, I guess. The only thing that really failed was my attempt to implement this fast, serializable concurrency control in a very short amount of time and get it right. It's harder to get right than one thinks. And then, yeah, I changed to a simpler implementation, which actually turned out to work super well and to be kind of interesting, because nobody had used it before for graph processing.
Starting point is 00:35:07 So that was one thing that didn't play out. I think all the other ideas played out. Yeah, it just worked really well, actually. There's one thing I also learned during writing papers: it's a lot about the story. Like, I came up with the idea for Sortledton very early on and then implemented that and measured that and then had the numbers I wanted.
Starting point is 00:35:32 And then Teseo came out, and it had become a much stronger paper over time. A very, very strong paper by then. So the last part of getting this paper out was really the storytelling of explaining why a data structure which is just simpler is still really worthwhile to read about, although it has some weaknesses compared to Teseo,
Starting point is 00:35:59 and had some strong points compared to Teseo, but in the end was really similar, the main difference being that it's simpler. And then, yeah, there were a lot of points to make and very nice different stories to tell about it. The next question is maybe not as relevant for you now, now that you've moved on to Tableau, but what's the next step? What's the future research for Sortledton?
Starting point is 00:36:24 Where does it go from here? Yeah, so basically my future research would have been to integrate this into some kind of system to make a much better graph database system. There are two ways of going about this. One way is to integrate it as an index, a one-to-many index basically, into an existing relational system.
Starting point is 00:36:47 And I think many people have tried that, and I'm aware of at least two ongoing research projects at very good chairs trying to do that. And I think it's a very worthwhile approach. However, the one weakness with this approach, I think, is that it kind of makes it a second citizen, right, like a second-class citizen. You put it in a relational
Starting point is 00:37:10 system and then it kind of needs to fit the relational model, and the best would be if you could run SQL on top of it. And yes, you can do PageRank in SQL, but it's kind of unnatural, and it gets harder for things like shortest path.
Starting point is 00:37:25 So there is a bit of an idea from my supervisor, Jana Giceva, where they say, well, what should we do instead? Instead of bringing everything to SQL, like graph processing and machine learning, which you can put all into a relational system, what we can also do is use relational techniques to build the system,
Starting point is 00:37:50 but break down the operators into smaller parts. So the operators of a relational system are, for example, a join, but you can actually break down a join into much smaller operators, build the join out of those, and then expose these smaller operators, which are, for example, map, or they could be reduce or filter. You can expose these operators as the interface of the database
Starting point is 00:38:18 and then have a much more natural way of expressing machine learning, graph processing, and relational workloads in the same system. I think this is a great chance for graph processing or graph database systems, because it gives us a way to be a first-class citizen in these systems. Fantastic. So can you maybe tell us a little bit about some of the other research you did? Yeah, so mostly I've been looking at this work, of course, and then started
Starting point is 00:38:52 looking into how you would use these small operators, they're called sub-operators, for graph workloads, and came up with some initial work on how you can express a wide range of graph workloads as these sub-operators.
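As a loose illustration of the sub-operator idea (a sketch under assumed names, not the actual interface from that line of work): a relational join can be decomposed into smaller building blocks, a build step and a probe step with a caller-supplied per-match callable, and it is these smaller pieces that would be exposed so graph or ML workloads can recombine them.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Loose sketch of the sub-operator idea: instead of exposing `join` as one
// monolithic operator, expose the smaller pieces it is built from and let
// graph or ML workloads recombine them.
using Row = std::pair<std::uint64_t, std::uint64_t>;  // (key, payload)

// Sub-operator 1: materialise a lookup table (the build side of a hash join).
std::unordered_multimap<std::uint64_t, std::uint64_t>
build(const std::vector<Row>& input) {
    std::unordered_multimap<std::uint64_t, std::uint64_t> table;
    for (const auto& [key, payload] : input) table.emplace(key, payload);
    return table;
}

// Sub-operator 2: probe plus map, with the per-match work supplied as a
// callable, much like a graph kernel would supply its per-edge work.
void probe(const std::vector<Row>& input,
           const std::unordered_multimap<std::uint64_t, std::uint64_t>& table,
           const std::function<void(std::uint64_t, std::uint64_t, std::uint64_t)>& emit) {
    for (const auto& [key, payload] : input) {
        auto [lo, hi] = table.equal_range(key);
        for (auto it = lo; it != hi; ++it) emit(key, payload, it->second);
    }
}
```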
Starting point is 00:39:11 So this would have been my next work, kind of like setting up a processing system on top, or an execution layer, on top of the graph storage I had. What do you think is the biggest challenge in the graph space at the minute? The biggest challenge is integration. All the ideas, all the basic ideas for really strong graph database systems, I think, are by now out there. And it's time to pick them up and bundle them in one system
Starting point is 00:39:43 and show how they all belong together, how they can all work in one system, and just cover a wide range of topics with these. And then we can convince people that it's worthwhile building and using the system in industry. Fantastic. And last question now. What's the one key thing you want the listeners to take away
Starting point is 00:40:09 from your research and from this episode? The one key thing is that if you want to build an in-memory graph processing system and want it to be really strong, like nearly as strong as a static data structure, and have updates, a very simple and well-implemented adjacency list can do that. And the reason for that is that there are five access patterns
Starting point is 00:40:33 in graph workloads, and one of these five is so much less important than all the others. It's the only thing that makes a difference between a CSR, or a complex data structure, and an adjacency list, which is a much simpler data structure. Fantastic. So let's end it there. Thank you so much, Per. That was fantastic. If the listener is interested to know more about Per's work, I'll put links to the paper and everything in the show notes.
Starting point is 00:40:58 And we will see you all next time for some more awesome computer science research.
