Disseminate: The Computer Science Research Podcast - Roger Waleffe | MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks | #37
Episode Date: July 31, 2023. Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MariusGNN works and just how fast it goes (and how much more cost-efficient it is!) Links: Marius Project | Roger's Homepage | Roger's Twitter | EuroSys'23 Paper | Support the podcast through Buy Me a Coffee. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
A reminder that if you enjoy the show, please consider supporting us through Buy Me a Coffee.
It really helps us to continue making the podcast. It gives me great pleasure to say that today
I'm joined by Roger Waleffe, who will tell us everything we need to know about Marius GNN,
Resource Efficient Out-of-Core Training of Graph Neural Networks.
Roger is a PhD student at the University of Wisconsin-Madison. Roger, welcome to the show. Hi, yeah, great to be here.
Thanks for having me. Cool. Well, let's jump straight in then. So can you maybe tell us a
little bit more about yourself and how you became interested in systems and machine learning
research? Yeah, sure. So as you said, I'm a computer science PhD student
at the University of Wisconsin-Madison.
I'm in my fourth year now,
so getting at least sort of halfway done maybe.
Working with Professor Theo Rekatsinas,
who was at UW,
and he's since transitioned to ETH Zurich
and now at Apple, but still working with him.
And yeah, at a high level,
sort of been focusing on resource efficient training of large scale machine learning models with a lot of sort of specific focus on graph neural networks. And I come from a physics
background. So in undergrad, I kind of primarily did physics. But then toward the end of undergrad, sort of my senior year,
started transitioning towards CS because I kind of always like to program and you have this sort
of creativity in CS that you don't necessarily have in physics, right, where you get to sort
of design and build your own ideas. So yeah, so CS sort of kind of became my sort of main area of focus toward the end of undergrad.
And then, you know, once you sort of get in CS, obviously, machine learning these days is very exciting.
You know, it's becoming sort of more and more mainstream and sort of lots of machine learning models.
And sort of the models are continuing to get bigger and bigger and more expensive to train and to run, as we all know. So that makes, you know, machine learning systems and making training and inference
of these machine learning models sort of a very exciting and very timely topic.
And it also gives you sort of a nice machine learning systems, give you a nice sort of like
middle ground in CS where you get to sort of, in some sense, in my opinion, maybe have the best
of both worlds where you're sort of like building software, a lot of programming, a lot of building things that
others can use, but also you're sort of working on the algorithmic side of machine learning models,
which, you know, transformers, et cetera, that everyone's very excited about these days.
So you get to be involved in everything. And so, yeah, that's kind of, you know,
kind of evolved naturally, but sort of over time and undergrad and at the end of undergrad, just started getting more and more excited about kind of the machine learning with how the times were going.
And also just because, you know, I love to program and love to sort of get my hands dirty. So that's kind of the long-winded story. Um, you know, I don't know, like, maybe no Eureka moments,
but,
uh,
yeah,
just over time,
it's gravitated toward that field.
Awesome stuff.
Yeah.
So you said you're in your fourth year at the moment.
So,
I mean,
how much road is left to run there?
So you've got another year,
two years?
Yeah.
Something like that.
I'm trying to start thinking about finishing up here
in the next year or so, but we'll see how it goes. It's not, you know, there's no hard deadline and
there's no major rush. Well, let's get stuck into Marius GNN. So, I mean, I think I'm pronouncing
it right, it is Marius, right? Cool. Yeah. Um, so what's the backstory there with Marius? What is Marius,
like, what has it got, some meaning? I think Greek god or something. Yeah, it's a Roman general, I think.
Okay. It's, uh, yeah, so the name was chosen by some of the co-leads sort of
on the Marius project, Jason Mohoney. And so there have sort of been two works, an original work,
which we called like the Marius system,
and then Marius GNN is sort of a follow-up and an extension,
which kind of extends the original work to GNNs, graph neural networks, specifically.
Okay, cool.
So let's kind of dig into some background and set the listener up there
for what we're going to be talking about today.
So can you tell us a little bit more about kind of graph neural networks, GNNs, kind of give us some backstory there?
So I think the best place to start is sort of just like what is a graph, right?
So basically, the idea here is that we have sort of the input to these graph neural networks is a graph, which is a really sort of abstract and general data structure. But basically, you have some set
of nodes, which are like objects. And then you have some set of relationships, which connects
those nodes together, and we call those edges. So you have this graph, which is made up of nodes
and edges. And again, yeah, very flexible data structure, and often sort of the best representation
for a lot of forms of data. So for example, you know, some classic
graphs would be like a social network where you have sort of nodes are people or businesses,
et cetera, like on Facebook. And then you have sort of, you know, the edges are friendships
between them or something. But, you know, beyond sort of that sort of standard example, you have
things like road networks, right? cities connected by roads, you have even
more sort of scientific and maybe complicated graphs like molecules, you know, made up of atoms
and bonds, chemical bonds between them. So sort of, you know, a wide variety of things that can
be represented as graphs. And then often these graphs have have information associated with them.
So for example, the nodes in the graph can have some associated data, which we usually call in machine learning, features, right?
So, for example, the cities in a road network may have some information about traffic laws there or whatever, density, traffic, things like that.
And you can sort of group all that information for
each node into some sort of something that you'd call it a feature vector. For example, again,
for another example, like social network, you might have information about the business information
about the person, things that they like, things they don't like, which may, you know, give you
some valuable information about who they might be friends with, things like that.
So these are sort of additional information that you include in the graph, along with the nodes and the edges, these feature vectors for all the nodes. And basically, at a high level, that's sort
of the input that you're working with when you're working with graph neural networks. And then once
you sort of have that input, graph neural networks are just basically a specific type of neural network that's designed to operate over that input.
So just like you have, you know, convolutional neural networks designed to operate over images and, you know, transformers for text, you have graph neural networks for that sort of input data structure, which is made up of these nodes, their feature information, and then the edges between them. And, you know, it's sort of, again, a mirror
of a standard neural network where once you have that input data, the idea is to sort of
successively transform that input data through a series of layers into sort of more meaningful
representations so that you can do some sort of downstream task. So yeah, happy to give a little bit more detail
about exactly what those layers do,
which I think will be useful.
Yeah, I have a quick question
before we go into those a little bit deeper.
On the feature vectors,
you said that they're associated with the nodes.
Do the edges also have feature vectors
or do you have to put that information in the node?
Yeah, you can definitely put feature vectors
on the edges as well. That's sort of like, uh, you know, GNNs version two in some sense, right? The
simplest and kind of the main sort of set of standard benchmarks usually just has node
information, but the edges can have feature vectors as well, for sure.
Yeah, so I interrupted you when you were talking about the different layers you can
have and stuff. Yeah, no problem. No, I think
it's good to take a pause after you sort of
have the input in mind
and the basic idea, because
the layers are a little bit more complicated, but
basically what the idea is that
each GNN layer is going to take
the sort of vector representations,
these feature vectors for all
the nodes, and then the graph structure, which are the edges. And basically, you're going to use that
information to produce new vector representations for each node, which are sort of like higher
level representations. And those higher level representations, you create them by combining
the sort of original vectors, the original features of each node with the features of the neighbors of
that node. And the neighbors here are defined by nodes which are connected to each other by edges,
through edges. So, you know, it's usually helpful to give an example. So the kind of common example
that I've given in some talks is, you know, consider a graph that is made up of maybe the states of the United States.
And, you know, the node that we care about is the node for the state Wisconsin, right? And then we
have maybe an edge to neighboring nodes, Illinois and Minnesota, right? Neighbors of Wisconsin
geographically. And then maybe we have some base set of features for each state, right? So for one
for Wisconsin, one for Illinois and Minnesota could be, you know, dependent on whatever the
downstream task is, you know, population, things like that. But let's group all the features
that we have for each of those three nodes into some vector representation that we usually call
h zero. So the zero here, meaning that it's sort of the first representation of that node.
And then, you know, we label that feature vector as H. And then what the first GNN layer is going
to do is it's going to compute a higher level representation for the node Wisconsin, which
we're going to call H1. So the one being sort of after the first layer. And you compute that
higher level representation by aggregating the original
features that you had for the state Wisconsin H0 with the original features that you had for
Illinois and Minnesota, the H0 for those two nodes. And the aggregation function is really like the
sort of neural network layer. And that's what has sort of learnable weights, parameterized function
that gets updated through gradient descent, just like a standard neural network layer. And so for example, you know, that aggregation function
could be, you sort of sum the three feature vectors, three H zeros together, and then you
multiply by a weight matrix, which is learnable, a parameterized weight matrix. And then that gives
you the H1 for the node Wisconsin, for example. So that's kind of the idea. You have these original feature
vectors for each node, and then you sort of transform them into new feature vectors for
each node at each GNN layer. And that transformation is done by running the sort of feature vectors
through some parameterized aggregation function and combining them with the feature vectors
of the neighbors. So the idea is to sort of take your information that you have
about your node and to sort of augment it and transform it using the information from your
neighbors. And then once you have that sort of H1 for the node Wisconsin, you can compute H1 for
Illinois, Minnesota, all the other states, for example. And then you can sort of, again, stack
another layer on, compute sort of H2 for the node Wisconsin if you wanted to, and so on.
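To make that layer concrete, here is a toy sketch in PyTorch of the sum-then-transform aggregation described above. The dimensions, the random feature vectors, and the plain sum aggregator are purely illustrative; real GNN layers such as GraphSage add normalization and nonlinearities.

```python
import torch

# Toy version of the Wisconsin example: one GNN layer that sums a node's features
# with its neighbors' features, then multiplies by a learnable weight matrix.
feat_dim, hidden_dim = 4, 8
W = torch.nn.Linear(feat_dim, hidden_dim, bias=False)  # parameterized, learnable weight matrix

h0_wisconsin = torch.randn(feat_dim)  # base features (H0) for Wisconsin
h0_illinois = torch.randn(feat_dim)   # H0 for neighbor Illinois
h0_minnesota = torch.randn(feat_dim)  # H0 for neighbor Minnesota

# H1 for Wisconsin: aggregate its own H0 with its neighbors' H0, then transform.
h1_wisconsin = W(h0_wisconsin + h0_illinois + h0_minnesota)
```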
And you can keep doing this until you get some higher level representation that you're sort of happy with, right, which has learned enough information from yourself and from your neighbors.
And then you can use that sort of output, call it like H10, for example, for some sort of downstream task like node classification, right? So then you could use it for prediction, just like you would the output of a convolutional neural network,
right? For image classification, you can use the output of all the nodes in the graph,
like all the higher level representations for all the nodes to do some sort of graph classification,
just by sort of combining the features of all the
nodes in the graph together. Or you can do some sort of edge prediction task
if you wanted to.
So you have sort of maybe two nodes
and you have their two vector representations,
two higher level vector representations,
and then you sort of use those two vectors
to decide something about,
is there an edge between these two nodes?
Is there not an edge?
Things like that.
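As a tiny illustration of that edge-prediction step, one common decoder scores a candidate edge by comparing the two nodes' output representations; the dot-product-plus-sigmoid choice and the random vectors below are just one simple example, not necessarily the decoder used in any particular system.

```python
import torch

h_a = torch.randn(64)  # final GNN representation for node A (e.g. H2 from a two-layer GNN)
h_b = torch.randn(64)  # final GNN representation for node B

# Score the candidate edge (A, B): higher means "more likely to exist".
edge_score = torch.sigmoid(h_a @ h_b)
```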
So that's basically the gist of
GNNs. And I have a question on the graph classification at the end. So is that basically
sort of saying this graph is... I don't know, kind of in the context of the example of the states of
the US, like, what would I be classifying, basically? So maybe a better
one for that example is some sort of molecule, where you have, you know, sort of a molecule representing a graph and you want to sort of predict some information about that molecule.
Does it interact with this drug?
Does it, you know, could it be used as a candidate to, you know, sort of like drug discovery, those sort of things.
That's maybe a better example for graph classification. Node classification, a good example for that is like
citation network. You have, like, all the arXiv papers. And so each node is a paper,
and then they're connected to other papers based on the citations. And you want to sort of do like
automated grouping. So these are CS papers, these are machine learning papers,
these are physics papers, that sort of thing. Edge prediction, which is often called link
prediction as well. The common example there is you want to sort of discover new edges.
So you sort of, you want to predict new friendships, right? And so Facebook wants
to predict who you should be friends with. You want to, again, you can do sort of drug discovery
if you have sort of a different graph, but now each node is like a molecule and you have, like,
known interactions between molecules based on prior experiments, and you want to sort of predict
unknown interactions, things like that. So, all of these things, yeah, there's a lot of applications.
Awesome stuff, that's great. So yeah, let's dig into kind of the background for Marius
GNN. And so you said a lot of it is that doing this training in a
resource-efficient manner is quite challenging for large-scale graphs. Why is that? And kind of
how big are these graphs? When does it become a challenge?
Yeah. So I think, so now that we've kind of like established the notion of what we're
working with, the data that we're working with, sort of the key, the key first issue to realize
is that we have these feature vectors for every node in the graph, right? At the very least,
maybe even for the edges, as you mentioned. And then that means we sort of have to store them.
And then we have to sort of store feature vectors for very large graphs.
And that's really where the sort of start of the challenges comes in, right?
So for example, large scale graphs these days have billions of edges and nodes.
You can take one concrete example, which is the hyperlink graph from the Common Crawl web
dump from 2012.
So basically, each node in that graph is a web page.
And then the hyperlinks
between web pages give you the edges. And this is sort of even, you know, 11 years old now.
But even that hyperlink graph already had billions of nodes, 3 billion nodes, to be exact, and then
128 billion edges. And so if you say, okay, well, I have to store some sort of feature vector for
each of those 3 billion nodes, you know, if it's a sort of 100 dimensional or 1000 dimensional vector, then you have sort of,
you know, 100 or 1,000 floats times 3 billion, right? So you have a large amount of data to store.
I think we used 50 or 100 when we were training with it. And we had like 3.5
terabytes of total data just to store the input, which means storing basically the node features and the edges.
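As a rough back-of-envelope check of those sizes (the feature dimension, float32 features, and int64 node IDs below are assumptions for illustration):

```python
nodes = 3_000_000_000        # ~3 billion web pages
edges = 128_000_000_000      # ~128 billion hyperlinks
feat_dim = 100               # e.g. 100-dimensional float32 features

feature_tb = nodes * feat_dim * 4 / 1e12   # ~1.2 TB of node features
edge_tb = edges * 2 * 8 / 1e12             # ~2.0 TB of (src, dst) pairs as int64
print(feature_tb, edge_tb, feature_tb + edge_tb)  # on the order of the ~3.5 TB mentioned above
```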
And then once you're sort of working with that kind of data size, then that gives you sort of a host of challenges of sort of systems challenges and sort of training challenges.
The first being that, you know, it exceeds the memory capacity of any GPU
accelerator that we kind of currently have on the market. So, you know, you can't store the graph in
GPU memory, and so you either have to sort of offload to CPU memory, but even CPU memory,
you know, getting 3.4 terabytes of CPU memory, that's also a challenge, right? So you
potentially even have to go even further where
you have to scale out to sort of a distributed CPU memory, or sort of a more complex single
machine. And then at the same time, though, you can't just sort of abandon the GPUs completely,
because you do need to sort of use those things to do the matrix multipliers that are in all the
sort of GNN layer transformations.
So very quickly, you are sort of forced to have some sort of mixed CPU,
maybe even a distributed CPU, GPU training environment,
where you have to sort of move data between sort of all hierarchies,
all memory hierarchies, and sort of even between machines, between CPU and GPU,
et cetera. And that sort of introduces kind of some of the key challenges.
How do you go about doing all that sort of data movement and et cetera? So at a high level,
that's the sort of main challenge, right? Is that very quickly you get a larger amount of data with these graphs and that kind of causes you to, you're forced to use these sort of complicated
full memory hierarchy architectures, etc. Yeah, nice. So kind of given that then,
given that sort of the lay of the land there, what are the kind of state-of-the-art current
solutions out there in a distributed fashion? How are they architected and kind of what are
the problems with them? Basically, they sort of opt for that architecture that I just described, which is basically,
you know, you can start in the GPU memory for small graphs, and then you sort of fall
back to CPU memory for larger graphs.
And then for even larger graphs, you go to sort of a distributed CPU memory where you
have maybe the graph partitioned across a sort of set of CPU machines.
And then you sort of create, you know, mini batches for training in this sort of CPUs,
in these distributed CPUs, and you sort of transfer them to the GPUs for the sort of model computation.
But we found sort of the existing solutions at the time when we were sort of starting this Marius GNN work, primarily sort of distributed DGL or DGL, we found that they were pretty expensive.
And basically, there's sort of two main reasons for that. Because the first is that no matter
what you do, when you make this sort of decision to just sort of scale out and get a lot of like
sort of distributed CPU memory to scale to these large
graphs, then you sort of are forced to pay for those resources no matter what. Right. So if you
think about like on AWS, the sort of standard GPU machines that have like V100 GPUs, they only
have maybe a maximum of 500 gigabytes of memory. So to scale to the hyperlink graph, you would already need
on the order of 10 of those
machines to get into this sort of three or four terabyte range. And each of those is already,
you know, $24 an hour. So you're very quickly paying many hundreds of dollars per hour.
And then the second issue is that you sort of have low GPU utilization as sort of by default in these sort of complex architectures because
you have this sort of distributed CPU memory, and then you have to sort of transfer, create
mini-batches across these multiple machines, which may require communication between the
machines, and then you have to transfer those mini-batches to the GPU, etc.
So we found that these systems, not just DGL, but PyTorch Geometric and some of
the other popular ones, they have low GPU utilization, maybe, you know, 30, 40%.
And that GPU utilization gets even lower as the deployment becomes more complicated. So as you
scale to larger graphs and you have multiple machines and more machines, then the GPU utilization sort
of slowly continues to drop, you know, down to maybe 30, 20%. So basically, in the end, you end
up having a lot of these expensive resources that you're paying for, which aren't fully utilized.
So it's sort of not a great, it's not a good use of the money in some sense, right?
And then that's sort of the,
that's kind of the kind of key problem potentially with the existing solutions.
Nice, yeah.
I mean, at $24 an hour,
I know why Jeff Bezos is so rich, right?
I mean, that's just crazy.
There are some, like Lambda AI, trying to come along
with some of these cheaper cloud-based VMs.
But yeah, generally those costs add up quickly for sure.
Yeah, yeah.
That's definitely something you wouldn't want to leave running
longer than you had to as well.
Exactly, exactly.
You want to minimize the time that you have to run those things.
So I guess kind of with all these problems in mind,
you revisited this question of when do we need distributed training for GNNs?
So I guess what's the answer?
So the idea behind Marius GNN is to sort of first minimize the costs by optimizing as much as you can on a single machine. And then, if you have all of the resources fully utilized on a single machine and you still need to further reduce runtime, that's when you need distributed training.
But only then, right?
So, and then the question in Marius GNN is basically, how do you sort of fully utilize the resources on a single machine?
That's kind of the key question.
Yeah, it reminds me of the COST paper, right?
And I think that was focusing on stream processing systems,
was it?
Or was it graph processing systems?
I can't quite remember.
Yeah, it was graph processing,
but sort of like standard graph algorithms,
like PageRank, those sort of things.
Yeah, yeah.
Graph algos, right?
Yeah.
Cool.
So let's talk about Marius GNN.
So tell us, how does it work?
So maybe we can kick things off with the system architecture,
and then we can talk into some of its key features.
Yeah.
So I think maybe I'll spend a little bit of time
talking a little bit more about a little bit of background
of sort of what are the key challenges first
once you move off the GPU, right?
So we talked about how we have these sort of node features for every node, and then,
you know, those can sort of quickly exceed the GPU memory capacity. So I'll just give maybe a couple
minutes of, like, you know, once you put those in the CPU, what do you have to do, and then that'll sort
of lead us into the Marius GNN system
architecture. So once you sort of move your graph and the node feature representations,
and you sort of store those in CPU memory, then you have to sort of do this mixed CPU,
GPU mini batch training. And basically, the idea is that you start with some sort of mini batch,
which is sort of like a set of training examples that you sample from the graph.
So those can be a set of sort of maybe nodes, or they can be edges, depending on the task. But let's consider it's just a set of nodes. So you say, okay, I'm going to do, you
know, node classification for these set of nodes. So I have some labels for them. And then, you
know, I have their base, their original features. And so what do you need to do? So first you need to sample some of their neighbors
so that you can sort of have these required information
for the GNN layers, right?
The GNN layers need to know some of the neighbors
to be able to do these aggregation functions.
And then once you sort of have those neighbors sampled,
you can then get the sort of initial H0 feature vectors
for all the original nodes that you
started with in your mini batch, as well as the neighbors, right? You need the feature vectors
for the neighbors. And that whole process is sort of like mini batch preparation.
And you need to do that all on the CPU because that's where your data is stored.
And only once you've done that, then you can transfer all of that information to the GPU
to actually be able to do the model computation.
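As a minimal sketch of that CPU-side mini-batch preparation followed by the GPU transfer (the function and argument names are illustrative stand-ins, not MariusGNN's actual API):

```python
import torch

def prepare_and_ship_minibatch(batch_nodes, features_cpu, sample_neighbors, device):
    # 1. On the CPU: sample the (multi-hop) neighbors of the batch nodes.
    all_nodes = sample_neighbors(batch_nodes)     # index tensor: batch nodes + sampled neighbors
    # 2. On the CPU: gather the H0 feature vectors for all of those nodes.
    h0 = features_cpu[all_nodes]
    # 3. Only now move the prepared mini-batch to the GPU for the forward/backward pass.
    return all_nodes.to(device), h0.to(device, non_blocking=True)
```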
So that kind of, you know, we've been kind of hinting at: once you move off the GPU, you have all of these sort of complex steps that need to be done on the CPUs.
And then this data movement, which can limit the GPU utilization.
But that's just the idea in a little bit more detail. And then, you know, the kind of key other thing there is that this neighborhood sampling
is sort of specific to GNNs, right?
So every time you want to compute the GNN output for a specific node, you need to go
and sample the neighbors.
So that's sort of a new thing in GNNs compared to maybe like image classification, for example.
And it's important to note that, for these large
graphs, it's often required to actually sample the neighbors. In other words, not use all the neighbors of a
specific node, because for really large graphs, you know, one node, think about like the Twitter
graph with, like, you know, users and then their followers, some nodes like, you know,
Obama or whatever, Elon Musk, right, they're going to have like millions and millions of followers.
So you can't use all of their neighbors because then just their neighbors alone are like too big for the GPU memory, right?
So this is sort of the difficulty in a little bit more detail about, you know, once you move off the GPU is you have this data movement issue that you have to sort of, you know, create these things
off device and then transfer these mini batches. And then creating these mini batches themselves
is not necessarily trivial because of this neighborhood sampling step where for all the
nodes in the mini batch, you have to like sort of look up their neighbors, maybe do some random
sampling process, etc. So with that in mind, the sort of key ideas in Marius GNN are to sort of
first make that mixed CPU GPU minibatch training as efficient as possible, right? We want to sort
of maximize the GPU utilization. We want to get 100% GPU utilization, which means the GPU is always
sort of busy computing the forward pass and the backward pass of the GNN,
which means we need to figure out how to minimize the overhead of this data movement and this
mini batch preparation on the CPU as much as possible. And then the
second key idea is that once we have an even larger graph, right, which sort of exceeds the
memory capacity of the CPU, then we don't want to go to multiple
CPUs, distributed CPU memory, because then we're going to need to sort of coordinate mini batch
preparation and data movement between CPUs themselves before having to transfer to the GPU.
So instead, we're going to take a different approach, which is we're going to store
the sort of even larger graph that exceeds
the CPU memory capacity, we're going to store that on disk. And therefore, we don't have to
actually pay for any more CPU resources to scale beyond the graphs that fit in CPU memory.
So those are sort of the key ideas. And then those kind of give you naturally,
the system architecture in Marius GNN, which is basically sort of a
full memory hierarchy and out of core pipelined training architecture, which means that we sort
of use disk, we use CPU memory, and we use the GPU memory and the GPU, so three layers of the
hierarchy. And there's sort of two pipelines. So there's
sort of the first piece, which is the storage layer, which is basically responsible for
transferring data from disk to CPU memory, and sort of pipelining and overlapping data
movement there. And then there's sort of the CPU to GPU memory piece of the architecture,
which is responsible for, you know,
creating mini batches in the CPU
and then transferring them to the GPU
and sort of optimizing that piece of the architecture,
pipelining that,
making sure the mini batch preparation is fast,
making sure the data movement is pipelined, et cetera,
so that you can sort of maximize your GPU utilization.
And then, yeah, we usually kind of call that piece the processing layer, which is, yeah,
responsible for the sort of mixed CPU, GPU, minibatch training. And basically the idea for
each of these pieces is that you sort of break the architecture into sort of a set of stages.
So for example, if we think about the CPU to GPU stage, the processing layer,
in order to sort of, you know, one piece of the puzzle for maximizing GPU utilization is to sort
of pipeline all of these, these pieces of the processing layer, so that you can sort of overlap
all of the steps and sort of overlap data movement and mini batch preparation with the
mini batch computation on the GPU. So you basically separate all of these sort of higher level pieces of the system architecture,
the storage layer and the processing layer.
You separate their individual components into stages, pipeline stages, and then you connect
them by queues and you give each sort of stage its own set of workers.
So for example, for the processing layer, you might have
a set of CPU workers, which is responsible for creating mini batches on the CPU,
you might have another set of workers, which is responsible for transferring mini batches from
the CPU to the GPU. And then you might have another set of workers, which is actually the
GPU worker doing the computation themselves. And then each sort of set of workers can in parallel, you know, create a mini batch while
another one is in parallel transferring the previous mini batch to the GPU. And they can
just read and write their input and output to a sort of set of queues. So they, you know, they don't have
to worry about sort of synchronization, right, sort of a standard kind of like multi-consumer, multi-producer, you know, pipeline.
And this is sort of the first piece of the puzzle
to sort of maximize GPU utilization.
So that's kind of the high-level overview
of the system architecture.
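Here is a minimal sketch of that staged, queue-connected pipeline; the stage names, worker counts, queue sizes, and callbacks are illustrative knobs, not MariusGNN's actual configuration or API.

```python
import queue
import threading

prep_to_transfer = queue.Queue(maxsize=4)  # bounded queues give natural back pressure
transfer_to_gpu = queue.Queue(maxsize=4)

def prep_worker(sample_minibatch):         # CPU stage: neighbor sampling + feature lookup
    while True:
        prep_to_transfer.put(sample_minibatch())  # blocks (sleeps) when the queue is full

def transfer_worker(to_device):            # host-to-GPU copy stage
    while True:
        transfer_to_gpu.put(to_device(prep_to_transfer.get()))

def gpu_worker(train_step):                # GPU stage: forward + backward pass
    while True:
        train_step(transfer_to_gpu.get())

# Each stage gets its own pool of workers, e.g. more CPU prep workers than transfer workers:
# for _ in range(4):
#     threading.Thread(target=prep_worker, args=(my_sampler,), daemon=True).start()
```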
Maybe that was a lot said at once.
I don't know if you have any questions or...
I do have one question.
So I'm kind of imagining there's
kind of a production line in my head, right? Exactly, a conveyor belt, right? Exactly.
Do you have, like... what's the communication like between the
different stages? Is there like an overseer slash coordinator that's sort of informing these workers
to speed up or slow down or whatever, to kind of keep the whole system balanced? Or is that not
really... Yeah, there's no coordinator to do that, but there are ways to sort of
throttle the pipeline at different pieces. So the way we do it is, like, in the configuration file
when you sort of launch Marius or Marius GNN, you basically specify the number of workers for each
of these stages, and then you also specify
the queue sizes, sorry, the queue sizes between each stage. So for example, you know,
if the CPU mini batch preparation workers are, you know, really fast, and they're, you know,
putting a ton of stuff on the queue that's for the transfer worker, you know, once that queue
fills up, according to some size that
you set, then the mini batch preparation workers have to wait, right? So they can't keep
creating mini batches and putting them on this queue, and then they just, you know,
they just sleep, basically, right, until there's space again. So there's back pressure
there from the queue saying, like, slow down a little bit, give me some information? Okay, right.
Yeah, and then there's also a sort of like global back pressure, which is there's a global limit
on the number of mini batches that can be in the pipeline at once, right? Okay. So that's, and
that's, yeah, that's useful for sort of like staleness issues, which are... So, I mean, maybe this
isn't really... I'm going to go off on a tangent a little bit here. But is the resource pool that each of these stages has,
is that dynamic at all, or is it fixed?
Is it up front, it's a config parameter you set,
or is there a way for that to be dynamic at runtime and change?
Yeah, so right now it's fixed in the config up front,
and it's even sort of fixed without necessarily any information
on how to set that, right?
It's sort of like, yeah, there's no like magical, exactly.
But we have, you know, thought about and even explored a little bit making it dynamic and making it dynamic is certainly doable.
It's just, you know, it hasn't made the top of our priority list, but you can definitely imagine dynamically measuring the size of these queues,
you know, figuring out which one is the bottleneck,
reallocating resources to sort of,
you know, reduce the bottleneck, et cetera.
And, you know, that's something
that can be done for sure.
Yeah, one for future work.
Yeah, exactly.
Yeah, so let's go into it
in a little bit more detail.
And so there's this data structure you use in Marius called DENSE.
So what does the acronym stand for?
I'm guessing it's an acronym, right?
And then how does it work?
Yeah, it stands for the Delta Encoding of Neighborhood Samples.
Nice.
As we've been discussing a little bit,
you have this mixed CPU-GPU mini-batch training.
That's sort of the first piece of the puzzle, once you have a large graph that exceeds the GPU memory capacity.
And when we've sort of said a little bit, you know, we store the graph, the nodes,
their feature vectors in CPU memory. And then you sort of sample mini batches, which basically,
think of that, again, just as a set of nodes. For each node,
then, you need to sample their neighbors. You're still on the CPU, right? Because you still need
to sample the neighbors on the CPU first, so that you can get their feature vectors from the
CPU before you transfer to the GPU. And so basically, this dense data structure comes in
in that piece of the puzzle, in this mini batch preparation process.
And the reason it's sort of needed is that that neighborhood sampling step is very expensive. It can take 1200 milliseconds or something to sample the neighbors.
And it only takes 100 milliseconds maybe to do the computation on the GPU.
So even if you have this pipeline where you're sort of allocating more workers to do the mini batch processing and the neighborhood sampling,
you still have this sort of large differential to where even with, you know,
a lot of workers running in parallel, it's going to be hard to saturate the GPU, because the GPU is
by far not the bottleneck, the CPU preparation, and even the data transfer is much slower than
the GPU computation. So in order to sort of, you know, saturate the GPU, it's more than just
pipelining, and you really need to bring down the neighborhood sampling time as much as possible.
And why is this neighborhood sampling sort of so slow?
Well, fundamentally, these GNNs are up against like a sort of exponential scaling issue.
So if you have a multi-layered GNN, you need to actually sample a multi-hop neighborhood,
which grows exponentially in size. So what do I mean by that? So for example, if we have a
two-layered GNN, that means we want to compute H2 for some node, right? Let's say node A.
We start with node A in our mini batch. And we want to compute like the output
of a two layer GNN for that node to do like node classification, let's say. So that means we want
to compute the second layer representation H2, as we've kind of been calling them. And the
representation H2 for node A depends on the representation H1 for node A, but it also depends on the representation H1 for the neighbors
of node A. And all of the representations H1 are also, you know,
sort of computed representations, right? Which means that the H1 representation for node A
depends on the H0 representation for node A.
That's fine.
But the H1 representation for the neighbors of node A, which we need to compute H2 for
node A, those representations depend on their neighbors, right?
So in other words, if we have a node B, which is a neighbor of node A, we need the neighbors
of node B in order to compute H1
for node B, right? So this is sort of like a two hop neighbor of node A, you can think of, right?
You go from node A to its neighbors, to their neighbors. And that's how you compute sort of a
two layer GNN. And you can sort of, you can think of this as, you know as it continues as you go.
So if you have a three-layer network, three-layer GNN,
you want to compute H3 for node A, for some node A,
then you need the three-hop neighborhood of node A.
So when you're doing the mini-batch preparation,
you need to know this ahead of time, how many layers you need to compute.
And therefore, that tells you how many sort of hops away from
these nodes you need to sample. And these hops, these multi-hop neighborhoods grow exponentially,
right? So because if you start with, let's say, you know, 10 nodes, and then you sample 10 neighbors
for each of those nodes, well, then now you have 100 nodes. And then if again, you sample sort of
10 neighbors
for each of those 100 nodes, well, now you have 1000 nodes, right? So you sort of very quickly
have this exponential increase in the sort of number of nodes which are participating in this
mini batch, because of how these multi hop neighborhoods are constructed. So yeah, so just
to reiterate, right, these, these multi layered GNNs, they require these multi-hop
neighborhoods, which are constructed sort of step by step, right?
You start with a set of nodes, you sample their neighbors.
And then the second step is to sample the neighbors of the first set of neighbors, right?
And so on.
And this sort of whole set of neighborhood sampling, this is really sort of why it's
slow and why it's sort of the kind of key bottleneck because of this exponential explosion and because this is sort of a very complicated process, actually, in the end.
So basically, that becomes, yeah, that becomes a big bottleneck.
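To see where that blow-up comes from, here is a deliberately naive multi-hop sampler of the kind the baselines effectively perform; `graph` is assumed to be a plain adjacency-list dict, purely for illustration.

```python
import random

def naive_multihop_sample(graph, seed_nodes, fanout, num_hops):
    """Re-samples one-hop neighbors for every frontier node at every hop,
    so the frontier grows roughly as fanout ** hop (10 -> 100 -> 1000 -> ...)."""
    frontier, hops = list(seed_nodes), []
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            neighbors = graph[node]
            k = min(fanout, len(neighbors))
            next_frontier.extend(random.sample(neighbors, k))
        hops.append(next_frontier)
        frontier = next_frontier
    return hops
```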
Like I said, you have PyTorch Geometric,
it's like over 10 times slower
than the computation itself. Same thing with DGL. Does that make sense? It's a little hard to see without these
sort of, like, diagrams, but hopefully... Yeah, no, it amplifies, right? Like, it's kind of an
inherent property of graphs. I mean, there's always that kind of example of, like, in only six hops
you can go across the whole graph. Exactly, right. So it's the same sort of problem manifesting there, in the sense that the more layers you add, if you want to go to, like, H10,
then you're gonna have, like... it's gonna explode, right? Yeah, you're gonna have the whole graph.
Exactly. After a few layers you have so many nodes participating in this multi-hop neighborhood
that the sampling time really just blows up.
Yeah, so once you sort of have that in mind, then the sort of key idea behind Dense is to recognize that in this sort of
expansion, as you're sort of starting from a small set of nodes and you're expanding out
in the graph, as you sample more and more hops away from those nodes, the key observation is
that you're going to return to nodes that you've already seen.
In other words, because of how the graph is structured, you know, the neighbors of some node,
the two hop neighbors of some node may be a one hop neighbor of another node, right?
And that's the sort of key sort of observation that we try to take advantage of with Dense.
Another way to say that is when you do these sort of multi-hop sampling in existing systems,
you perform a lot of redundant one-hop sampling to construct these multi-hop neighborhoods,
right? Because these multi-hop neighborhoods are constructed by first sampling the one-hop
neighbors for a set of nodes and then sampling the one-hop neighbors for their neighbors, and so on, you know, sort of just building one hop at a time. And the observation that
we're talking about here is you sort of build the one-hop neighborhood of the same node multiple
times, because it reappears at different pieces in this overall multi-hop neighborhood. You basically
are expanding the one-hop neighbors of the same nodes many, many times, and that's a lot
of redundant work. So that's what we're going to try to get rid of with Dense. I'll try to give, like,
just another quick example of this redundancy. Let's say we have two nodes, A and B,
and, again, we're trying to sample their two-hop neighborhoods. For both of those nodes, now we
have a mini batch, they're both part of our mini batch, and we want the two-hop neighborhood.
So if we sample the one hop neighbors of node A to start with, let's say that it happens to
have a neighbor, B is its neighbor, right? So now, in the second step, when we're sampling
the neighbors of the neighbors of A, we're going to sample the one-hop neighbors of node B.
But we would have already done that in the first step
because we needed the one-hop neighbors of node B
in the first step, right?
So these nodes reappear,
and you sort of end up sampling one-hop neighbors
for the same nodes many, many times
in these multi-hop neighborhoods.
Does that make sense without any figures?
It's a little hard
to follow these sort of, like... But it makes sense when you think about a graph, right? If
you're traversing a graph, but you're traversing from many nodes at once. Like, think about doing
a random walk, but you're starting, like, 10 random walks from 10 nodes. Those
random walks are likely to sort of, like, maybe,
you know, come across each other, right? At which point you can sort of reuse the old random walk
from that point, or you can sort of start your own new random walk again. So yeah, the idea behind
Dense is to sort of minimize this redundancy of sampling one-hop neighbors for the same nodes. And the way to do that is just
to cache and reuse those sampled one-hop neighbors for nodes that you've
already seen earlier during this multi-hop neighborhood construction. And that's the high-level
idea. Nice, that makes total sense. Yeah, yeah. And so basically Dense is this data
structure that we use to do that caching and reusing and tracking,
because you need to sort of track which nodes you've seen while you're constructing these multi-hop neighborhoods
and you need to track their one-hop neighborhoods.
And basically Dense is the data structure to sort of do that.
And the way it does that is by sort of, again, building these multi-hop neighborhoods sort of step by step.
And each step now is built using an incremental delta.
So you sort of sample the one hop neighbors for some nodes, A and B, for example, in step one.
And then in step two, you're going to sample, you know, the one hop neighbors for the neighbors of nodes A and B,
but only for the neighbors for which you have yet to sample one-hop neighbors, right? So each delta
is a new set of nodes, and those are the new ones that you're going to sample one-hop
neighbors for in the next step. But that delta is completely new nodes, you haven't seen them before
in the previous sort of steps.
And that's basically the idea behind Dense.
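A minimal sketch of that delta idea (cache one-hop samples and only sample for nodes never seen before) follows; it illustrates the principle rather than MariusGNN's actual DENSE implementation, and `sample_one_hop` is a hypothetical callback.

```python
def delta_multihop_sample(seed_nodes, num_hops, sample_one_hop):
    one_hop_cache = {}            # node -> its sampled one-hop neighbors (reused when seen again)
    frontier = set(seed_nodes)
    for _ in range(num_hops):
        delta = [n for n in frontier if n not in one_hop_cache]  # only the genuinely new nodes
        for node in delta:
            one_hop_cache[node] = sample_one_hop(node)           # sample just the delta
        # Next frontier: neighbors of the current frontier, cached or freshly sampled.
        frontier = {nbr for node in frontier for nbr in one_hop_cache[node]}
    return one_hop_cache          # enough to reconstruct the full multi-hop neighborhood
```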
And, you know, we spent a lot of time
developing sort of efficient algorithms as well
to go along with constructing it
or to go along with this high-level idea, right?
You also still need to be able to construct it quickly
on the CPU, and then you also need to be able
to use this data structure on the GPU, right?
Once you have Dense, you still need to be able to read it data structure on the GPU, right? Once you have Dense, you still need to be able to
read it and find out the multi-hop
neighborhoods on the GPU because
you need that for the computation.
And so we have some sort of parallel algorithms
on the CPU to construct Dense,
and it's sort of nice for
GPU kernels as well.
And yeah,
it worked really quite well in the end.
It allowed us to sort of sample, you know,
up to 15 times faster for three- and four-layer GNNs, and it also allowed us to even do the
computation faster, because it sort of gave us this compressed format of these multi-hop
neighborhoods and was very amenable to, you know, sort of optimized GPU kernels. And the scaling:
it works even better as the number of layers
increases, which makes sense, right?
The more layers you have, the more redundancy you have in these multi-hop neighborhoods.
As you said, you know, once you get to sort of five, six layers, you have almost the whole
graph.
And so you can see Dense actually outperform like optimized sampling kernels that are specifically developed to minimize sampling time.
Like for example,
those in NextDoor. And, you know,
Dense starts outperforming them after three,
four layers,
just because it scales so much better with respect to the number of GNN layers.
Awesome.
So I mean,
there's kind of,
there's two other sort of features that you mentioned in your paper about
Marius.
Maybe you can give us sort of like a rundown of how the partition replacement policies work and
all of that, and then we can talk some more about the numbers and the experiments.
Let's see how fast Marius goes.
Yeah.
So once you have that pipeline and dense data structure, that gives you the CPU to GPU piece.
And then the next piece is, well, what if of the CPU to GPU piece, right? And then the next
piece is like, well, what if the graph doesn't fit in CPU memory? So then again, as we said,
you have sort of two options. You can get a bunch of CPUs and do some sort of like distributed
CPU memory and then distributed training process. But in Marius, we sort of kind of continue with this
theme we have where we're like going to, you know, extract everything we can out of a single machine.
And with a single machine, you always have this sort of disk available to you that's not really made use of, right?
So you're sort of paying for it.
It's not the dominant cost.
The dominant cost is the GPUs and then the CPUs.
But you do have this available.
And so, you know, why not try to use it? And basically, the high-level idea behind disk-based training is that now you're going to store
your sort of graph and the base node feature vectors on disk. But unfortunately, it's not as
simple as just sort of having a pipeline where you do mini batch preparation on disk and then,
you know, move that mini batch to the CPU, and then move the mini batch to the GPU. Because first of all, you can't really do mini batch preparation on a disk, right? It's
not a processing unit. And then also because of the fact that mini batch preparation requires
an enormous amount of random access, right? So you're sort of randomly sampling the training
examples in the mini batch, and then you do all this neighborhood sampling, which is all random. And the disk just doesn't have the ability to do random access at that
granularity. So what you have to do to use disk is you have to do this, you have to move data
between disk and CPU memory using these larger sequential reads and writes. So the way we do that
is we store the sort of the nodes and the edges of the graph on disk and the base feature vectors for each node.
But we partition them.
We partition the nodes and their feature vectors into these sequential blocks.
And we also group the edges according to those partitions.
And then once you have that data layout on disk, you can then load subsets of the graph into CPU memory by sequentially reading these blocks.
So you read a sequentially read a subset of those partitions and their corresponding edges,
and you bring that into memory. And that's a much more efficient way to sort of transfer
data between disk and CPU memory. And then once you have the sort of subgraph and CPU memory,
then you can do the same sort of mixed CPU-GPU training on that in-memory subgraph that we've been talking about, using Dense, using the pipeline, etc. And you can do random access to sample your training examples and your multi-hop neighborhoods. But then, what do you do when not all partitions fit in CPU memory at once?
And that's where you get into this notion of having to sort of load some subset of the partitions in CPU memory, train on the training examples in that subset.
And then you have to swap some partition back out to disk and bring a new partition in.
And so you get this partition swapping such that over a sequence of swaps, eventually,
you bring in the whole graph from disk into CPU memory by swapping partitions in and out.
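As a sketch of that swapping loop (the callables and the replacement-policy interface are hypothetical stand-ins, not MariusGNN's actual API):

```python
def disk_based_epoch(num_partitions, buffer_size, load_partition, save_partition,
                     next_swap, train_on_subgraph):
    # A subset of node partitions lives in CPU memory; the rest stays on disk and is
    # exchanged with large sequential reads/writes rather than random access.
    buffer = {p: load_partition(p) for p in range(buffer_size)}
    while True:
        train_on_subgraph(buffer)                      # train on edges inside the in-memory subgraph
        swap = next_swap(set(buffer), num_partitions)  # replacement policy picks the next exchange
        if swap is None:
            break                                      # the whole graph has been covered this epoch
        evict, admit = swap
        save_partition(evict, buffer.pop(evict))       # sequential write back to disk
        buffer[admit] = load_partition(admit)          # sequential read of the incoming partition
```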
And that leads to the notion of having then some partition replacement policy,
which decides what partitions you're going to have in memory and when you're going to have the partitions in memory. And then you sort of, the first thing you come to when you think about,
okay, how do I design this partition replacement policy is that it should sort of minimize the
number of swaps and therefore correspondingly minimize the disk IO and the training time.
And so, you know,
we developed a policy that does that, that minimizes the number of swaps. It's called
BETA. That was sort of part of the original Marius work. But what we found in Marius GNN,
when we tried to use that policy for more complicated GNN models, is that it leads to
lower accuracy compared to training when you just have the full graph in
memory. So this setup of moving these partitions between disk and CPU memory actually leads to
sort of an accuracy drop compared to just buying a bigger machine with a bigger CPU memory,
basically. And that's where things get really kind of interesting with all the disk-based stuff in
Marius GNN: to sort of understand why that is the case and then how you can sort of improve that. And basically at a high level, why does
that occur? Well, it occurs because you're focusing heavily on the in-memory subgraph to generate your
training examples and their neighborhoods. So you're only doing random access to this sub graph instead of the
full graph. And that leads to sort of a biased training process. So the idea, the sort of
analogy here is to imagine that you have all of your images on disk, and then you just load in
the images of, like, the cats into CPU memory, you train on those randomly, then you swap those out,
you load in the images of dogs, you train on those randomly,
you swap those out,
and so on and so forth.
Instead of just putting all the images from all the classes in memory together and randomly
sampling from all of them.
So that's sort of the lack of randomness that you get by default almost because of this
only random access to the CPU memory.
And then you sort of do all the training on this,
the data you have in CPU memory, you swap some out, you bring some new in, you focus on that
for a little bit, and then you do another swap, etc. So you get into this predicament, which is
that when you use disk, in order to have high throughput, etc, you need to sort of do these
sequential reads and writes and do these larger swaps, not randomly access disk.
But then at the same time, in order to get high accuracy,
you need to randomly access the whole graph.
So it's actually almost,
it's like a conflict that you can't really necessarily solve, because there are
two fundamentally different, you know,
requirements for high accuracy and then also for high throughput.
But in Marius GNN,
we try to make a more flexible disk-based policy
that allows us to sort of get the best of both worlds in some sense.
Cool. Yeah. So should we talk numbers and experiments?
Yeah, sure.
So yeah, tell us about how you went about, I guess, evaluating Marius GNN
and what you compared it against and what your findings were.
Yeah. So at the time we ran these experiments, sort of DGL, the Deep Graph Library, and PyTorch Geometric were kind of the two most popular systems.
So we just compared, basically, the main experiments were end-to-end training in Marius GNN directly against those two systems. And we focused on node classification and link prediction tasks
using sort of the largest graphs available, which are sort of the open graph benchmark graphs.
And you have about 100 million nodes and maybe a few hundred million to a billion edges.
And we focused just kind of on common models like GraphSage, the GraphSage GNN, and Graph Attention Network as well.
And basically, the setup we used, yeah, we used AWS GPU machines.
And for baselines, basically, since they require the graph to be in CPU memory,
we basically, we chose the smallest machine which had enough CPU memory
to actually store the graph. But that smallest machine, you know,
may have been a few hundred gigabytes of CPU memory
and potentially came with multiple GPUs.
So we allowed the baselines to use those multiple GPUs
if they required a sufficiently large machine
that had multiple GPUs.
And for Marius GNN, what we did is we ran two versions.
So we ran one version, which was on that same machine
as what the baselines required to store the graph in CPU memory.
And so then we also just used sort of CPU and GPU training,
but we only used a single GPU for Marius GNN.
And then we also ran a second version of Marius GNN,
which was
the disk-based training version where we used the smallest GPU machine on AWS,
which didn't have enough CPU memory to store any of these graphs. But that's okay for the
disk-based Marius GNN, because we stored it on disk and then we used CPU memory as sort of that partition
cache with the partition replacement policy.
That's sort of the experiment setup.
And then the results, what we found basically is that when Marius GNN is on the same machine
as each of the baselines, and they're all sort of doing CPU GPU training, Marius was
maybe three to four times faster than the baselines, even though it was only using a single GPU.
And the baselines were often using maybe four or even eight GPUs.
And that's because of the sort of reduced sampling time with dense and then the pipelining.
So we were able to sort of get very high GPU utilization.
Nowadays, we actually have 100% GPU utilization with Marius GNN.
And then the disk-based training gives you sort of a different option.
So it was also about a couple times faster than the baselines,
but then it was significantly cheaper as well
because you're on a much cheaper machine,
maybe only a $3 machine instead of a $24 an hour machine.
So we were able to
reduce the training costs between one and two orders of magnitude up to 64 times, if I remember
correctly. And this was pretty consistent across both node classification and link prediction
that we saw these sorts of numbers. I mean, those are some great numbers.
Great. Are there any areas where it's like where
Mario's GNN is, is like suboptimal and kind of what are the limitations of it at the moment?
You know, we didn't talk about it in super detail, but even with the new partition replacement policies that we have and the new techniques in MariusGNN to help disk-based accuracy, there is still the potential that disk-based training gets you a little bit lower accuracy than if you were to train with the full graph in memory.
Of course, if that's an issue for you,
then MariusGNN has the option to do CPU-GPU training,
but you'll have to pay for enough CPU memory to do that.
And one problem that Marius does have
is if you do pay for a large CPU machine
to do CPU-GPU training, then you may get a lot of GPUs. And at the time, MariusGNN only had single-GPU support. So we've since worked on that and gotten multi-GPU support, but that's relatively recent and it's not quite available yet on GitHub. But that's coming. And then there are some other limitations.
One limitation compared to DGL and PyG is sort of like development support, documentation,
ease of use in some sense; like, Marius is written in C++.
And we've been working on Python bindings, but they're not fully supported yet.
So it definitely can be a little bit harder to work with than those systems,
especially if you need to sort of write some custom models or code or whatever.
But we're trying to sort of work on the engineering
and the usability aspect as well.
And then the last piece is, it doesn't quite have a distributed, multi-machine implementation yet either. But that's another thing we have on our list to do.
I see you're preempting my next question, which is: where do you go next with Marius and the overall project? I guess the focus is on usability and working on multi-machine support, those kinds of things. What's on the roadmap for the next, yeah, I don't know, six to seven months?
The next like sort of main research piece is like a distributed version, you know, so
how can you maintain a hundred percent GPU utilization when you have many GPUs and many
machines?
That's sort of the next research piece.
And then, yeah, there's a lot of engineering work that can always be done on everything from the installation, to the ease of writing your own models, to pre-processing and post-processing, all those sorts of things that people want when they want to actually use it for an application.
Yeah, for sure. I mean, that's sort of one angle where
it's harder to get papers published when the work is focused on usability. People care about numbers and things going faster, not "this piece of software is more usable", right?
Right. That's sort of a PhD versus startup difference a little bit, right? You know, in your PhD, there aren't necessarily bonus points for usability; you need to get papers. But that's it, right? Yeah.
And cool, so yeah, I guess this next question is: what impact do you think it can have longer term, and also what impact has MariusGNN had already? Have you seen much use of it already? Are people out there using it, or what's the feedback like?
Yeah, so our hope is that, you know, we've open sourced it and, like we said, we're working on making it easy to use. It's not perfect yet, but we're working on it. And we hope that allows researchers in many different scientific areas to easily and quickly train GNNs on their own data, and therefore enables them to further their research in a way that maybe was too expensive or too difficult to do without MariusGNN. So that's kind of the goal. And yeah, I mean, we've had a good number of people starting to use it on GitHub and, you know, posting questions and interacting with it.
So overall, I've been happy with how people have seen the system and the number of people trying to start using the system.
And it motivates us to keep making it better and adding more functionality, for sure.
Yeah, definitely.
I mean, it's really rewarding seeing your actual research be used by people, having that feedback loop, right? It motivates you.
Yeah, it goes back to that initial motivation for ML systems, right, where you get to build these things which are, you know, actually cloned and used.
Yeah, for sure. I mean, just on a tangent, I know for a lot of my work in my PhD, someone might read the paper one day, but the implementation is probably just going to stay dead somewhere on GitHub and no one is ever going to look at it again, which is pretty sad. So it's really nice that you have that feedback.
Yeah, for sure.
And cool, so I guess, you've been working on this project quite a long time. What's maybe the most interesting thing you've learned from working on it?
So I think the thing that stands out to myself and Jason as well,
I alluded to a little bit at the beginning,
but there's sort of this very interesting kind of interplay
and sort of even like trade-off almost between like algorithms
and implementation, right?
So for example, this DENSE data structure: the algorithm in itself is, you know, you're going to reuse the samples, okay; there's an algorithm to construct DENSE, there's an algorithm to use DENSE, right? And we knew those for maybe three months before we actually reduced the sampling time sufficiently. And the reason was that it was actually very difficult to implement those algorithms, the DENSE construction and the use of DENSE, well. It was very hard to implement them at high performance.
And we found that kind of consistent across a lot of things
where you can spend a lot of time on an algorithm and it has
good complexity, but it won't work actually because, you know, the implementation is very
important as well or vice versa. You know, you can just have a very simple algorithm, but
the differences in implementation can be enormous. So that's the thing that stood out to me: just the importance of implementation
to actually achieve high performance has been something I think that maybe goes under the
radar a little bit in some of these systems designs.
So even, you know, this isn't in the paper, right? Because it's not really new in that sense. But the implementation was very important in the end. That's something I found pretty interesting.
Yeah, I agree. I mean, a good idea implemented badly is just not going to get good performance, right? You can have the best idea in the world, but if you've got some hacky implementation of it, then it's not going to work.
Cool. Do you have any sort of war stories along that journey as well? Things you tried that failed, dead ends you ran into, that might be interesting for listeners to know?
I do have one more story. When we were originally implementing these new disk-based policies, we were getting really bad accuracy and we were very confused, and we were stuck for months, maybe like three months.
And in the end, what the issue was,
is that we were using the pread and pwrite function calls, right?
In C++.
And those have a limited block size, right?
So in other words, you can only read like two gigabytes or write two gigabytes.
And anything other than that, you basically just get like junk, like random bytes, right?
And that, I mean, we just missed it in the sort of like docs, right?
But that led to, you know, one of these silent failures in ML, where it's not an actual compile error or runtime error, right? The error just shows up as poor accuracy, so you don't know where it's coming from. There's no error you can look at; you just see poor accuracy, that's the only feedback you get, and you have to figure out where it's coming from. And this was one of those which took a long, long time, like, you know, months.
Oh, wow.
Yes.
And we weren't even really, you know, we didn't even really know it was a bug, right?
We just were like, man, maybe disk-based training doesn't work.
Yeah.
But it turns out that you can only read and write two gigabytes at a time, or whatever the max int32 is.
Oh, crazy.
You'll never forget that now. That'll be ingrained in your brain forever.
No, yeah, well, now we have pread and pwrite wrappers. You can call the pwrite wrapper with as much data as you want, and it'll break it up into two-gig chunks.
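(For the curious, here is a minimal sketch of that kind of chunked wrapper, written in Python for brevity even though Marius itself is C++. It assumes a POSIX system where a single pread/pwrite call transfers at most roughly 2 GiB, so the wrapper simply loops.)

```python
# Minimal sketch of chunked pread/pwrite wrappers (Python for brevity; Marius is C++).
# A single pread/pwrite call on Linux transfers at most roughly 2 GiB, so large
# buffers must be split into chunks rather than passed in one call.
import os

MAX_CHUNK = 1 << 30  # 1 GiB per call, safely under the ~2 GiB limit

def chunked_pwrite(fd: int, buf: bytes, offset: int) -> None:
    """Write all of buf at the given file offset, one chunk at a time."""
    view = memoryview(buf)
    while len(view) > 0:
        written = os.pwrite(fd, view[:MAX_CHUNK], offset)
        view = view[written:]
        offset += written

def chunked_pread(fd: int, nbytes: int, offset: int) -> bytes:
    """Read exactly nbytes from the given offset, one chunk at a time."""
    parts = []
    while nbytes > 0:
        chunk = os.pread(fd, min(nbytes, MAX_CHUNK), offset)
        if not chunk:
            raise EOFError("hit end of file before reading the requested bytes")
        parts.append(chunk)
        offset += len(chunk)
        nbytes -= len(chunk)
    return b"".join(parts)
```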
Nice, yeah.
That was rough, yeah.
That's a tough one.
Yeah, it's interesting. I obviously don't have that much experience in the ML world, but that silent failure where you just get bad accuracy, it's like, damn, how do you go back and figure out where in the whole pipeline this is coming from? Or whether it's just a bad approach?
Exactly, yeah.
Yeah, with this one we were almost ready to give up on the disk-based training, really, because it took so long and I looked at so many things. So eventually I got it, which was... I mean, I kind of thought it always should have been working better, so I was pretty surprised. Because it worked for the simple models in the original Marius, you see; it didn't work for the GNNs, but we had also rewritten a lot of code. Anyway, that was the worst one for sure.
Yeah, I bet you slept so well the night after you found it. Like, that was lovely.
Yeah, yeah. Cool. I guess your main work is all around Marius, so do you have any other research going on at the moment as well?
Yeah. So I think, you know, back to the very beginning, in general, I'm just trying to make
training more efficient. And there's a lot of options you have to do that, right? You can focus
on the system, you can focus on the model, you can focus on the data. I've done a little bit in
each of those pieces, focusing on the systems side with Marius and MariusGNN.
We have some new systems we're working on.
I've done a little bit in terms of just making the models more efficient in the sense of
compressing the model during training.
I did a little bit of work on that and using the sort of activations, the fact that the
activations are low rank.
And then more recently I've done a little bit on trying to compress the data. A lot of people have been working on data pruning these days: how can you train to the same accuracy using a subset instead of the full data set? So I've looked into that recently. But yeah, at a high level: how can we make training more efficient, you know?
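(As a toy illustration of the generic "activations are low rank" idea mentioned above, and not the specific method from this line of work: you can keep only the top-k singular factors of an activation matrix instead of the full matrix.)

```python
# Toy illustration of exploiting low-rank activations: keep only the top-k
# singular factors of an activation matrix. This is the generic linear-algebra
# idea, not the specific compression method discussed in the episode.
import numpy as np

def low_rank_compress(acts: np.ndarray, k: int):
    """Return factors (A, B) with A @ B approximating acts via a rank-k truncated SVD."""
    U, S, Vt = np.linalg.svd(acts, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k, :]       # store these instead of the full matrix

rng = np.random.default_rng(0)
acts = rng.standard_normal((1024, 32)) @ rng.standard_normal((32, 512))  # genuinely low-rank "activations"
A, B = low_rank_compress(acts, k=32)         # roughly 10x fewer numbers to store
err = np.linalg.norm(acts - A @ B) / np.linalg.norm(acts)
print(f"relative reconstruction error: {err:.2e}")   # near zero because rank(acts) <= 32
```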
Nice.
I guess, sorry, yeah: how can we make it like humans, right, where you just see one or two examples and then you know what a dog is?
Yeah, yeah. I guess the holy grail system factors in all these different areas as well: it combines stuff to do with the model, the data, the infrastructure, whatever, and brings all those things together. So what would be the most advanced system that brings all those things together at the moment? What is the state of the art? Or do most things just focus on one specific thing?
So, I mean, I think maybe in the news you've seen
Mosaic, right, and their recent sale. Their original idea, maybe, was to just focus on giving you a platform to efficiently train on your own data, and that platform had lots of tricks and efficient training and efficient implementations internally. And, you know, they've transitioned to LLMs a little bit, but they're still giving you the ability to train on your own data cheaply and efficiently.
So, you know, I mean,
the Holy Grail is kind of
maybe similar to that,
where you have some system
which just takes in user data
and somehow it analyzes it
and sort of efficiently decides first
what data it needs from that full data that the user provided.
What's the most important data?
Then it sort of efficiently and automatically decides what's the efficient model,
you know, most efficient model that I need to train on this data to achieve, you know, whatever the user wants to achieve.
And then you've sort of selected high quality data and a high quality
efficient model.
And then you have some sort of actual systems implementation that goes and
actually runs it well.
And some distributed trainings cluster or on a single machine or whatever.
Right.
But, you know, those,
those are difficult questions to sort of automatically decide what's
important data, what model will achieve good accuracy for low cost, et cetera, right?
We'll get that one day.
Yeah, we probably will.
I mean, that's kind of, you know,
it's not so far from what some of these new companies have been trying to do,
like even like Mosaic.
They had that software suite that kind of picked a set of model optimizations to use; I can't remember what it's called.
But cool. So yeah, my next question, Roger, is my favorite question of all. It's about the creative process, and I love seeing how people's answers to this question diverge. So the question is: how do you approach generating ideas, and then how do you determine which ones to select? What's your creative process?
Yeah, so for me, it's all about really simple examples
that you can really analyze really well.
So I like to make little prototypes in Python.
And then that's kind of like,
then you have the fork in the road where it's like,
okay, I did this little prototype, you know,
very simple. And then you sort of, is it going to work at scale? Is it worth sort of pursuing for
longer effort? So like I was recently working on a problem, which was like a little bit like of a,
can we do basically an alternative to backpropagation?
And so it was, you know,
can I implement this for one or two layers in Python?
No biases, just weights, just super simple dense layers. Like each layer is Y equals matrix times vector, W times X.
And then, you know, you can very quickly
dive very deep into that process, you know?
So can you beat backpropagation for this simple two-layer network?
And, you know, you can try very many things.
You can get very into the weeds.
But then you also sort of don't spend too much time, right?
It might be 100 lines of Python.
That's a very simple prototype.
And that allows you to, you know, say: no, we can't very easily beat backpropagation without having to go into transformers and C++ and all these complicated things.
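(To give a flavour of what such a throwaway prototype might look like, here is a hypothetical sketch, not the actual experiment: a two-layer, bias-free network y = W2 relu(W1 x), trained with plain backpropagation as the baseline any alternative update rule would have to beat.)

```python
# Hypothetical flavour of the kind of tiny prototype described (not the actual
# experiment): a two-layer, bias-free network y = W2 @ relu(W1 @ x), trained with
# plain backpropagation as the baseline an alternative update rule must beat.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))          # toy inputs
W_true = rng.standard_normal((5, 20))
Y = X @ W_true.T                             # toy targets from a linear teacher
W1 = 0.1 * rng.standard_normal((64, 20))     # layer 1 weights (no biases)
W2 = 0.1 * rng.standard_normal((5, 64))      # layer 2 weights

lr = 0.02
for step in range(2000):
    H = np.maximum(X @ W1.T, 0.0)            # hidden activations (ReLU)
    err = H @ W2.T - Y                       # prediction error
    grad_W2 = err.T @ H / len(X)             # gradient for layer 2
    dH = (err @ W2) * (H > 0)                # backprop through the ReLU
    grad_W1 = dH.T @ X / len(X)              # gradient for layer 1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

H = np.maximum(X @ W1.T, 0.0)
print("final MSE:", float(np.mean((H @ W2.T - Y) ** 2)))
```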
So that's kind of how I work on problems. I like to prototype, even if people have done it before, just to see the problem myself, see what's actually hard about it. But, you know, there are a lot of other things that are involved, right? How do you decide what problems to prototype, what problems to think about? That's a little bit random in some sense, maybe, but you get that from papers, from talking to Theo, my advisor, and talking with other people and stuff.
I've never had that answer before. They're all so great because everyone's answer is always so different. Yeah, that idea about prototyping stuff and feeling things out, getting an intuition for it and understanding the problem that way, that's such a great way of doing things. I really love it.
Yeah, I have this PyCharm project which is just called experimental, that's the overall folder, and then there's just like a thousand subfolders of random things I've tried. So many random things.
Oh, that's fantastic.
Great.
Anyway, it's time for the last question now, Roger.
So it's like, what's the one thing you want the listener to take away
from this podcast today?
So I think in one sentence,
maybe the main thing from Marius and MariusGNN is to first maximize the use of the resources you have before scaling out to more resources. I think that's the takeaway from Marius and MariusGNN in one sentence.
And, you know, we're not the first to say that, right, as you mentioned, like the COST paper, et cetera.
But I think it's, you know, it's important to sort of revisit that as we have these very complex distributed machine learning deployments.
Um, fantastic.
Uh, let's end it there.
And it's been brilliant talking to you, Roger.
If you're interested to know more about Roger's work, we'll put links to everything in the
show notes so you can go find it.
And again, if you enjoy the show, please do consider supporting us through Buy Me a Coffee.
And we'll see you all next time for some more awesome computer science research.