Disseminate: The Computer Science Research Podcast - Roger Waleffe | MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks | #37
Episode Date: July 31, 2023. Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MariusGNN works and just how fast it goes (and how much more cost-efficient it is!) Links: Marius Project | Roger's Homepage | Roger's Twitter | EuroSys'23 Paper | Support the podcast through Buy Me a Coffee. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
A reminder that if you enjoy the show, please consider supporting us through Buy Me a Coffee.
It really helps us to continue making the podcast. It gives me great pleasure to say that today
I'm joined by Roger Waleffe, who will tell us everything we need to know about Marius GNN,
Resource Efficient Out-of-Core Training of Graph Neural Networks.
Roger is a PhD student at the University of Wisconsin-Madison. Roger, welcome to the show. Hi, yeah, great to be here.
Thanks for having me. Cool. Well, let's jump straight in then. So can you maybe tell us a
little bit more about yourself and how you became interested in systems and machine learning
research? Yeah, sure. So as you said, I'm a computer science PhD student
at the University of Wisconsin-Madison.
I'm in my fourth year now,
so getting at least sort of halfway done maybe.
Working with Professor Theo Rekatsinas,
who was at UW,
and he's since transitioned to ETH Zurich
and now at Apple, but still working with him.
And yeah, at a high level,
sort of been focusing on resource efficient training of large scale machine learning models with a lot of sort of specific focus on graph neural networks. And I come from a physics
background. So in undergrad, I kind of primarily did physics. But then toward the end of undergrad, sort of my senior year,
started transitioning towards CS because I kind of always like to program and you have this sort
of creativity in CS that you don't necessarily have in physics, right, where you get to sort
of design and build your own ideas. So yeah, so CS sort of kind of became my sort of main area of focus toward the end of undergrad.
And then, you know, once you sort of get in CS, obviously, machine learning these days is very exciting.
You know, it's becoming sort of more and more mainstream and sort of lots of machine learning models.
And sort of the models are continuing to get bigger and bigger and more expensive to train and to run, as we all know. So that makes, you know, machine learning systems and making training and inference
of these machine learning models sort of a very exciting and very timely topic.
And it also gives you sort of a nice machine learning systems, give you a nice sort of like
middle ground in CS where you get to sort of, in some sense, in my opinion, maybe have the best
of both worlds where you're sort of like building software, a lot of programming, a lot of building things that
others can use, but also you're sort of working on the algorithmic side of machine learning models,
which, you know, transformers, et cetera, that everyone's very excited about these days.
So you get to be involved in everything. And so, yeah, that's kind of, you know,
kind of evolved naturally, but sort of over time and undergrad and at the end of undergrad, just started getting more and more excited about kind of the machine learning with how the times were going.
And also just because, you know, I love to program and love to sort of get my hands dirty. So that's kind of the long-winded story. Um, you know, I don't know, like, maybe no Eureka moments,
but,
uh,
yeah,
just over time,
it's gravitated toward that field.
Awesome stuff.
Yeah.
So you said you're in your fourth year at the moment.
So,
I mean,
how much road is left to run there?
So you've got another year,
two years?
Yeah.
Something like that.
I'm trying to start thinking about finishing up here
in the next year or so, but we'll see how it goes. It's not, you know, there's no hard deadline and
there's no major rush. Well, let's get stuck into Marius GNN. So, I mean, I think I'm pronouncing
it right, it is Marius, right? Cool. Yeah. Um, so what's the backstory there with Marius? What is Marius,
like, what has it got, some meaning? I think Greek god or something. Yeah, it's a Roman general, I think.
Okay. It's, uh, yeah, so the name was chosen by some of the co-leads sort of
on the Marius project, Jason Mohoney. And so there have sort of been two works, an original work,
which we called like the Marius system,
and then Marius GNN is sort of a follow-up and an extension,
which kind of extends the original work to GNNs, graph neural networks, specifically.
Okay, cool.
So let's kind of dig into some background and set the listener up there
for what we're going to be talking about today.
So can you tell us a little bit more about kind of graph neural networks, GNNs, kind of give us some backstory there?
So I think the best place to start is sort of just like what is a graph, right?
So basically, the idea here is that we have sort of the input to these graph neural networks is a graph, which is a really sort of abstract and general data structure. But basically, you have some set
of nodes, which are like objects. And then you have some set of relationships, which connects
those nodes together, and we call those edges. So you have this graph, which is made up of nodes
and edges. And again, yeah, very flexible data structure, and often sort of the best representation
for a lot of forms of data. So for example, you know, some classic
graphs would be like a social network where you have sort of nodes are people or businesses,
et cetera, like on Facebook. And then you have sort of, you know, the edges are friendships
between them or something. But, you know, beyond sort of that sort of standard example, you have
things like road networks, right? cities connected by roads, you have even
more sort of scientific and maybe complicated graphs like molecules, you know, made up of atoms
and bonds, chemical bonds between them. So sort of, you know, a wide variety of things that can
be represented as graphs. And then often these graphs have have information associated with them.
So for example, the nodes in the graph can have some associated data, which we usually call in machine learning, features, right?
So, for example, the cities in a road network may have some information about traffic laws there or whatever, density, traffic, things like that.
And you can sort of group all that information for
each node into some sort of something that you'd call it a feature vector. For example, again,
for another example, like social network, you might have information about the business information
about the person, things that they like, things they don't like, which may, you know, give you
some valuable information about who they might be friends with, things like that.
So these are sort of additional information that you include in the graph, along with the nodes and the edges, these feature vectors for all the nodes. And basically, at a high level, that's sort
of the input that you're working with when you're working with graph neural networks. And then once
you sort of have that input, graph neural networks are just basically a specific type of neural network that's designed to operate over that input.
So just like you have, you know, convolutional neural networks designed to operate over images and, you know, transformers for text, you have graph neural networks for that sort of input data structure, which is made up of these nodes, their feature information, and then the edges between them. And, you know, it's sort of, again, a mirror
of a standard neural network where once you have that input data, the idea is to sort of
successively transform that input data through a series of layers into sort of more meaningful
representations so that you can do some sort of downstream task. So yeah, happy to give a little bit more detail
about exactly what those layers do,
which I think will be useful.
Yeah, I have a quick question
before we go into those a little bit deeper.
On the feature vectors,
you said that they're associated with the nodes.
Do the edges also have feature vectors
or do you have to put that information in the node?
Yeah, you can definitely put feature vectors
on the edges as well. That's sort of like, uh, you know, GNNs version two in some sense, right? The
simplest and kind of the main sort of set of standard benchmarks usually just has node
information, but the edges can have feature vectors as well, for sure.
Yeah, so I interrupted you when you were talking about the different layers you can
have and stuff. Yeah, no problem. No, I think
it's good to take a pause after you sort of
have the input in mind
and the basic idea, because
the layers are a little bit more complicated, but
basically what the idea is that
each GNN layer is going to take
the sort of vector representations,
these feature vectors for all
the nodes, and then the graph structure, which are the edges. And basically, you're going to use that
information to produce new vector representations for each node, which are sort of like higher
level representations. And those higher level representations, you create them by combining
the sort of original vectors, the original features of each node with the features of the neighbors of
that node. And the neighbors here are defined by nodes which are connected to each other by edges,
through edges. So, you know, it's usually helpful to give an example. So the kind of common example
that I've given in some talks is, you know, consider a graph that is made up of maybe the states of the United States.
And, you know, the node that we care about is the node for the state Wisconsin, right? And then we
have maybe an edge to neighboring nodes, Illinois and Minnesota, right? Neighbors of Wisconsin
geographically. And then maybe we have some base set of features for each state, right? So for one
for Wisconsin, one for Illinois and Minnesota could be, you know, dependent on whatever the
downstream task is, you know, population, things like that. But let's group all the features
that we have for each of those three nodes into some vector representation that we usually call
h zero. So the zero here, meaning that it's sort of the first representation of that node.
And then, you know, we label that feature vector as H. And then what the first GNN layer is going
to do is it's going to compute a higher level representation for the node Wisconsin, which
we're going to call H1. So the one being sort of after the first layer. And you compute that
higher level representation by aggregating the original
features that you had for the state Wisconsin H0 with the original features that you had for
Illinois and Minnesota, the H0 for those two nodes. And the aggregation function is really like the
sort of neural network layer. And that's what has sort of learnable weights, parameterized function
that gets updated through gradient descent, just like a standard neural network layer. And so for example, you know, that aggregation function
could be, you sort of sum the three feature vectors, three H zeros together, and then you
multiply by a weight matrix, which is learnable, a parameterized weight matrix. And then that gives
you the H1 for the node Wisconsin, for example. So that's kind of the idea. You have these original feature
vectors for each node, and then you sort of transform them into new feature vectors for
each node at each GNN layer. And that transformation is done by running the sort of feature vectors
through some parameterized aggregation function and combining them with the feature vectors
of the neighbors. So the idea is to sort of take your information that you have
about your node and to sort of augment it and transform it using the information from your
neighbors. And then once you have that sort of H1 for the node Wisconsin, you can compute H1 for
Illinois, Minnesota, all the other states, for example. And then you can sort of, again, stack
another layer on, compute sort of H2 for the node Wisconsin if you wanted to, and so on.
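To make that layer concrete, here is a toy sketch in PyTorch of the sum-then-transform aggregation described above. The dimensions, the random feature vectors, and the plain sum aggregator are purely illustrative; real GNN layers such as GraphSage add normalization and nonlinearities.

```python
import torch

# Toy version of the Wisconsin example: one GNN layer that sums a node's features
# with its neighbors' features, then multiplies by a learnable weight matrix.
feat_dim, hidden_dim = 4, 8
W = torch.nn.Linear(feat_dim, hidden_dim, bias=False)  # parameterized, learnable weight matrix

h0_wisconsin = torch.randn(feat_dim)  # base features (H0) for Wisconsin
h0_illinois = torch.randn(feat_dim)   # H0 for neighbor Illinois
h0_minnesota = torch.randn(feat_dim)  # H0 for neighbor Minnesota

# H1 for Wisconsin: aggregate its own H0 with its neighbors' H0, then transform.
h1_wisconsin = W(h0_wisconsin + h0_illinois + h0_minnesota)
```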
And you can keep doing this until you get some higher level representation that you're sort of happy with, right, which has learned enough information from yourself and from your neighbors.
And then you can use that sort of output, call it like H10, for example, for some sort of downstream task like node classification, right? So then you could use it for prediction, just like you would the output of a convolutional neural network,
right? For image classification, you can use the output of all the nodes in the graph,
like all the higher level representations for all the nodes to do some sort of graph classification,
just by sort of combining the features of all the
nodes in the graph together. Or you can do some sort of edge prediction task
if you wanted to.
So you have sort of maybe two nodes
and you have their two vector representations,
two higher level vector representations,
and then you sort of use those two vectors
to decide something about,
is there an edge between these two nodes?
Is there not an edge?
Things like that.
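As a tiny illustration of that edge-prediction step, one common decoder scores a candidate edge by comparing the two nodes' output representations; the dot-product-plus-sigmoid choice and the random vectors below are just one simple example, not necessarily the decoder used in any particular system.

```python
import torch

h_a = torch.randn(64)  # final GNN representation for node A (e.g. H2 from a two-layer GNN)
h_b = torch.randn(64)  # final GNN representation for node B

# Score the candidate edge (A, B): higher means "more likely to exist".
edge_score = torch.sigmoid(h_a @ h_b)
```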
So that's basically the gist of
GNNs. And I have a question on the graph classification at the end. So is that basically
sort of saying this graph is... I don't know, kind of in the context of the example of the states of
the US, like, what would I be classifying, basically? So maybe a better
one for that example is some sort of molecule, where you have, you know, sort of a molecule representing a graph and you want to sort of predict some information about that molecule.
Does it interact with this drug?
Does it, you know, could it be used as a candidate to, you know, sort of like drug discovery, those sort of things.
That's maybe a better example for graph classification. Node classification, a good example for that is like
citation network. You have, like, all the arXiv papers. And so each node is a paper,
and then they're connected to other papers based on the citations. And you want to sort of do like
automated grouping. So these are CS papers, these are machine learning papers,
these are physics papers, that sort of thing. Edge prediction, which is often called link
prediction as well. The common example there is you want to sort of discover new edges.
So you sort of, you want to predict new friendships, right? And so Facebook wants
to predict who you should be friends with. You want to, again, you can do sort of drug discovery
if you have sort of a different graph, but now each node is like a molecule and you have, like,
known interactions between molecules based on prior experiments, and you want to sort of predict
unknown interactions, things like that. So, all of these things, yeah, there's a lot of applications.
Awesome stuff, that's great. So yeah, let's dig into kind of the background for Marius
GNN. And so you said a lot of it is that doing this training in a
resource-efficient manner is quite challenging for large-scale graphs. Why is that? And kind of
how big are these graphs? When does it become a challenge?
Yeah. So I think, so now that we've kind of like established the notion of what we're
working with, the data that we're working with, sort of the key, the key first issue to realize
is that we have these feature vectors for every node in the graph, right? At the very least,
maybe even for the edges, as you mentioned. And then that means we sort of have to store them.
And then we have to sort of store feature vectors for very large graphs.
And that's really where the sort of start of the challenges comes in, right?
So for example, large scale graphs these days have billions of edges and nodes.
You can take one concrete example, which is the hyperlink graph from the Common Crawl web
dump from 2012.
So basically, each node in that graph is a web page.
And then the hyperlinks
between web pages give you the edges. And this is sort of even, you know, 11 years old now.
But even that hyperlink graph already had billions of nodes, 3 billion nodes, to be exact, and then
128 billion edges. And so if you say, okay, well, I have to store some sort of feature vector for
each of those 3 billion nodes, you know, if it's a sort of 100 dimensional or 1000 dimensional vector, then you have sort of,
you know, 100 or 1,000 floats times 3 billion, right? So you have a large amount of data to store.
I think we used 50 or 100 when we were training with it. And we had like 3.5
terabytes of total data just to store the input, which means storing basically the node features and the edges.
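As a rough back-of-envelope check of those sizes (the feature dimension, float32 features, and int64 node IDs below are assumptions for illustration):

```python
nodes = 3_000_000_000        # ~3 billion web pages
edges = 128_000_000_000      # ~128 billion hyperlinks
feat_dim = 100               # e.g. 100-dimensional float32 features

feature_tb = nodes * feat_dim * 4 / 1e12   # ~1.2 TB of node features
edge_tb = edges * 2 * 8 / 1e12             # ~2.0 TB of (src, dst) pairs as int64
print(feature_tb, edge_tb, feature_tb + edge_tb)  # on the order of the ~3.5 TB mentioned above
```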
And then once you're sort of working with that kind of data size, then that gives you sort of a host of challenges of sort of systems challenges and sort of training challenges.
The first being that, you know, it exceeds the memory capacity of any GPU
accelerator that we kind of currently have on the market. So, you know, you can't store the graph in
GPU memory, and so you either have to sort of offload to CPU memory, but even CPU memory,
you know, getting 3.4 terabytes of CPU memory, that's also a challenge, right? So you
potentially even have to go even further where
you have to scale out to sort of a distributed CPU memory, or sort of a more complex single
machine. And then at the same time, though, you can't just sort of abandon the GPUs completely,
because you do need to sort of use those things to do the matrix multipliers that are in all the
sort of GNN layer transformations.
So very quickly, you are sort of forced to have some sort of mixed CPU,
maybe even a distributed CPU, GPU training environment,
where you have to sort of move data between sort of all hierarchies,
all memory hierarchies, and sort of even between machines, between CPU and GPU,
et cetera. And that sort of introduces kind of some of the key challenges.
How do you go about doing all that sort of data movement and et cetera? So at a high level,
that's the sort of main challenge, right? Is that very quickly you get a larger amount of data with these graphs and that kind of causes you to, you're forced to use these sort of complicated
full memory hierarchy architectures, etc. Yeah, nice. So kind of given that then,
given that sort of the lay of the land there, what are the kind of state-of-the-art current
solutions out there in a distributed fashion? How are they architected and kind of what are
the problems with them? Basically, they sort of opt for that architecture that I just described, which is basically,
you know, you can start in the GPU memory for small graphs, and then you sort of fall
back to CPU memory for larger graphs.
And then for even larger graphs, you go to sort of a distributed CPU memory where you
have maybe the graph partitioned across a sort of set of CPU machines.
And then you sort of create, you know, mini batches for training in this sort of CPUs,
in these distributed CPUs, and you sort of transfer them to the GPUs for the sort of model computation.
But we found sort of the existing solutions at the time when we were sort of starting this Marius GNN work, primarily sort of distributed DGL or DGL, we found that they were pretty expensive.
And basically, there's sort of two main reasons for that. Because the first is that no matter
what you do, when you make this sort of decision to just sort of scale out and get a lot of like
sort of distributed CPU memory to scale to these large
graphs, then you sort of are forced to pay for those resources no matter what. Right. So if you
think about like on AWS, the sort of standard GPU machines that have like V100 GPUs, they only
have maybe a maximum of 500 gigabytes of memory. So to scale to the hyperlink graph, you would already need
on the order of 10 of those
machines to get into this sort of three or four terabyte range. And each of those is already,
you know, $24 an hour. So you're very quickly paying many hundreds of dollars per hour.
And then the second issue is that you sort of have low GPU utilization as sort of by default in these sort of complex architectures because
you have this sort of distributed CPU memory, and then you have to sort of transfer, create
mini-batches across these multiple machines, which may require communication between the
machines, and then you have to transfer those mini-batches to the GPU, etc.
So we found that these systems, not just DGL, but PyTorch Geometric and some of
the other popular ones, they have low GPU utilization, maybe, you know, 30, 40%.
And that GPU utilization gets even lower as the deployment becomes more complicated. So as you
scale to larger graphs and you have multiple machines and more machines, then the GPU utilization sort
of slowly continues to drop, you know, down to maybe 30, 20%. So basically, in the end, you end
up having a lot of these expensive resources that you're paying for, which aren't fully utilized.
So it's sort of not a great, it's not a good use of the money in some sense, right?
And then that's sort of the,
that's kind of the kind of key problem potentially with the existing solutions.
Nice, yeah.
I mean, at $24 an hour,
I know why Jeff Bezos is so rich, right?
I mean, that's just crazy.
There are some, like Lambda AI, trying to come along
with some of these cheaper cloud-based VMs.
But yeah, generally those costs add up quickly for sure.
Yeah, yeah.
That's definitely something you wouldn't want to leave running
longer than you had to as well.
Exactly, exactly.
You want to minimize the time that you have to run those things.
So I guess kind of with all these problems in mind,
you revisited this question of when do we need distributed training for GNNs?
So I guess what's the answer?
So the idea behind Marius GNN is to sort of first minimize the costs by optimizing as much as you can on a single machine. And then, if you have all of the resources fully utilized on a single machine and you still need to further reduce runtime, that's when you need distributed training.
But only then, right?
So, and then the question in Marius GNN is basically, how do you sort of fully utilize the resources on a single machine?
That's kind of the key question.
Yeah, it reminds me of the COST paper, right?
And I think that was focusing on stream processing systems,
was it?
Or was it graph processing systems?
I can't quite remember.
Yeah, it was graph processing,
but sort of like standard graph algorithms,
like PageRank, those sort of things.
Yeah, yeah.
Graph algos, right?
Yeah.
Cool.
So let's talk about Marius GNN.
So tell us, how does it work?
So maybe we can kick things off with the system architecture,
and then we can talk into some of its key features.
Yeah.
So I think maybe I'll spend a little bit of time
talking a little bit more about a little bit of background
of sort of what are the key challenges first
once you move off the GPU, right?
So we talked about how we have these sort of node features for every node, and then,
you know, those can sort of quickly exceed the GPU memory capacity. So I'll just give maybe a couple
minutes of, like, you know, once you put those in the CPU, what do you have to do, and then that'll sort
of lead us into the Marius GNN system
architecture. So once you sort of move your graph and the node feature representations,
and you sort of store those in CPU memory, then you have to sort of do this mixed CPU,
GPU mini batch training. And basically, the idea is that you start with some sort of mini batch,
which is sort of like a set of training examples that you sample from the graph.
So those can be a set of sort of maybe nodes, or they can be edges, depending on the task. But let's consider it's just a set of nodes. So you say, okay, I'm going to do, you
know, node classification for these set of nodes. So I have some labels for them. And then, you
know, I have their base, their original features. And so what do you need to do? So first you need to sample some of their neighbors
so that you can sort of have these required information
for the GNN layers, right?
The GNN layers need to know some of the neighbors
to be able to do these aggregation functions.
And then once you sort of have those neighbors sampled,
you can then get the sort of initial H0 feature vectors
for all the original nodes that you
started with in your mini batch, as well as the neighbors, right? You need the feature vectors
for the neighbors. And that whole process is sort of like mini batch preparation.
And you need to do that all on the CPU because that's where your data is stored.
And only once you've done that, then you can transfer all of that information to the GPU
to actually be able to do the model computation.
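As a minimal sketch of that CPU-side mini-batch preparation followed by the GPU transfer (the function and argument names are illustrative stand-ins, not MariusGNN's actual API):

```python
import torch

def prepare_and_ship_minibatch(batch_nodes, features_cpu, sample_neighbors, device):
    # 1. On the CPU: sample the (multi-hop) neighbors of the batch nodes.
    all_nodes = sample_neighbors(batch_nodes)     # index tensor: batch nodes + sampled neighbors
    # 2. On the CPU: gather the H0 feature vectors for all of those nodes.
    h0 = features_cpu[all_nodes]
    # 3. Only now move the prepared mini-batch to the GPU for the forward/backward pass.
    return all_nodes.to(device), h0.to(device, non_blocking=True)
```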
So that kind of, you know, we've been kind of hinting at: once you move off the GPU, you have all of these sort of complex steps that need to be done on the CPUs.
And then this data movement, which can limit the GPU utilization.
But that's just the idea in a little bit more detail. And then, you know, the kind of key other thing there is that this neighborhood sampling
is sort of specific to GNNs, right?
So every time you want to compute the GNN output for a specific node, you need to go
and sample the neighbors.
So that's sort of a new thing in GNNs compared to maybe like image classification, for example.
And it's important to note that, for these large
graphs, it's often required to actually sample the neighbors. In other words, not use all the neighbors of a
specific node, because for really large graphs, you know, one node, think about like the Twitter
graph with, like, you know, users and then their followers, some nodes like, you know,
Obama or whatever, Elon Musk, right, they're going to have like millions and millions of followers.
So you can't use all of their neighbors because then just their neighbors alone are like too big for the GPU memory, right?
So this is sort of the difficulty in a little bit more detail about, you know, once you move off the GPU is you have this data movement issue that you have to sort of, you know, create these things
off device and then transfer these mini batches. And then creating these mini batches themselves
is not necessarily trivial because of this neighborhood sampling step where for all the
nodes in the mini batch, you have to like sort of look up their neighbors, maybe do some random
sampling process, etc. So with that in mind, the sort of key ideas in Marius GNN are to sort of
first make that mixed CPU GPU minibatch training as efficient as possible, right? We want to sort
of maximize the GPU utilization. We want to get 100% GPU utilization, which means the GPU is always
sort of busy computing the forward pass and the backward pass of the GNN,
which means we need to figure out how to minimize the overhead of this data movement and this
mini batch preparation on the CPU as much as possible. And then the
second key idea is that once we have an even larger graph, right, which sort of exceeds the
memory capacity of the CPU, then we don't want to go to multiple
CPUs, distributed CPU memory, because then we're going to need to sort of coordinate mini batch
preparation and data movement between CPUs themselves before having to transfer to the GPU.
So instead, we're going to take a different approach, which is we're going to store
the sort of even larger graph that exceeds
the CPU memory capacity, we're going to store that on disk. And therefore, we don't have to
actually pay for any more CPU resources to scale beyond the graphs that fit in CPU memory.
So those are sort of the key ideas. And then those kind of give you naturally,
the system architecture in Marius GNN, which is basically sort of a
full memory hierarchy and out of core pipelined training architecture, which means that we sort
of use disk, we use CPU memory, and we use the GPU memory and the GPU, so three layers of the
hierarchy. And there's sort of two pipelines. So there's
sort of the first piece, which is the storage layer, which is basically responsible for
transferring data from disk to CPU memory, and sort of pipelining and overlapping data
movement there. And then there's sort of the CPU to GPU memory piece of the architecture,
which is responsible for, you know,
creating mini batches in the CPU
and then transferring them to the GPU
and sort of optimizing that piece of the architecture,
pipelining that,
making sure the mini batch preparation is fast,
making sure the data movement is pipelined, et cetera,
so that you can sort of maximize your GPU utilization.
And then, yeah, we usually kind of call that piece the processing layer, which is, yeah,
responsible for the sort of mixed CPU, GPU, minibatch training. And basically the idea for
each of these pieces is that you sort of break the architecture into sort of a set of stages.
So for example, if we think about the CPU to GPU stage, the processing layer,
in order to sort of, you know, one piece of the puzzle for maximizing GPU utilization is to sort
of pipeline all of these, these pieces of the processing layer, so that you can sort of overlap
all of the steps and sort of overlap data movement and mini batch preparation with the
mini batch computation on the GPU. So you basically separate all of these sort of higher level pieces of the system architecture,
the storage layer and the processing layer.
You separate their individual components into stages, pipeline stages, and then you connect
them by queues and you give each sort of stage its own set of workers.
So for example, for the processing layer, you might have
a set of CPU workers, which is responsible for creating mini batches on the CPU,
you might have another set of workers, which is responsible for transferring mini batches from
the CPU to the GPU. And then you might have another set of workers, which is actually the
GPU worker doing the computation themselves. And then each sort of set of workers can in parallel, you know, create a mini batch while
another one is in parallel transferring the previous mini batch to the GPU. And they can
just read and write their input and output to a sort of set of queues. So they, you know, they don't have
to worry about sort of synchronization, right, sort of a standard kind of like multi-consumer, multi-producer, you know, pipeline.
And this is sort of the first piece of the puzzle
to sort of maximize GPU utilization.
So that's kind of the high-level overview
of the system architecture.
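Here is a minimal sketch of that staged, queue-connected pipeline; the stage names, worker counts, queue sizes, and callbacks are illustrative knobs, not MariusGNN's actual configuration or API.

```python
import queue
import threading

prep_to_transfer = queue.Queue(maxsize=4)  # bounded queues give natural back pressure
transfer_to_gpu = queue.Queue(maxsize=4)

def prep_worker(sample_minibatch):         # CPU stage: neighbor sampling + feature lookup
    while True:
        prep_to_transfer.put(sample_minibatch())  # blocks (sleeps) when the queue is full

def transfer_worker(to_device):            # host-to-GPU copy stage
    while True:
        transfer_to_gpu.put(to_device(prep_to_transfer.get()))

def gpu_worker(train_step):                # GPU stage: forward + backward pass
    while True:
        train_step(transfer_to_gpu.get())

# Each stage gets its own pool of workers, e.g. more CPU prep workers than transfer workers:
# for _ in range(4):
#     threading.Thread(target=prep_worker, args=(my_sampler,), daemon=True).start()
```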
Maybe that was a lot said at once.
I don't know if you have any questions or...
I do have one question.
So I'm kind of imagining there's
kind of a production line in my head, right? Exactly, a conveyor belt, right? Exactly.
Do you have, like... what's the communication like between the
different stages? Is there like an overseer slash coordinator that's sort of informing these workers
to speed up or slow down or whatever, to kind of keep the whole system balanced? Or is that not
really... Yeah, there's no coordinator to do that, but there are ways to sort of
throttle the pipeline at different pieces. So the way we do it is, like, in the configuration file
when you sort of launch Marius or Marius GNN, you basically specify the number of workers for each
of these stages, and then you also specify
the queue sizes, sorry, the queue sizes between each stage. So for example, you know,
if the CPU mini batch preparation workers are, you know, really fast, and they're, you know,
putting a ton of stuff on the queue that's for the transfer worker, you know, once that queue
fills up, according to some size that
you set, then the mini batch preparation workers have to wait, right? So they can't keep
creating mini batches and putting them on this queue, and then they just, you know,
they just sleep, basically, right, until there's space again. So there's back pressure
there from the queue saying, like, slow down a little bit, give me some information? Okay, right.
Yeah, and then there's also a sort of like global back pressure, which is there's a global limit
on the number of mini batches that can be in the pipeline at once, right? Okay. So that's, and
that's, yeah, that's useful for sort of like staleness issues, which are... So, I mean, maybe this
isn't really... I'm going to go off on a tangent a little bit here. But is the resource pool that each of these stages has,
is that dynamic at all, or is it fixed?
Is it up front, it's a config parameter you set,
or is there a way for that to be dynamic at runtime and change?
Yeah, so right now it's fixed in the config up front,
and it's even sort of fixed without necessarily any information
on how to set that, right?
It's sort of like, yeah, there's no like magical, exactly.
But we have, you know, thought about and even explored a little bit making it dynamic and making it dynamic is certainly doable.
It's just, you know, it hasn't made the top of our priority list, but you can definitely imagine dynamically measuring the size of these queues,
you know, figuring out which one is the bottleneck,
reallocating resources to sort of,
you know, reduce the bottleneck, et cetera.
And, you know, that's something
that can be done for sure.
Yeah, one for future work.
Yeah, exactly.
Yeah, so let's go into it
in a little bit more detail.
And so there's this data structure you use in Marius called DENSE.
So what does the acronym stand for?
I'm guessing it's an acronym, right?
And then how does it work?
Yeah, it stands for the Delta Encoding of Neighborhood Samples.
Nice.
As we've been discussing a little bit,
you have this mixed CPU-GPU mini-batch training.
That's sort of the first piece of the puzzle, once you have a large graph that exceeds the GPU memory capacity.
And when we've sort of said a little bit, you know, we store the graph, the nodes,
their feature vectors in CPU memory. And then you sort of sample mini batches, which basically,
think of that, again, just as a set of nodes. For each node,
then, you need to sample their neighbors. You're still on the CPU, right? Because you still need
to sample the neighbors on the CPU first, so that you can get their feature vectors from the
CPU before you transfer to the GPU. And so basically, this dense data structure comes in
in that piece of the puzzle, in this mini batch preparation process.
And the reason it's sort of needed is that that neighborhood sampling step is very expensive. It can take 1200 milliseconds or something to sample the neighbors.
And it only takes 100 milliseconds maybe to do the computation on the GPU.
So even if you have this pipeline where you're sort of allocating more workers to do the mini batch processing and the neighborhood sampling,
you still have this sort of large differential to where even with, you know,
a lot of workers running in parallel, it's going to be hard to saturate the GPU, because the GPU is
by far not the bottleneck, the CPU preparation, and even the data transfer is much slower than
the GPU computation. So in order to sort of, you know, saturate the GPU, it's more than just
pipelining, and you really need to bring down the neighborhood sampling time as much as possible.
And why is this neighborhood sampling sort of so slow?
Well, fundamentally, these GNNs are up against like a sort of exponential scaling issue.
So if you have a multi-layered GNN, you need to actually sample a multi-hop neighborhood,
which grows exponentially in size. So what do I mean by that? So for example, if we have a
two-layered GNN, that means we want to compute H2 for some node, right? Let's say node A.
We start with node A in our mini batch. And we want to compute like the output
of a two layer GNN for that node to do like node classification, let's say. So that means we want
to compute the second layer representation H2, as we've kind of been calling them. And the
representation H2 for node A depends on the representation H1 for node A, but it also depends on the representation H1 for the neighbors
of node A. And all of the representations H1 are also, you know,
sort of computed representations, right? Which means that the H1 representation for node A
depends on the H0 representation for node A.
That's fine.
But the H1 representation for the neighbors of node A, which we need to compute H2 for
node A, those representations depend on their neighbors, right?
So in other words, if we have a node B, which is a neighbor of node A, we need the neighbors
of node B in order to compute H1
for node B, right? So this is sort of like a two hop neighbor of node A, you can think of, right?
You go from node A to its neighbors, to their neighbors. And that's how you compute sort of a
two layer GNN. And you can sort of, you can think of this as, you know as it continues as you go.
So if you have a three-layer network, three-layer GNN,
you want to compute H3 for node A, for some node A,
then you need the three-hop neighborhood of node A.
So when you're doing the mini-batch preparation,
you need to know this ahead of time, how many layers you need to compute.
And therefore, that tells you how many sort of hops away from
these nodes you need to sample. And these hops, these multi-hop neighborhoods grow exponentially,
right? So because if you start with, let's say, you know, 10 nodes, and then you sample 10 neighbors
for each of those nodes, well, then now you have 100 nodes. And then if again, you sample sort of
10 neighbors
for each of those 100 nodes, well, now you have 1000 nodes, right? So you sort of very quickly
have this exponential increase in the sort of number of nodes which are participating in this
mini batch, because of how these multi hop neighborhoods are constructed. So yeah, so just
to reiterate, right, these, these multi layered GNNs, they require these multi-hop
neighborhoods, which are constructed sort of step by step, right?
You start with a set of nodes, you sample their neighbors.
And then the second step is to sample the neighbors of the first set of neighbors, right?
And so on.
And this sort of whole set of neighborhood sampling, this is really sort of why it's
slow and why it's sort of the kind of key bottleneck because of this exponential explosion and because this is sort of a very complicated process, actually, in the end.
So basically, that becomes, yeah, that becomes a big bottleneck.
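To see where that blow-up comes from, here is a deliberately naive multi-hop sampler of the kind the baselines effectively perform; `graph` is assumed to be a plain adjacency-list dict, purely for illustration.

```python
import random

def naive_multihop_sample(graph, seed_nodes, fanout, num_hops):
    """Re-samples one-hop neighbors for every frontier node at every hop,
    so the frontier grows roughly as fanout ** hop (10 -> 100 -> 1000 -> ...)."""
    frontier, hops = list(seed_nodes), []
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            neighbors = graph[node]
            k = min(fanout, len(neighbors))
            next_frontier.extend(random.sample(neighbors, k))
        hops.append(next_frontier)
        frontier = next_frontier
    return hops
```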
Like I said, you have PyTorch Geometric,
it's like over 10 times slower
than the computation itself. Same thing with DGL. Does that make sense? It's a little hard to see without these
sort of, like, diagrams, but hopefully... Yeah, no, it amplifies, right? Like, it's kind of an
inherent property of graphs. I mean, there's always that kind of example of, like, in only six hops
you can go across the whole graph. Exactly, right. So it's the same sort of problem manifesting there, in the sense that the more layers you add, if you want to go to, like, H10,
then you're gonna have, like... it's gonna explode, right? Yeah, you're gonna have the whole graph.
Exactly. After a few layers you have so many nodes participating in this multi-hop neighborhood
that the sampling time really just blows up.
Yeah, so once you sort of have that in mind, then the sort of key idea behind Dense is to recognize that in this sort of
expansion, as you're sort of starting from a small set of nodes and you're expanding out
in the graph, as you sample more and more hops away from those nodes, the key observation is
that you're going to return to nodes that you've already seen.
In other words, because of how the graph is structured, you know, the neighbors of some node,
the two hop neighbors of some node may be a one hop neighbor of another node, right?
And that's the sort of key sort of observation that we try to take advantage of with Dense.
Another way to say that is when you do these sort of multi-hop sampling in existing systems,
you perform a lot of redundant one-hop sampling to construct these multi-hop neighborhoods,
right? Because these multi-hop neighborhoods are constructed by first sampling the one-hop
neighbors for a set of nodes and then sampling the one-hop neighbors for their neighbors, and so on, you know, sort of just building one hop at a time. And the observation that
we're talking about here is you sort of build the one-hop neighborhood of the same node multiple
times, because it reappears at different pieces in this overall multi-hop neighborhood. You basically
are expanding the one-hop neighbors of the same nodes many, many times, and that's a lot
of redundant work. So that's what we're going to try to get rid of with Dense. I'll try to give, like,
just another quick example of this redundancy. Let's say we have two nodes, A and B,
and, again, we're trying to sample their two-hop neighborhoods. For both of those nodes, now we
have a mini batch, they're both part of our mini batch, and we want the two-hop neighborhood.
So if we sample the one hop neighbors of node A to start with, let's say that it happens to
have a neighbor, B is its neighbor, right? So now, in the second step, when we're sampling
the neighbors of the neighbors of A, we're going to sample the one-hop neighbors of node B.
But we would have already done that in the first step
because we needed the one-hop neighbors of node B
in the first step, right?
So these nodes reappear,
and you sort of end up sampling one-hop neighbors
for the same nodes many, many times
in these multi-hop neighborhoods.
Does that make sense without any figures?
It's a little hard
to follow these sort of, like... But it makes sense when you think about a graph, right? If
you're traversing a graph, but you're traversing from many nodes at once. Like, think about doing
a random walk, but you're starting, like, 10 random walks from 10 nodes. Those
random walks are likely to sort of, like, maybe,
you know, come across each other, right? At which point you can sort of reuse the old random walk
from that point, or you can sort of start your own new random walk again. So yeah, the idea behind
Dense is to sort of minimize this redundancy of sampling one-hop neighbors for the same nodes. And the way to do that is just
to cache and reuse those sampled one-hop neighbors for nodes that you've
already seen earlier during this multi-hop neighborhood construction. And that's the high-level
idea. Nice, that makes total sense. Yeah, yeah. And so basically Dense is this data
structure that we use to do that caching and reusing and tracking,
because you need to sort of track which nodes you've seen while you're constructing these multi-hop neighborhoods
and you need to track their one-hop neighborhoods.
And basically Dense is the data structure to sort of do that.
And the way it does that is by sort of, again, building these multi-hop neighborhoods sort of step by step.
And each step now is built using an incremental delta.
So you sort of sample the one hop neighbors for some nodes, A and B, for example, in step one.
And then in step two, you're going to sample, you know, the one hop neighbors for the neighbors of nodes A and B,
but only for the neighbors for which you have yet to sample one-hop neighbors, right? So each delta
is a new set of nodes, and those are the new ones that you're going to sample one-hop
neighbors for in the next step. But that delta is completely new nodes, you haven't seen them before
in the previous sort of steps.
And that's basically the idea behind Dense.
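A minimal sketch of that delta idea (cache one-hop samples and only sample for nodes never seen before) follows; it illustrates the principle rather than MariusGNN's actual DENSE implementation, and `sample_one_hop` is a hypothetical callback.

```python
def delta_multihop_sample(seed_nodes, num_hops, sample_one_hop):
    one_hop_cache = {}            # node -> its sampled one-hop neighbors (reused when seen again)
    frontier = set(seed_nodes)
    for _ in range(num_hops):
        delta = [n for n in frontier if n not in one_hop_cache]  # only the genuinely new nodes
        for node in delta:
            one_hop_cache[node] = sample_one_hop(node)           # sample just the delta
        # Next frontier: neighbors of the current frontier, cached or freshly sampled.
        frontier = {nbr for node in frontier for nbr in one_hop_cache[node]}
    return one_hop_cache          # enough to reconstruct the full multi-hop neighborhood
```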
And, you know, we spent a lot of time
developing sort of efficient algorithms as well
to go along with constructing it
or to go along with this high-level idea, right?
You also still need to be able to construct it quickly
on the CPU, and then you also need to be able
to use this data structure on the GPU, right?
Once you have Dense, you still need to be able to read it data structure on the GPU, right? Once you have Dense, you still need to be able to
read it and find out the multi-hop
neighborhoods on the GPU because
you need that for the computation.
And so we have some sort of parallel algorithms
on the CPU to construct Dense,
and it's sort of nice for
GPU kernels as well.
And yeah,
it worked really quite well in the end.
It allowed us to sort of sample, you know,
up to 15 times faster for three- and four-layer GNNs, and it also allowed us to even do the
computation faster, because it sort of gave us this compressed format of these multi-hop
neighborhoods and was very amenable to, you know, sort of optimized GPU kernels. And the scaling:
it works even better as the number of layers
increases, which makes sense, right?
The more layers you have, the more redundancy you have in these multi-hop neighborhoods.
As you said, you know, once you get to sort of five, six layers, you have almost the whole
graph.
And so you can see Dense actually outperform like optimized sampling kernels that are specifically developed to minimize sampling time.
Like for example,
those in NextDoor. And, you know,
Dense starts outperforming them after three,
four layers,
just because it scales so much better with respect to the number of GNN layers.
Awesome.
So I mean,
there's kind of,
there's two other sort of features that you mentioned in your paper about
Marius.
Maybe you can give us sort of like a rundown of how the partition replacement policies work and
all of that, and then we can talk some more about the numbers and the experiments.
Let's see how fast Marius goes.
Yeah.
So once you have that pipeline and dense data structure, that gives you the CPU to GPU piece.
And then the next piece is, well, what if of the CPU to GPU piece, right? And then the next
piece is like, well, what if the graph doesn't fit in CPU memory? So then again, as we said,
you have sort of two options. You can get a bunch of CPUs and do some sort of like distributed
CPU memory and then distributed training process. But in Marius, we sort of kind of continue with this
theme we have where we're like going to, you know, extract everything we can out of a single machine.
And with a single machine, you always have this sort of disk available to you that's not really made use of, right?
So you're sort of paying for it.
It's not the dominant cost.
The dominant cost is the GPUs and then the CPUs.
But you do have this available.
And so, you know, why not try to use it? And basically, the high-level idea behind disk-based training is that now you're going to store
your sort of graph and the base node feature vectors on disk. But unfortunately, it's not as
simple as just sort of having a pipeline where you do mini batch preparation on disk and then,
you know, move that mini batch to the CPU, and then move the mini batch to the GPU. Because first of all, you can't really do mini batch preparation on a disk, right? It's
not a processing unit. And then also because of the fact that mini batch preparation requires
an enormous amount of random access, right? So you're sort of randomly sampling the training
examples in the mini batch, and then you do all this neighborhood sampling, which is all random. And the disk just doesn't have the ability to do random access at that
granularity. So what you have to do to use disk is you have to do this, you have to move data
between disk and CPU memory using these larger sequential reads and writes. So the way we do that
is we store the sort of the nodes and the edges of the graph on disk and the base feature vectors for each node.
But we partition them.
We partition the nodes and their feature vectors into these sequential blocks.
And we also group the edges according to those partitions.
And then once you have that data layout on disk, you can then load subsets of the graph into CPU memory by sequentially reading these blocks.
So you read a sequentially read a subset of those partitions and their corresponding edges,
and you bring that into memory. And that's a much more efficient way to sort of transfer
data between disk and CPU memory. And then once you have the sort of subgraph and CPU memory,
then you can do the same sort of mixed CPU-GPU training on that in-memory subgraph that we've been talking about, using Dense, using the pipeline, etc. And you can do random access to sample your training examples and your multi-hop neighborhoods. But then, what do you do when not all partitions fit in CPU memory at once?
And that's where you get into this notion of having to sort of load some subset of the partitions in CPU memory, train on the training examples in that subset.
And then you have to swap some partition back out to disk and bring a new partition in.
And so you get this partition swapping such that over a sequence of swaps, eventually,
you bring in the whole graph from disk into CPU memory by swapping partitions in and out.
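As a sketch of that swapping loop (the callables and the replacement-policy interface are hypothetical stand-ins, not MariusGNN's actual API):

```python
def disk_based_epoch(num_partitions, buffer_size, load_partition, save_partition,
                     next_swap, train_on_subgraph):
    # A subset of node partitions lives in CPU memory; the rest stays on disk and is
    # exchanged with large sequential reads/writes rather than random access.
    buffer = {p: load_partition(p) for p in range(buffer_size)}
    while True:
        train_on_subgraph(buffer)                      # train on edges inside the in-memory subgraph
        swap = next_swap(set(buffer), num_partitions)  # replacement policy picks the next exchange
        if swap is None:
            break                                      # the whole graph has been covered this epoch
        evict, admit = swap
        save_partition(evict, buffer.pop(evict))       # sequential write back to disk
        buffer[admit] = load_partition(admit)          # sequential read of the incoming partition
```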
And that leads to the notion of having then some partition replacement policy,
which decides what partitions you're going to have in memory and when you're going to have the partitions in memory. And then you sort of, the first thing you come to when you think about,
okay, how do I design this partition replacement policy is that it should sort of minimize the
number of swaps and therefore correspondingly minimize the disk IO and the training time.
And so, you know,
we developed a policy that does that, that minimizes the number of swaps. It's called
BETA. That was sort of part of the original Marius work. But what we found in Marius GNN,
when we tried to use that policy for more complicated GNN models, is that it leads to
lower accuracy compared to training when you just have the full graph in
memory. So this setup of moving these partitions between disk and CPU memory actually leads to
sort of an accuracy drop compared to just buying a bigger machine with a bigger CPU memory,
basically. And that's where things get really kind of interesting with all the disk-based stuff in
Marius GNN: to sort of understand why that is the case and then how you can sort of improve that. And basically at a high level, why does
that occur? Well, it occurs because you're focusing heavily on the in-memory subgraph to generate your
training examples and their neighborhoods. So you're only doing random access to this sub graph instead of the
full graph. And that leads to sort of a biased training process. So the idea, the sort of
analogy here is to imagine that you have all of your images on disk, and then you just load in
the images of, like, the cats into CPU memory, you train on those randomly, then you swap those out,
you load in the images of dogs, you train on those randomly,
you swap those out,
and so on and so forth.
Instead of just putting all the images from all the classes in memory together and randomly
sampling from all of them.
So that's sort of the lack of randomness that you get by default almost because of this
only random access to the CPU memory.
And then you sort of do all the training on this,
the data you have in CPU memory, you swap some out, you bring some new in, you focus on that
for a little bit, and then you do another swap, etc. So you get into this predicament, which is
that when you use disk, in order to have high throughput, etc, you need to sort of do these
sequential reads and writes and do these larger swaps, not randomly access disk.
But then at the same time, in order to get high accuracy,
you need to randomly access the whole graph.
So it's actually almost,
it's like a conflict that you can't really necessarily solve, because there are
two fundamentally different, you know,
requirements for high accuracy and then also for high throughput.
But in Marius GNN,
we try to make a more flexible disk-based policy
that allows us to sort of get the best of both worlds in some sense.
Cool. Yeah. So should we talk numbers and experiments?
Yeah, sure.
So yeah, tell us about how you went about, I guess, evaluating Marius GNN
and what you compared it against and what your findings were.
Yeah. So at the time we ran these experiments, sort of DGL, the Deep Graph Library, and PyTorch Geometric were kind of the two most popular systems.
So we just compared, basically, the main experiments were end-to-end training in Marius GNN directly against those two systems. And we focused on node classification and link prediction tasks
using sort of the largest graphs available, which are sort of the open graph benchmark graphs.
And you have about 100 million nodes and maybe a few hundred million to a billion edges.
And we focused just kind of on common models like GraphSage, the GraphSage GNN, and Graph Attention Network as well.
And basically, the setup we used, yeah, we used AWS GPU machines.
And for baselines, basically, since they require the graph to be in CPU memory,
we basically, we chose the smallest machine which had enough CPU memory
to actually store the graph. But that smallest machine, you know,
may have been a few hundred gigabytes of CPU memory
and potentially came with multiple GPUs.
So we allowed the baselines to use those multiple GPUs
if they required a sufficiently large machine
that had multiple GPUs.
And for Marius GNN, what we did is we ran two versions.
So we ran one version, which was on that same machine
as what the baselines required to store the graph in CPU memory.
And so then we also just used sort of CPU and GPU training,
but we only used a single GPU for Marius GNN.
And then we also ran a second version of Marius GNN,
which was
the disk-based training version where we used the smallest GPU machine on AWS,
which didn't have enough CPU memory to store any of these graphs. But that's okay for the
disk-based Marius GNN, because we stored it on disk and then we used CPU memory as sort of that partition
cache with the partition replacement policy.
That's sort of the experiment setup.
And then the results, what we found basically is that when Marius GNN is on the same machine
as each of the baselines, and they're all sort of doing CPU GPU training, Marius was
maybe three to four times faster than the baselines, even though it was only using a single GPU.
And the baselines were often using maybe four or even eight GPUs.
And that's because of the sort of reduced sampling time with dense and then the pipelining.
So we were able to sort of get very high GPU utilization.
Nowadays, we actually have 100% GPU utilization with Marius GNN.
And then the disk-based training gives you sort of a different option.
So it was also about a couple times faster than the baselines,
but then it was significantly cheaper as well
because you're on a much cheaper machine,
maybe only a $3 machine instead of a $24 an hour machine.
So we were able to
reduce the training costs between one and two orders of magnitude up to 64 times, if I remember
correctly. And this was pretty consistent across both node classification and link prediction
that we saw these sorts of numbers. I mean, those are some great numbers.
Great. Are there any areas where it's like where
Mario's GNN is, is like suboptimal and kind of what are the limitations of it at the moment?
You know, we didn't talk about it in super detail, but even with the new partition replacement policies that we have and the new techniques in MariusGNN to help disk-based accuracy, there is still the potential that disk-based training gets you a little bit lower accuracy than if you were to train with the full graph in memory.
Of course, if that's an issue for you,
then MariusGNN has the option to do CPU-GPU training,
but you'll have to pay for enough CPU memory to do that.
And one problem that Marius does have
is if you do pay for a large CPU machine
to do CPU-GPU training, then you may get a lot of GPUs. And at the time, MariusGNN only had single-GPU support. So we've since worked on that and gotten multi-GPU support, but that's relatively recent and it's not quite available yet on GitHub. But that's coming. And then there are some other limitations.
One limitation compared to DGL and PyG is sort of like development support, documentation,
ease of use in some sense; like, Marius is written in C++.
And we've been working on Python bindings, but they're not fully supported yet.
So it definitely can be a little bit harder to work with than those systems,
especially if you need to sort of write some custom models or code or whatever.
But we're trying to sort of work on the engineering
and the usability aspect as well.
And then the last piece is, it doesn't quite have a distributed, multi-machine implementation yet either. But that's another thing we have on our list to do.
I see you're preempting my next question, which is: where do you go next with Marius and the overall project? I guess the focus is on usability and working on multi-machine support, those kinds of things. What's on the roadmap for the next, yeah, I don't know, six to seven months?
The next like sort of main research piece is like a distributed version, you know, so
how can you maintain a hundred percent GPU utilization when you have many GPUs and many
machines?
That's sort of the next research piece.
And then, yeah, there's a lot of engineering work that can always be done on everything from the installation, to the ease of writing your own models, to pre-processing and post-processing, all those sorts of things that people want when they want to actually use it for an application.
Yeah, for sure. I mean, that's sort of one angle where
it's harder to get papers published when the work is focused on usability. People care about numbers and things going faster, not "this piece of software is more usable", right?
Right. That's sort of a PhD versus startup difference a little bit, right? You know, in your PhD, there aren't necessarily bonus points for usability; you need to get papers. But that's it, right? Yeah.
And cool, so yeah, I guess this next question is: what impact do you think it can have longer term, and also what impact has MariusGNN had already? Have you seen much use of it already? Are people out there using it, or what's the feedback like?
Yeah, so our hope is that, you know, we've open sourced it and, like we said, we're working on making it easy to use. It's not perfect yet, but we're working on it. And we hope that allows researchers in many different scientific areas to easily and quickly train GNNs on their own data, and therefore enables them to further their research in a way that maybe was too expensive or too difficult to do without MariusGNN. So that's kind of the goal. And yeah, I mean, we've had a good number of people starting to use it on GitHub and, you know, posting questions and interacting with it.
So overall, I've been happy with how people have seen the system and the number of people trying to start using the system.
And it motivates us to keep making it better and adding more functionality, for sure.
Yeah, definitely.
I mean, it's really rewarding seeing your actual research be used by people, having that feedback loop, right? It motivates you.
Yeah, it goes back to that initial motivation for ML systems, right, where you get to build these things which are, you know, actually cloned and used.
Yeah, for sure. I mean, just on a tangent, I know for a lot of my work in my PhD, someone might read the paper one day, but the implementation is probably just going to stay dead somewhere on GitHub and no one is ever going to look at it again, which is pretty sad. So it's really nice that you have that feedback.
Yeah, for sure.
And cool, so I guess, you've been working on this project quite a long time. What's maybe the most interesting thing you've learned from working on it?
So I think the thing that stands out to myself and Jason as well,
I alluded to a little bit at the beginning,
but there's sort of this very interesting kind of interplay
and sort of even like trade-off almost between like algorithms
and implementation, right?
So for example, this DENSE data structure: the algorithm in itself is, you know, you're going to reuse the samples, okay; there's an algorithm to construct DENSE, there's an algorithm to use DENSE, right? And we knew those for maybe three months before we actually reduced the sampling time sufficiently. And the reason was that it was actually very difficult to implement those algorithms, the DENSE construction and the use of DENSE, well. It was very hard to implement them at high performance.
And we found that kind of consistent across a lot of things
where you can spend a lot of time on an algorithm and it has
good complexity, but it won't work actually because, you know, the implementation is very
important as well or vice versa. You know, you can just have a very simple algorithm, but
the differences in implementation can be enormous. So that's the thing that stood out to me: just the importance of implementation
to actually achieve high performance has been something I think that maybe goes under the
radar a little bit in some of these systems designs.
So even, you know, this isn't in the paper, right? Because it's not really new in that sense. But the implementation was very important in the end. That's something I found pretty interesting.
Yeah, I agree. I mean, a good idea implemented badly is just not going to get good performance, right? You can have the best idea in the world, but if you've got some hacky implementation of it, then it's not going to work.
Cool. Do you have any sort of war stories along that journey as well? Things you tried that failed, dead ends you ran into, that might be interesting for listeners to know?
I do have one more story. When we were originally implementing these new disk-based policies, we were getting really bad accuracy and we were very confused, and we were stuck for months, maybe like three months.
And in the end, what the issue was,
is that we were using the pread and pwrite function calls, right?
In C++.
And those have a limited block size, right?
So in other words, you can only read like two gigabytes or write two gigabytes.
And anything other than that, you basically just get like junk, like random bytes, right?
And that, I mean, we just missed it in the sort of like docs, right?
But that led to, you know, one of these silent failures in ML, where it's not an actual compile error or runtime error, right? The error just shows up as poor accuracy, so you don't know where it's coming from. There's no error you can look at; you just see poor accuracy, that's the only feedback you get, and you have to figure out where it's coming from. And this was one of those which took a long, long time, like, you know, months.
Oh, wow.
Yes.
And we weren't even really, you know, we didn't even really know it was a bug, right?
We just were like, man, maybe disk-based training doesn't work.
Yeah.
But it turns out that you can only read and write two gigabytes at a time, or whatever the max int32 is.
Oh, crazy.
You'll never forget that now. That'll be ingrained in your brain forever.
No, yeah, well, now we have pread and pwrite wrappers. You can call the pwrite wrapper with as much data as you want, and it'll break it up into two-gig chunks.
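(For the curious, here is a minimal sketch of that kind of chunked wrapper, written in Python for brevity even though Marius itself is C++. It assumes a POSIX system where a single pread/pwrite call transfers at most roughly 2 GiB, so the wrapper simply loops.)

```python
# Minimal sketch of chunked pread/pwrite wrappers (Python for brevity; Marius is C++).
# A single pread/pwrite call on Linux transfers at most roughly 2 GiB, so large
# buffers must be split into chunks rather than passed in one call.
import os

MAX_CHUNK = 1 << 30  # 1 GiB per call, safely under the ~2 GiB limit

def chunked_pwrite(fd: int, buf: bytes, offset: int) -> None:
    """Write all of buf at the given file offset, one chunk at a time."""
    view = memoryview(buf)
    while len(view) > 0:
        written = os.pwrite(fd, view[:MAX_CHUNK], offset)
        view = view[written:]
        offset += written

def chunked_pread(fd: int, nbytes: int, offset: int) -> bytes:
    """Read exactly nbytes from the given offset, one chunk at a time."""
    parts = []
    while nbytes > 0:
        chunk = os.pread(fd, min(nbytes, MAX_CHUNK), offset)
        if not chunk:
            raise EOFError("hit end of file before reading the requested bytes")
        parts.append(chunk)
        offset += len(chunk)
        nbytes -= len(chunk)
    return b"".join(parts)
```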
Nice, yeah.
That was rough, yeah.
That's a tough one.
Yeah, it's interesting. I obviously don't have that much experience in the ML world, but that silent failure where you just get bad accuracy, it's like, damn, how do you go back and figure out where in the whole pipeline this is coming from? Or whether it's just a bad approach?
Exactly, yeah.
Yeah, with this one we were almost ready to give up on the disk-based training, really, because it took so long and I looked at so many things. So eventually I got it, which was... I mean, I kind of thought it always should have been working better, so I was pretty surprised. Because it worked for the simple models in the original Marius, you see; it didn't work for the GNNs, but we had also rewritten a lot of code. Anyway, that was the worst one for sure.
Yeah, I bet you slept so well the night after you found it. Like, that was lovely.
Yeah, yeah. Cool. I guess your main work is all around Marius, so do you have any other research going on at the moment as well?
Yeah. So I think, you know, back to the very beginning, in general, I'm just trying to make
training more efficient. And there's a lot of options you have to do that, right? You can focus
on the system, you can focus on the model, you can focus on the data. I've done a little bit in
each of those pieces, focusing on the systems side with Marius and MariusGNN.
We have some new systems we're working on.
I've done a little bit in terms of just making the models more efficient in the sense of
compressing the model during training.
I did a little bit of work on that and using the sort of activations, the fact that the
activations are low rank.
And then more recently I've done a little bit on trying to compress the data. A lot of people have been working on data pruning these days: how can you train to the same accuracy using a subset instead of the full data set? So I've looked into that recently. But yeah, at a high level: how can we make training more efficient, you know?
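(As a toy illustration of the generic "activations are low rank" idea mentioned above, and not the specific method from this line of work: you can keep only the top-k singular factors of an activation matrix instead of the full matrix.)

```python
# Toy illustration of exploiting low-rank activations: keep only the top-k
# singular factors of an activation matrix. This is the generic linear-algebra
# idea, not the specific compression method discussed in the episode.
import numpy as np

def low_rank_compress(acts: np.ndarray, k: int):
    """Return factors (A, B) with A @ B approximating acts via a rank-k truncated SVD."""
    U, S, Vt = np.linalg.svd(acts, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k, :]       # store these instead of the full matrix

rng = np.random.default_rng(0)
acts = rng.standard_normal((1024, 32)) @ rng.standard_normal((32, 512))  # genuinely low-rank "activations"
A, B = low_rank_compress(acts, k=32)         # roughly 10x fewer numbers to store
err = np.linalg.norm(acts - A @ B) / np.linalg.norm(acts)
print(f"relative reconstruction error: {err:.2e}")   # near zero because rank(acts) <= 32
```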
Nice.
I guess, sorry, yeah: how can we make it like humans, right, where you just see one or two examples and then you know what a dog is?
Yeah, yeah. I guess the holy grail system factors in all these different areas as well: it combines stuff to do with the model, the data, the infrastructure, whatever, and brings all those things together. So what would be the most advanced system that brings all those things together at the moment? What is the state of the art? Or do most things just focus on one specific thing?
So, I mean, I think maybe in the news you've seen
Mosaic, right, and their recent sale. Their original idea, maybe, was to just focus on giving you a platform to efficiently train on your own data, and that platform had lots of tricks and efficient training and efficient implementations internally. And, you know, they've transitioned to LLMs a little bit, but they're still giving you the ability to train on your own data cheaply and efficiently.
So, you know, I mean,
the Holy Grail is kind of
maybe similar to that,
where you have some system
which just takes in user data
and somehow it analyzes it
and sort of efficiently decides first
what data it needs from that full data that the user provided.
What's the most important data?
Then it sort of efficiently and automatically decides what's the efficient model,
you know, most efficient model that I need to train on this data to achieve, you know, whatever the user wants to achieve.
And then you've sort of selected high quality data and a high quality
efficient model.
And then you have some sort of actual systems implementation that goes and
actually runs it well.
And some distributed trainings cluster or on a single machine or whatever.
Right.
But, you know, those,
those are difficult questions to sort of automatically decide what's
important data, what model will achieve good accuracy for low cost, et cetera, right?
We'll get that one day.
Yeah, we probably will.
I mean, that's kind of, you know,
it's not so far from what some of these new companies have been trying to do,
like even like Mosaic.
They had that software suite that kind of picked a set of model optimizations to use; I can't remember what it's called.
But cool. So yeah, my next question, Roger, is my favorite question of all. It's about the creative process, and I love seeing how people's answers to this question diverge. So the question is: how do you approach generating ideas, and then how do you determine which ones to select? What's your creative process?
Yeah, so for me, it's all about really simple examples
that you can really analyze really well.
So I like to make little prototypes in Python.
And then that's kind of like,
then you have the fork in the road where it's like,
okay, I did this little prototype, you know,
very simple. And then you sort of, is it going to work at scale? Is it worth sort of pursuing for
longer effort? So like I was recently working on a problem, which was like a little bit like of a,
can we do basically an alternative to backpropagation?
And so it was, you know,
can I implement this for one or two layers in Python?
No biases, just weights, just super simple dense layers. Like each layer is Y equals matrix times vector, W times X.
And then, you know, you can very quickly
dive very deep into that process, you know?
So can you beat backpropagation for this simple two-layer network?
And, you know, you can try very many things.
You can get very into the weeds.
But then you also sort of don't spend too much time, right?
It might be 100 lines of Python.
That's a very simple prototype.
And that allows you to, you know, say: no, we can't very easily beat backpropagation without having to go into transformers and C++ and all these complicated things.
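(To give a flavour of what such a throwaway prototype might look like, here is a hypothetical sketch, not the actual experiment: a two-layer, bias-free network y = W2 relu(W1 x), trained with plain backpropagation as the baseline any alternative update rule would have to beat.)

```python
# Hypothetical flavour of the kind of tiny prototype described (not the actual
# experiment): a two-layer, bias-free network y = W2 @ relu(W1 @ x), trained with
# plain backpropagation as the baseline an alternative update rule must beat.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))          # toy inputs
W_true = rng.standard_normal((5, 20))
Y = X @ W_true.T                             # toy targets from a linear teacher
W1 = 0.1 * rng.standard_normal((64, 20))     # layer 1 weights (no biases)
W2 = 0.1 * rng.standard_normal((5, 64))      # layer 2 weights

lr = 0.02
for step in range(2000):
    H = np.maximum(X @ W1.T, 0.0)            # hidden activations (ReLU)
    err = H @ W2.T - Y                       # prediction error
    grad_W2 = err.T @ H / len(X)             # gradient for layer 2
    dH = (err @ W2) * (H > 0)                # backprop through the ReLU
    grad_W1 = dH.T @ X / len(X)              # gradient for layer 1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

H = np.maximum(X @ W1.T, 0.0)
print("final MSE:", float(np.mean((H @ W2.T - Y) ** 2)))
```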
So that's kind of how I work on problems. I like to prototype, even if people have done it before, just to see the problem myself, see what's actually hard about it. But, you know, there are a lot of other things that are involved, right? How do you decide what problems to prototype, what problems to think about? That's a little bit random in some sense, maybe, but you get that from papers, from talking to Theo, my advisor, and talking with other people and stuff.
I've never had that answer before. They're all so great because everyone's answer is always so different. Yeah, that idea about prototyping stuff and feeling things out, getting an intuition for it and understanding the problem that way, that's such a great way of doing things. I really love it.
Yeah, I have this PyCharm project which is just called experimental, that's the overall folder, and then there's just like a thousand subfolders of random things I've tried. So many random things.
Oh, that's fantastic.
Great.
Anyway, it's time for the last question now, Roger.
So it's like, what's the one thing you want the listener to take away
from this podcast today?
So I think in one sentence,
maybe the main thing from Marius and MariusGNN is to first maximize the use of the resources you have before scaling out to more resources. I think that's the takeaway from Marius and MariusGNN in one sentence.
And, you know, we're not the first to say that, right, as you mentioned, like the COST paper, et cetera.
But I think it's, you know, it's important to sort of revisit that as we have these very complex distributed machine learning deployments.
Um, fantastic.
Uh, let's end it there.
And it's been brilliant talking to you, Roger.
If you're interested to know more about Roger's work, we'll put links to everything in the
show notes so you can go find it.
And again, if you enjoy the show, please do consider supporting us through Buy Me a Coffee.
And we'll see you all next time for some more awesome computer science research.