Signals and Threads - The Uncertain Art of Accelerating ML Models with Sylvain Gugger
Episode Date: October 14, 2024

Sylvain Gugger is a former math teacher who fell into machine learning via a MOOC and became an expert in the low-level performance details of neural networks. He's now on the ML infrastructure team... at Jane Street, where he helps traders speed up their models. In this episode, Sylvain and Ron go deep on learning rate schedules; the subtle performance bugs PyTorch lets you write; how to keep a hungry GPU well-fed; and lots more, including the foremost importance of reproducibility in training runs. They also discuss some of the unique challenges of doing ML in the world of trading, like the unusual size and shape of market data and the need to do inference at shockingly low latencies.

You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
- "Practical Deep Learning for Coders," a FastAI MOOC by Jeremy Howard, and the book, of which Sylvain is a co-author.
- The Stanford DAWNBench competition that Sylvain participated in.
- HuggingFace, and the Accelerate library that Sylvain wrote there.
- Some of the languages/systems for expressing ML models that were discussed: PyTorch, TensorFlow, Jax, Mojo, and Triton
- CUDA graphs and streams
- Hogwild concurrency
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky.
It's my pleasure to introduce Sylvain Gugger. Sylvain is a machine learning engineer here at Jane Street, and he's done a bunch of interesting stuff in the outside world as well. He was a core maintainer of Hugging Face's Transformers library. He wrote Hugging Face Accelerate,
which is a nice library from them
that helps you run your models performantly
on a lot of different kinds of hardware.
And he also wrote a lovely book
along with Jeremy Howard
called Deep Learning for Coders
with Fast AI and PyTorch.
So he's done a bunch of interesting stuff
in the outside world.
He's also doing a lot of interesting
machine learning work here at Jane Street.
So thanks for joining me.
Thanks. I'm very honored to be here.
And just to kick things off, I'd love to hear a little bit more about your background. And in
particular, how did you get to work on machine learning in the first place?
So that's a good question. I was originally a math teacher, like 10 years ago, teaching at the first year of university level. And yeah, I moved to the US in 2015. I had kids, so I took some small projects at home and mostly took care of my kids.
In 2017, AI was kind of becoming more mainstream.
I actually read an article in the New York Times about it.
It was going to steal, like, everyone's jobs
in the next two or three years.
That didn't happen, but it's still something
that became more mainstream.
And at the end of the article,
they were mentioning a couple of online courses
for people interested in diving in deeper.
And I was interested, so I dived into it.
So one of the courses mentioned
was the Fast.ai course by Jeremy Howard,
which I followed.
It was very interesting.
And I started commenting a little bit
more and more on the forums
and making a couple of contributions
to the Fast.ai library,
which is used throughout the course
to make training models a little bit faster
and a little bit easier.
And then towards the end of the course,
Jeremy led a FastAI team to this competition
called the DAWNBench competition,
which is the ancestor of the MLPerf benchmark.
It was organized by Stanford,
and the goal was to train a computer vision model
as fast as possible to a given accuracy.
And so we entered the competition
and I helped out the team.
And we were like positioned first for the longest time.
And yeah, at the very end, Google kind of toppled us by publicly releasing TPUs for the
first time.
And yeah, those massive computers that no one else had access to trashed our best entry
and our best time.
So I want to hear more about the competition.
But before that, can you tell me a little bit about like, what is Fast.ai?
What's the basic program there?
What's the mission behind that organization?
So Fast.ai is a non-profit whose goal is to educate people about deep learning.
Especially in those early years, it was starting to become more mainstream,
but not necessarily as mainstream as it is today.
And the idea behind it, which I believe too,
is that to get the best model,
you need good machine learning engineers,
but you also need people
who really understand the data
that those models are going to consume. So if you want
a good model in radiology, you need
very good radiologists to kind of
understand how machine learning works, so that they're
going to be able to help you build those best models.
The Fast AI course is aimed both at coders who want to really dive deep into machine learning, but also at beginners. It's more of an introduction that anyone who is interested can take to learn more about what machine learning is, what those deep learning models are, and what they can do. So the
basic idea is to democratize machine learning so all sorts of domain experts can actually know
enough about it to really leverage it in a meaningful way. Exactly. You said it way better
than I did. Let's get back to the competition. So like the end of the competition is like a great dramatic story
of Google gorilla-stomping on everything
by dropping TPUs at the last moment.
But what were you actually doing
in order to get into the first place
before Google kind of jumped in there?
Yeah, so a couple of things.
The main thing is related to the way
we are training the model
and in particular,
like the learning rate schedule.
So to take a little step back,
when you train those machine learning models,
initially your model is random,
so it's outputting crappy predictions.
But then you compute a loss,
and from that loss, some gradients
that are going to make your model a little bit better
if you adjust the weight following those gradients.
So the whole process is called stochastic gradient descent.
Right, and just to say a higher-level thing about all this,
this is just an example of a more general thing
called function optimization, right?
You have some function that you want to optimize. In this case, the function is
given the set of model weights and the input you want to run on. You want to find the set
of model weights that give you the best and most accurate answer. And we just approach this like
we do in some sense, almost any other kind of optimization problem with techniques that actually
go back 50 years or something of, we're just going to compute a derivative,
and we're going to walk the model weight
in the direction of the derivative
and just do that over and over until we
get to a more optimal result.
Yes, exactly.
Like, the whole process and the whole math behind it
existed for like 50, 60 years.
It's just that with GPUs becoming more and more powerful,
we actually have the compute to apply that process
to complex problems like deep learning.
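(To make the process being described concrete, here is a rough sketch of one stochastic gradient descent step in PyTorch. The names model, loss_fn, and batch are hypothetical stand-ins, not anything from the episode.)

```python
# A minimal sketch of one step of stochastic gradient descent in PyTorch.
import torch

def sgd_step(model, loss_fn, batch, lr=0.01):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)   # forward pass: compute the loss
    loss.backward()                          # backward pass: compute gradients
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                 # walk the weights along the gradient
            p.grad = None                    # clear gradients for the next step
    return loss
```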
So that very important hyperparameter, the learning rate, is kind of the size of the
step we take, like following those gradients.
At the time of the competition, the most popular learning rate schedules were very inefficient,
just training at a low learning rate for a very long time.
And then we divide that low learning rate by 10 and we train for even a longer time.
That did converge to a good accuracy, but
it was very inefficient. And one of the things
we had in our competition entry was
to follow a learning rate schedule
that is more like a warm-up from a low learning rate to a
high learning rate. So not start at a high learning rate because
otherwise the model immediately explodes,
but by warming up, so starting from
something low and gradually increasing it
to the maximum, we can have the model
learn a little bit of something for those
and then have high learning rate for a little bit of time
so that we can explore the loss landscape efficiently
and then decrease it towards the end.
And this kind of schedule made it possible
to train the model to the same accuracy, but way faster.
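(A rough sketch of a warm-up-then-decay schedule of the kind described here: a linear ramp to a peak learning rate followed by a cosine decay. The peak value and step counts are invented for illustration, not the schedule used in the actual competition entry.)

```python
# Warm up the learning rate linearly, then decay it with a cosine curve.
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```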
Why is that a better schedule?
If you kind of just think about this
without knowing a lot of the details,
the idea that when you're very far from the answer,
you want to take large steps.
And when you get closer to the answer, you want to take smaller steps.
Seems intuitive.
But here you say instead, you start with small steps and then go up to big steps and then go down to small steps.
So what about the structure of the problem makes that the right approach?
I mean, at the beginning of the problem, like since your model is randomly initialized,
in the landscape of your loss function, you are very, very high and you actually have very steep canyons.
So if you take small steps at the beginning, you can at least begin to descend into one of those canyons of the loss function, and then increase that learning rate to dive through there fast. And you will skip over a lot of local minima because your learning rate is large. So towards the end, you need that decrease to step down further into one of those smaller parts of the loss landscape that have those local minima. So is the intuition here that
when you start at a randomly initialized point, the terrain around which you're trying to optimize
is just more wild. And if you take big steps, the derivatives are very high and you're kind of
jumping all over the place. But even with a little bit of optimization away from that initial
randomness, you end up into something that feels like a more regular space. And now you can go back
to what makes more intuitive sense.
Yes.
It depends also, like,
we're talking about just a 3D problem,
but we have millions of dimensions
because your model has millions of parameters.
So it's the idea that, yeah,
on some of those dimensions,
the landscape is very, very spiky.
So at least taking care of that
at the beginning with a low learning rate
is going to make the whole optimization problem easier.
And then you can have larger steps.
Yeah, I do think this is kind of terrible intuition
when one thinks about a problem like this,
of like, I'll try and visualize it in two or three dimensions.
And you're like, you have just lost all of the important structure,
and you really need to think about this high-dimensional problem
to really know what's going on.
That was one of the optimizations.
The other optimization we did was related to it being a computer vision problem.
And so the kind of models we applied to them,
which are called CNNs for Convolutional Neural Networks,
they can work on any size of images
because it's just some kind of filter
that you apply over all of your image.
And so the idea is at the beginning of the model,
like when you train the model, it's random, it's crappy,
so it doesn't really need to see the whole picture.
We kind of gave it a more blurry version of the picture,
like just 128 by 128.
And then gradually as training goes on,
we increase the size of those images
to make them more of the standard size
people doing that problem were using.
And you have that gradual resizing as well
because if at the beginning your image is smaller
and you have your filter to apply all around that image,
it's going to be more efficient
if you have fewer pixels compared to doing the training
with always the high-resolution images.
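(A purely illustrative sketch of gradual resizing in PyTorch, using torchvision transforms: small images in the early epochs, full-size images later. The sizes and epoch boundary are made up.)

```python
# Train on smaller, blurrier images early on, then switch to full resolution.
from torchvision import transforms

def transforms_for_epoch(epoch):
    size = 128 if epoch < 10 else 224   # small images early, standard size later
    return transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
    ])
```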
Right, so there's two neat properties of convolutional neural networks that are coming into play here. The first is that convolutional networks are in general a dimensionality reduction
trick. You could imagine like a big network that's applying to all of the different inputs and all
the different parts of the image. And then you could just have weights that are individual to
all the neurons that are associated with all the different parts of the image. But that's enormously
wasteful because in the early parts of the network, you actually want sort of the
same regular structure over and over. And so the basic idea of a CNN is you kind of lock those
weights together. So you, in some sense, just have one copy of this neuron, which is activated
multiple times in multiple places. But then once you've done that trick, you also have this
resolution independence where you can run it at multiple different resolutions and you're saying,
well, we're just going to like train this thing at low resolution. And then again, after it gets
into the ballpark and we need to more precisely fine-tune it, then we'll increase the resolution
and do the rest of it. And were these essentially kind of new techniques at the time of this puzzle?
Yeah, both of them were new techniques. The gradual resizing is still not that widely used.
The new kind of learning rate schedule, though, everyone uses that now. Like all Transformers models, GPT-3.5, I think GPT-5 now, not for sure because OpenAI doesn't publish its research, but the open source versions of that are trained using that kind of schedule. And since BERT, we have seen that kind of learning rate schedule all the time.
So that's how you got
into Fast.ai
and you got into
this competition space.
How did you end up
being co-author of this book?
So after like collaborating on the Fast.ai library
and participating in the forums and that competition,
Jeremy Howard kindly offered me a job at Fast.ai,
which I accepted.
So I worked there for two years,
built a couple versions of the Fast.ai library
and two iterations of the online course.
And it was very natural going from the course
to publish a reference book
with kind of the same content,
just in a different format for people who prefer to learn from books
instead of YouTube videos.
Got it.
And then what brought you to Hugging Face?
And what is Hugging Face?
So Hugging Face is kind of the GitHub of machine learning.
The idea is that we have a website
that looks kind of like GitHub,
except of having repos with code,
you have repos with model weights.
So like LMA1, LMA2, LMA3, and 3.2, it was released a couple of like GitHub, except of having repos with code, you have repos with model weights. So like LMA1, LMA2, LMA3, and 3.2,
that was released a couple of days ago,
are all on Hugging Face,
along with, I think now is a million public model
for more kind of applications of machine learning,
like computer vision, text, speech, et cetera, et cetera.
And yes, the idea is that they are kind of at the forefront of open source AI by allowing people to share those models. They accompany that with a couple of libraries, because model weights are all very good,
but if you don't have the code to actually instantiate those models,
they're kind of useless.
To complement that, they have libraries like the Transformers library,
which actually contains the code of those models.
And how did you end up at Hugging Face?
In 2020, there was this thing that happened worldwide.
Oh, yeah. I vaguely remember that.
Yeah.
And so that kind of disrupted some plans at Fast AI.
So I looked for another job.
And there was this startup from French people,
which was based in New York City.
And so I knew them from the French tech community
in New York City.
And I'd met them a couple of times before.
They were looking to expand.
So I applied to Hugging Face,
and I joined them randomly in June of 2020.
It was a continuation of my work in open source from Fast AI to democratize machine learning, helping people use the Transformers library or their website with all the public weights on it.
So what kind of technical work
did you end up doing at Hugging Face?
A couple of things.
It was about the maintenance
of the open source libraries
because, yeah, there are people
doing pull requests,
having issues kind of all the time.
So that is already a huge amount of work.
Then I developed like new tutorials and new examples
to help people use those libraries.
And that kind of ended with an online course
that was meant to be taken after the Fast AI course,
like for people who wanted to specialize
a little bit more into transformers.
So there are those two aspects.
And then, yeah, at some point,
like all the researchers at Hugging Face
were kind of annoyed by like our big black box trainer,
which contained like all the stuff of the training loop.
And it becomes with time,
like this huge amount of spaghetti code
because you have like new flags
that appear to kind of control everything
that people want to do with their trainings.
And so I created a new open source library
to make it much more lightweight, to allow people to write their own trainings so that they can have more flexibility.
The idea is that usually those APIs that train models, you have a trainer API and you give it some things like your model and your data and you click the train and it trains and it's marvelous for people who just want that.
But yeah, researchers who wanted to change or tweak the training loop a little bit were struggling a bit more.
So there are various techniques that have been applied for that in the past.
Like in Fast.ai, we had some callback-based systems.
So we had callbacks that the researcher could implement
to change a little bit the behavior of the training loop
at this particular point or another.
The Hugging Face trainer was less extensible.
But for that library called Accelerate,
I went back to, yeah, if the researcher is just going to write their training loop
and there's not going to be like a black box trainer,
and they just need to change like a couple of lines here and there
to make it run on any kind of systems.
At first, it was like six lines, then five lines.
We tried to reduce that number of lines to the absolute minimum
so that there was as little intrusion as possible
and that kind of gave us the Accelerate API.
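(A minimal sketch of what using Accelerate looks like: a plain PyTorch training loop with a couple of lines changed so the same code can run on one GPU, several GPUs, or a TPU. The model, optimizer, dataloader, and loss_fn are assumed to be defined elsewhere.)

```python
from accelerate import Accelerator

accelerator = Accelerator()
# Wrap the usual PyTorch objects; Accelerate adds the right distributed setup.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```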
And when you say you want to make it possible for people to do their training on multiple different kinds of systems, what is the diversity of systems underneath that you're thinking about?
What are the kinds of systems
and different variations on the training
that you are trying to enable with Accelerate?
Training requires a lot of data, usually,
when you train those large language models
or even other kind of models.
And to make it more efficient,
usually you kind of divide and conquer. And if
you have multiple GPUs, you give a slice of the data set to each of your GPUs. And so let's say
you have N GPUs, then your training time should be divided by N at the end of the day, because you fully parallelize the things that you care about just by splitting your data this way. So
this is called data parallelism. And it's kind of the first level of parallelism
we can use
when we have
multiple GPUs
and we want to
run a training on them.
And so you can do that in PyTorch, except it requires some kind of boilerplate code that is a bit annoying. So the idea of Accelerate was to remove that boilerplate code by just having to change a couple of lines in your training loop, and poof,
your model can now run training on multiple GPUs, also on TPUs, because of course, like
the code to run them with the same kind of distributed data parallelism on TPUs is different
from the one of GPUs.
That would be too simple otherwise.
And then once you have done the modification, it still runs on CPU as well.
So like the idea is that it kind of deals with all of that crap for you of detecting
which kind of environment you're on
and then adding the boilerplate code that is needed
for your training to run successfully on all of those kinds of systems.
And then also, if you want to train in a mixed precision setting because you want to use lower precision types, and we can talk about that later, it dealt with the additional code that was required to properly do that, kind of automatically.
Yeah, I mean, I think this whole discussion kind of underlines just the diversity of different
hardware and setups that you can do when you're doing training. There's the kind of, in some sense,
simplest thing of like you can run your training on a CPU, which is a thing that people did for a
long time. And then there are multiple different parallel architectures, GPUs, which are like
literally descendants of graphics programming chips, and TPUs, which are these tensor processors that Google came up with. And the main game here,
going from the CPUs to the GPUs and TPUs, is about parallelism. It turns out CPUs are these
kind of funny machines that have lots of parallel circuits, but they're interpreters for a brutally
sequential programming language, right? And so they're not that good at doing lots of things
in parallel. And in fact, there's all the complexities of like multi-core architectures and
stuff on that side, which is how you try and take advantage of parallelism there. But then GPUs and
TPUs are machines that are much more directly parallel in their structure and built for
large scale, highly regular parallel computations. And then at some point, those things aren't enough
either. And now you start getting to various forms of distributed.
You want so much parallelism, you want multiple GPUs.
And then the first thing you were talking about was this data parallel training,
where what we're doing is we're running this like stochastic gradient descent,
where we're like picking random subsets of data, breaking it up into batches,
and then training on individual GPUs and computing like a net gradient,
which we then use for updating the model.
And then there's also pipeline style parallelism,
which you might need when your model itself is too big to fit.
In fact, not just pipeline parallelism,
but various kinds of model-level parallelism,
where you actually take the model and break it up and split it among multiple GPUs,
because even the model weights themselves are too big to fit there.
And then Accelerate is trying to help you write your model once
and your training loop once
and do a modest amount of modifications
to be able to access this whole sweep of different ways
of doing the training.
Yeah, exactly.
If your model does not fit anymore on one GPU,
you can split it different ways.
You can split the layers.
You can say, if it's a deep learning model, usually those are bigger because you have stacked more layers. You have layer one on GPU one, layer two on GPU two,
layer three on GPU three, et cetera, et cetera,
which is like a good idea because then your model fits.
But then there is this inefficiency in the sense of GPU two has to wait for GPU one
to be finished to be able to process its result
and pass it along to GPU three.
And so that's where pipeline parallelism comes into play
where you're trying to pipeline things efficiently.
So like give a little bit of your data to GPU1,
which is going to send it to GPU2.
And then GPU1 will process the second little bit of data
while GPU2 is busy computing the first part.
And there's this ping pong between a forward
when you run through your model
and the backward path
where you compute all of your gradients.
So you can also efficiently interleave
like some part of the forward
and some part of the backward computation
in that pipeline parallelism.
And then there is like tensor parallelism where instead of splitting your model by layers,
you actually split the weights of your model into chunks, and each GPU only sees one part of the weights. And so then the GPUs need to come together and agree on the results of all the
matrix multiplies that you compute. So this kind of parallelism requires way more, I mean, a very
efficient way to communicate between GPUs
to be accessible.
That's right.
Maybe the other interesting thing
about the hardware around this kind of stuff
is the criticality of the network.
You need these very fast network transfers
to do the tensor exchanges.
And yeah, there are some contexts
where it can be a little less critical
because you can overlap compute and data.
But for some things, like this tensor parallelism,
the GPUs are just going to be sitting idle
while you're waiting.
So we nowadays have these kind of wild new networks which have much, much higher capacity and are very focused on these very low latency and high determinism data transfers.
One of the things I think is interesting about this is the way in which the networking stack has changed, right? I think when I started learning about how do you do high-performance trading systems,
I learned about, well, the operating system kernel is obviously too slow.
So if you want to be reasonably fast, you have to do kernel bypass.
You have to have a user-level networking stack that's doing the communication.
And these systems use a technology called RDMA, remote direct memory access, which I
think an easier way of understanding what's going on here is it's CPU bypass, right?
Basically, network comes in on the NIC
and then without going through any CPU at all,
just gets copied directly to the place in memory
that it needs to go,
maybe directly into the GPU memory.
So you're really cutting away all of the fat
from the bones that you can
to make this stuff go as fast as possible.
Yes, and even in the more recent hardware that NVIDIA announced at the last GTC, you kind of stack your GPUs as close as possible and you try to put as many as you can in a single cabinet. So there are 72 GPUs in the same cabinet,
very close to each other.
So that you can have even faster network between those
because you have like you stack some,
the network is in the middle,
some GPUs above, some GPUs below,
and like they have this big NVLink in the back that links everything together very fast
just because they sit very close together.
Yeah, you start caring an enormous amount about the physical layer at this point.
Today, we can get these NVLink setups where inside of a single box with, say, eight GPUs in it,
you get this fast network.
And yeah, what you're describing is doing this at the cabinet level.
Yes.
Which is funny.
Yeah, I mean, I remember hearing people
talk about like earlier hacks,
not for machine learning,
but for other purposes
where people would like, you know,
basically try and make little supercomputers
where you unroll your PCI Express network
and basically spread it over an entire cabinet.
And in some sense, InfiniBand
sort of grew out of the similar
supercomputer networking fabric.
And indeed, InfiniBand plays a real role
in how these GPU networks work as well.
Okay, that was the stuff you did at Hugging Face.
So more recently, you joined Jane Street.
Tell me a little bit about what your role here entails.
Sure. So at Jane Street, I mostly work on the engineering performance
around machine learning.
The day-to-day life is a researcher
will come to me with a model they've trained. And they're like, oh, my training is going really slowly. Could you help me with that?
And yeah, we'll profile it together, try to identify the bottlenecks and yeah, make it faster.
To take a step back, like most of the researchers here at Jane Street use PyTorch, which is the software
to write neural nets and train models, which has the particularity of being really accessible
because it's eager. The counterparts from Google, TensorFlow, and JAX are more like compiled languages.
So it's kind of harder to get started because you write your model, but then it does not
compile.
And so you need to fix some of the operations that seem like a valid Python operation, but
you need to kind of modify them so that TensorFlow or JAX recognize them and see, oh, this is
what you are trying to do.
Whereas in PyTorch, you can do anything you want,
but then your code can be inefficient in surprising ways
because that particular operation, for instance,
has no implementation on the GPU.
And so the computer needs to transfer data back to the CPU
just to be able to execute it
and then send it back the other way.
And in general, especially on modern GPUs, the way PyTorch works is that when you want to execute a model, the CPU dispatches the operation on the GPU asynchronously, so that the CPU immediately runs to the next instruction. Your hardware is in a good state if your CPU is always ahead of your GPU, so that the GPU has lots of stuff to process. But as soon as your code requires some synchronization
because you need some data back from the GPU to the CPU,
it can become pretty inefficient just because you're kind of stalling the GPU as the CPU will
wait to get the data back. And then it will take time for the CPU to send back some new operations
to execute to the GPU. Right. And it's that waiting where the GPU is waiting on the CPU.
It's slow for a lot of reasons. It's slow because the memory transfers are slow. It's slow because
CPUs are inherently slow.
And then, oh my God,
it's slow because
the code that's running
is written in Python,
which is maybe like
60 times slower
than what the
corresponding thing
written in C
might have looked like.
Exactly.
Even if you don't
care about GPU,
most of your Python code,
you will always try
to have it vectorized.
We're trying to write
as few for loops
in Python as possible
because those will
be very slow.
Whereas if you can execute like an operation from like NumPy, which will be backed by C or C++, it will be much faster.
And it's kind of the same idea for the GPU,
except on top of that, you have that complexity
to avoid synchronization point between the CPU
and the GPU as much as possible.
And notably, when a C programmer says,
oh, I want to make sure this is vectorized,
what they mean is I want to make sure I'm using
like the SSE, AVX, whatever
instructions they're vectorizing that are using
fundamental parallelism technologies baked into the CPU
to be able to do four or eight or whatever computations
in parallel.
And when a Python programmer says vectorize, what they mean
is the inner loop is in C. And maybe it's also
vectorized with AVX or whatever at the bottom.
But the fundamental thing is getting away from the Python interpreter loop.
Exactly.
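(A small illustration of the two meanings of "vectorize" in Python land: the slow version loops in the interpreter, the fast version pushes the loop down into NumPy's C implementation, which may itself use SIMD instructions underneath.)

```python
import numpy as np

x = np.random.rand(1_000_000)

# Slow: a Python-level for loop, one interpreter iteration per element.
total_slow = 0.0
for v in x:
    total_slow += v * v

# Fast: the same reduction expressed as a single whole-array operation.
total_fast = np.dot(x, x)
```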
Sometimes you can have code that looks very innocuous,
but you're actually executing a for loop,
which is going to, at every iteration,
trigger a synchronization between the CPU and the GPU,
which is extremely bad,
because you'll launch a tiny operation on the GPU
and then have to wait for the GPU to finish it
and get back the result to the CPU,
and then launch a new tiny operation on the GPU, et cetera, et cetera.
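(A hedged illustration of that innocuous-looking pattern: every .item() call forces the CPU to wait for the GPU, once per loop iteration. The tensor here is a made-up example.)

```python
import torch

xs = torch.randn(1000, 1000, device="cuda")

# Bad: synchronizes CPU and GPU on every iteration of the Python loop.
total = 0.0
for i in range(xs.shape[0]):
    total += xs[i].sum().item()

# Better: keep the accumulation on the GPU and synchronize only once.
total = xs.sum().item()
```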
And this is also really bad because one thing we forgot to mention is starting something on the GPU is also very slow. It takes some time for the CPU to send the code of the kernel, all the inputs and the outputs. That takes a couple of microseconds or even sometimes a millisecond to get started and actually have your GPU start to do the work.
It's maybe worth saying
we're throwing this word
around kernel a lot
which is kind of a funny GPU-specific word.
And basically, the kernel is the small computational program
that you are typically running on the GPU.
And writing these GPU kernels is actually really hard,
because they're highly parallel, and they're
hard to reason about.
And so the programs, in fact, tend to be numerically very
intense, but in terms of lines of code, pretty small.
You're not creating like million line code bases
that are running on the GPU. They're a lot tighter than that.
Yeah. You call those individual small kernels, a kernel to do matmul and then a kernel to do some activation function in a neural net. Each is just one Python line, which is then dispatched on the GPU to be executed in parallel. So the thing that always kills me about this
whole PyTorch story is that if you asked me to design something, I would definitely design something
like TensorFlow or JAX. Just like the basic idea of TensorFlow and JAX is that you're more or less
like hijacking Python as like a metaprogramming system. You kind of write what looks like Python,
but what really you're doing is you're writing in some domain specific language for expressing the
computational graph that represents the program that you're going to run on the GPU. And the
reason I would have wanted to do it that way is because it seems just dramatically easier to make
sure that thing is going to run fast. You can't take every arbitrary Python thing and make it run
fast on the GPU. So you restrict yourself to some DSL where you can guarantee that things are running
fast. And it just seems like the whole thing is going to be much easier to reason about whether
I'm staying inside of the envelope of reasonable, fast programs and all of that.
PyTorch has kind of clearly won.
JAX is new and exciting,
and maybe that will get more mindshare over time.
But like TensorFlow was the big thing,
and then PyTorch has been much more successful.
And it just kind of like frustrates my intuitions
as a person who designs APIs.
Do you have a view as to like,
why is it that PyTorch kind of won
and things like TensorFlow and JAX are more niche? is it that PyTorch kind of won and things like
TensorFlow and JAX
are more niche?
So, yeah, PyTorch won for the flexibility. Like, it allows ML researchers to easily fool around with various ideas, and maybe at first it will be very inefficient, but they want to be able to iterate really fast through their ideas and test quickly if they're going to be worth it or not. And even if the first running version is inefficient, if the idea turns out to be a good idea, then we can spend some time optimizing it and making it as fast as possible. PyTorch kind of represents that model well. You can fool around very easily. And yeah, also, with that model of execution that is asynchronous, you still get the performance. Unless your code triggers some of those hidden CPU-GPU synchronizations,
your code is still performant when you run it from PyTorch.
There is this flexibility, this idea you can easily throw around.
And they did come around having a compile thing,
like PyTorch 2.0 introduced Torch.compile,
which is kind of what people didn't like about TensorFlow.
But they kind of had to implement it in the end.
Modern GPUs are really, really fast.
And that programming model of,
I'm just going to dispatch the operation asynchronously from Python
was starting to lose just because the GPU was so fast
that by the time your CPU had scheduled the kernel,
the GPU was already finished, basically.
And even if you kept telling the CPU,
schedule this kernel, this kernel, this kernel in a row,
it would just not be fast enough for the GPU.
And the idea behind Torch.compile is, again,
to get the whole computational graph
from your model, and then try
to identify in that graph, maybe
there are places where you're doing something that's very inefficient
and we can simplify the instructions.
But more importantly, try to take two
consecutive instructions and fuse
them together on the GPU, so that instead
of launching a lot of small kernels,
on the GPU, you launch one big kernel,
which does a lot of work.
And this is very efficient because, first,
you don't pay the overhead, and the second thing
that's very efficient is that very often,
kernels that are in a row, they read the data
that the previous kernel has already
written. So you have this inefficiency,
I'm going to write something in GPU memory,
and then immediately in the next kernel, oh,
I'm going to read that GPU memory that I just wrote.
And there are some cache systems in the GPU, but still, you have some bit of overhead by doing that.
Whereas in a FUSE kernel, you can just keep that data in registers in the GPU,
and you don't have to move it around if it wasn't necessary.
Right, so you get to skip both the memory transfers and the kernel launch time.
Yeah, kernel launch overhead.
And so they do this, which is kind of a crazy hack, by using another Python DSL,
which is called Triton,
which is kind of a subset of Python
where you can directly write efficient CUDA kernels,
which works well.
If you want to write a fast matrix multiplication
in Triton, it's relatively easy.
And they have some crazy templates, basically,
for all of the operations you can do in a PyTorch model.
And they fuse these templates from the graph
that they extracted during Torch.compile
to create big Triton kernels that are going to execute
big chunks of the model on the GPU at once.
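(A minimal sketch of opting into this with torch.compile; model, loss_fn, optimizer, and dataloader are assumed to exist already.)

```python
import torch

compiled_model = torch.compile(model)   # traces the graph and fuses kernels where it can

for inputs, targets in dataloader:
    loss = loss_fn(compiled_model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```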
Right.
So yeah, maybe we should talk for a second
about the programming language ecosystem around GPUs.
GPUs have a really interesting underlying computational model.
And then there's a big collection
of programming tools for it.
Maybe you can walk us through what some of the major pieces
of this ecosystem are.
Yeah. So if we start at the bottom level, like the equivalent of C for GPU is CUDA,
which is a proprietary language from NVIDIA that they developed to program their GPUs.
AMD supports most of CUDA as well because they kind of have to, they are a bit late in the game
and if they want people to adopt their product, they kind of need to make sure the software is
what people are used to. It's basically C, except you have those kernels
that you write, which are executed in parallel
on the GPU.
And it comes with everything that's kind of a pain in C.
Like, you have to do lots of pointer arithmetic
to make sure that you are looking at the right data.
You have undefined behaviors every time
you're not super careful.
And it's pretty hard to debug.
So it's a very low-level system.
And it also exposes you directly to the performance
characteristics of the GPU.
Or not exactly directly, because it gives you some layer of abstraction, but
you get to see a lot of the underlying details. And I guess one of the things that struck me, as someone
who's mostly used to thinking about performance in the CPU context, is how different the concept
of threads is on a GPU versus a CPU. I wonder if you can say a little bit
of how should someone who's coming to GPUs for the first time think about threads?
So, you will have lots and lots of them for one.
The GPU can launch a million threads pretty easily and execute them all in parallel.
The idea is that you have those blocks that correspond to physical blocks on the hardware,
where like a bunch of threads is executed.
And even like those threads are like they're executed in a group, which is called a warp.
When you write a kernel, each instruction is actually going to be seen at exactly the same time by 32 threads, which together form a warp.
And one block has
a number of warps.
I mean, any number of warps
that you want
that's not too large,
like one block can accommodate
like 1,024 threads maximum.
And then you can launch
several of those blocks
in parallel.
Yes, the idea of that
block layer is that
it's physically
on the GPU chip
at one location.
So you can have like some memory that is shared between those threads,
which is useful.
For instance, if you're doing a matrix multiply,
you're going to load some of the weights into that shared memory
and then use it with those threads to compute something repeatedly
instead of like accessing the same region in global memory several times.
Right. So there's some more expensive, smaller, closer to the thread memory
that sits there to be shared among these threads that are on the same SM, right?
Streaming multiprocessor.
Right, and then maybe the other thing that's perhaps not obvious to someone
who hasn't thought much about GPUs is you also have dedicated registers.
Yeah, up to a certain amount of registers, like 65K for the whole SM.
You can have a program with lots of threads that use fewer registers
or maybe a program that has less threads, but each thread can use more registers.
Right, and the critical difference here between CPUs and GPUs
is that on a CPU, you have a really small number of registers,
and then when there's a thread, there's just one thread running on the CPU
and using all of those registers.
And then when you want a different thread to run,
you have to swap it out in all of the registers and swap the new thread in.
And so you have this fairly large context switch time and context switch times on
GPUs are incredibly small. And so this is part of what enables you to do this kind of massive
multi-threading. You have all of these different threads and the threads are both able to execute
in these warp groups. So they can do stuff in parallel and groups of 32, but also they often
end up being blocked, not typically blocked on I/O because the GPU is not doing I/O, but just blocked on memory, right? They need to do a thing. They need to wait for
memory to be shuffled in. And so you can immediately grab some other group of threads that's running
and get them started. And you can hide a lot of the memory latency by having all of these threads
that are consuming different pieces of memory concurrently. Yeah, that's the job of the SM. So the warp scheduler is going to schedule warps either on a unit that's going to do some float arithmetic, or on units specifically dedicated to matrix multiplies, which can do like a small 16 by 16 matrix multiply for those 32 threads that we just mentioned, or on units that are going to load something from global memory or from shared memory. Any instruction is dispatched on one of those cores, and then immediately after it's finished, another warp is going to take its place. And this way most of the latency is hidden from the user, as long as you can express your program in a way that you always have a warp computing something.
And CUDA gives you
direct explicit
low-level access
to more or less
this computation model
in an unsafe
programming model
which is not especially
clearly documented
and can be hard to figure out and hard to understand.
And when you get it wrong, you just
get weird undefined behavior, and your program breaks
in hard-to-understand ways.
OK, so that's CUDA.
It's great and terrible.
What else is there in the kind of programming language space?
So we mentioned PyTorch and TensorFlow and JAX, which are kind of at the exact other end. So it's something that's in Python, with all the good and the bad of Python, that then is going to either compile the computational graph, on the side of JAX and TensorFlow, or directly send instructions to the GPU, on the side of PyTorch, which is going to dispatch those CUDA kernels that we just talked about. And in the middle there is a flurry of new languages, because as it turns out, researchers, they love to hack and test new ideas, but they also don't love to code in CUDA for some reason.
And in the middle, there are several languages like Triton,
which kind of sit in Python land in the sense that it's a syntax that looks like Python
and you have some subset of Python operations that are supported,
but are actually just DSLs to generate efficient CUDA kernels.
So we mentioned Triton is one of them.
And I guess one thing about Triton is it's in some ways
not quite as general purpose.
It's really good for doing things that kind of vaguely look like matrix multiplies.
Yeah, I mean, in general, modern GPUs are really, really good
at matrix multiplies.
They have those special cores, one of them called tensor cores,
which are really efficient.
And any way you can make your program look like a matrix multiply,
you're going to get way more flops
than if it's just regular floating point operations.
Triton is really good at programming those styles of arrays
and matrix multiplying them, or then reducing them if you want.
If your model computation is slightly different than that, sadly, very often Triton will not compile your code and won't necessarily tell you why in its error message.
It's not always super clear.
And the debugging experience is also not always super nice
because you're not in Python anymore.
Like it's a generated CUDA kernel.
And so you can't really, in the middle of it,
inspect the states of everything.
Or like you can try to print a bit of the stuff,
but it kind of stops there.
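(A rough sketch of what a tiny Triton kernel looks like: a fused elementwise add plus ReLU written in the Python-embedded DSL being discussed. The block size and shapes are arbitrary illustrations.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)  # fused add + ReLU

def add_relu(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)   # one program instance per block of 1024 elements
    add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```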
There's also this weird decision
that the whole machine learning world has made
that we can have all the innovation we want
on the programming language side,
but the syntax always has to be Python.
Yeah, people are used to Python.
So you can try to move them all to another language.
At various times, like Google tried Swift for TensorFlow
to try to get Swift programmers into machine learning
or to move away from Python programmer
to another language that is more efficient,
but that didn't go so well.
It's also that there is a whole ecosystem in Python
with all the rest of the libraries you need
to process your data or inspect your results
and stuff like that.
You can try moving researchers away from what they like,
but very usually they don't really follow you.
So another interesting language in the space,
which I actually don't know a ton about, is Mojo.
I'm kind of curious what your thoughts on that are.
So Mojo, I think, I don't know a lot about it,
so I hope people will excuse me if I make some mistakes,
but it's kind of the same as Triton,
except instead of wanting to be
a new DSL in Python, it's kind of this new language, which looks a bit like Python, but it
is its own language. The support for GPU in Mojo is going to be released in a couple of months,
from what I heard, but it's not there yet. But the idea is that you will be able to write those
efficient CUDA kernels like you do in Triton and in that language Mojo. But since you are not trying
to do a DSL in Python,
like, there is going to be support for debugging or maybe
better error handling, just because you're
writing in a language that was specifically designed for that
instead of trying to add that functionality in Python.
Right.
And I think unlike writing stuff directly in CUDA,
it's a safe language, right?
I think it's got enough type system support
that if you do something crazy, it will actually
try and catch it for you.
The way I understand it, it's a little bit Rust-inspired.
I think it has some of the same Rust-like mechanisms, lifetimes,
and things like that.
And so if it's following that kind of approach,
I would expect them to try and make it actually safe.
Yeah.
And then you have a lot of other projects in kind of the same space, some for GPUs, and for TPUs there are some Google projects that kind of do the same thing of giving you some Python interface to create efficient kernels.
And if you want to write CUDA kernels,
but in Python because you really love Python,
there are some languages like Numba.
You're doing exactly the same thing
as you would do in CUDA, just the syntax is Python.
Got it.
Stepping away from all this panoply of languages out there,
how does this all play into the work you do here?
Researchers working on a model, they've put together
their model in PyTorch.
It's not running as fast as they think it should or they hope it would.
What do you do?
How do you approach these performance questions?
First things first is profiling, multiple times, to identify the bottlenecks.
We talked about both CPU and GPU synchronization points,
which are inefficient.
So a profile will show you that very easily.
And you can track, oh, this instruction created a choke point by synchronizing
GPU and CPU, so let's remove that, or let's try
to find a way to remove it. Some of them are easy
to remove because you can express them in different ways.
Others can be a bit trickier.
For instance, if you want your training to stop because your loss is NaN. So if you have a loss, after computing your loss from your data on your randomly initialized weights, that is very large or NaN, all your gradients are going to be NaN, and then all your model weights are going to be NaN.
So basically your training is finished and completely borked.
So you might as well want to stop and stop wasting GPU hours on it.
So even that tiny thing is kind of difficult
because when you type if loss.isnan() in Python, to be able to know which branch of that if statement the CPU should execute, it needs to know if the loss is NaN or not. So it needs to wait on the GPU to have finished computing
to be able to inspect the value.
You have kind of a synchronization point here
that looks difficult to remove.
One of the solutions is to do that check,
but in another thread,
like launch another thread that's going to do that check,
where the CPU is going to be blocked,
but that's okay because the main thread
will continue executing the model.
And maybe you will do a couple of wasted iterations on the GPU with your weights that will ultimately be NaNs, but that's okay
because your program will be stopped by that other thread.
This is one example of
something that's a bit trickier to remove.
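(A hedged sketch of that trick: do the NaN check in a helper thread, so the blocking .item() call stalls that thread instead of the main training thread. The Event flag and daemon thread are illustrative choices, not the exact implementation discussed.)

```python
import threading
import torch

stop_training = threading.Event()

def check_loss_async(loss):
    def _check(l):
        # .item() blocks *this* thread until the GPU has produced the value.
        if torch.isnan(l).item():
            stop_training.set()
    threading.Thread(target=_check, args=(loss.detach(),), daemon=True).start()

# In the training loop:
#   check_loss_async(loss)
#   if stop_training.is_set():
#       break
```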
The idea is that once you remove all those GPU-CPU
synchronizations, your GPU is fed
as fast as possible. And then the next step
is you can try to compile your model
to access this world of kernel
fusion we talked about just before
to make it even faster.
In the process, you might also want to use
different types for your floating point operations.
Most of the models have been trained
for a long time in float32s,
but we discovered that for deep neural networks,
float16 is actually kind of enough
for the precision in the layers in the middle.
As long as you do your sum, for instance,
like when you do a matrix multiply,
you can have the weights of both matrices being float 16s
and still have a result that's kind of
correct as long as you do that accumulation
of the sum in float 32s. And that has led
to NVIDIA introducing on
these GPUs like very efficient matrix
multiplies for like float 16s.
Now it's like float 8 or even
like FP4 for the new generation of
Blackwell GPUs that are coming soon.
Is the Float 4 a real thing or is that just a joke?
I have no idea.
It's on the slides.
So I do not know.
I'm looking forward to Float 1.
That sounds interesting.
It's either 0 or 1 or something.
I don't even know.
But yeah, without going as deep as that,
like Float 16s are really great
because you can train as much as 2x or 4x faster
depending on the shapes of your program, for free, by doing these mixed precision things where, like, some operations are computed in float 16, some of them are computed in float 32.
Just because you access those tensor cores that are, like, really specialized matrix multiplier units, and they do it really fast if the two matrices are in float 16 or this variant called bfloat16 that was invented at Google, the B standing for brain.
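(A minimal sketch of mixed-precision training with PyTorch's autocast: matrix multiplies run in float16 on the tensor cores while sensitive ops and accumulations stay in float32. The GradScaler matters mainly for float16; with bfloat16 it is often unnecessary. The model, loss_fn, optimizer, and dataloader are assumed.)

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid underflow in fp16 gradients
    scaler.step(optimizer)
    scaler.update()
```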
So how do you think about like the programming language setup
influencing the problem that you have
of helping people build fast models?
Like one thing you might imagine wanting
is having this split between
I am doing the inefficient thing
or I'm doing the efficient thing
be really explicit
so that instead of having to come
to someone who knows a lot about performance,
they could just look at their code and be like, huh, let me press the is it fast button
and be like, oh yeah, it's not fast here and it's not fast here. And then they could move
things around until it was fast. But it sounds like that's kind of not what's going on. What's
going on is everything just kind of looks okay and you can run it and stuff, but it's a little
harder to figure out whether or not you're doing the bad thing. So is there like something to
improve there to make it easier for the end users of the system
to understand when they are doing the slow thing?
It's kind of hard.
And in that regard, PyTorch is actually better than TensorFlow,
for instance, because it lets you explicitly
manage what data is on the GPU and what data is on the CPU.
You choose when you do the transfers,
unless there is an instruction, like if loss.isnan(), as we talked about, that kind of creates a transfer of the data.
TensorFlow, for instance, does not even let you handle what is on the GPU and what is
on CPU.
It's going to take care of everything for you because it has compiled everything, and
it decides for you where is your data and how it moves.
And so sometimes it can also result in inefficient code just because the compiler decided that
this line should be executed on CPU and this line should be executed on GPU, but sadly
the compiler was wrong.
So at least in PyTorch you can fix
things because you get more fine-grained
control into stuff like that.
And an employee who was
in love with Keras, for instance, who joined recently
was like, oh, this thing in PyTorch is
really great. I can choose where my data is and
on which device and move it when
I want it to move and it's not going to move back
unless I ask for it. So there's like two
different dimensions
along which you might want explicit control.
One is about, am I doing a thing that can be put on the GPU
or not?
And the other is, even if I am doing
a thing that could be put on the GPU
or could be put on the CPU, I can explicitly
control where it goes.
And it sounds like it's more explicit on one side, in that it actually just forces everything into the completely understood domain-specific language.
But then the actual execution of that language
has a bunch of compiler magic
that you don't get control over in an explicit way.
And this echoes actually a bunch of stuff
that we're doing on the OCaml compiler side
where we are trying to do a lot of stuff
to try and make OCaml faster,
mostly by giving end users more explicit control
as opposed to making
the compiler magically faster.
Exactly for this reason of, when you're trying to enable performance engineering, the key
to the realm is control.
Which also is the idea behind Accelerate in some way.
Like, the idea was to give back to researchers the full control of their training loop, because they wanted to mess around with it. Then there is this issue of synchronizations making the code bad, which we only see by profiling.
We're trying to do a better job here at Jane Street
at least automatically profiling all the jobs
to identify and let researchers know,
oh, by the way, this particular model
took very, very long to do this particular step.
Are you sure that it's implemented efficiently?
Maybe you should profile it.
And we're trying to give everyone easy ways to profile and look at traces
to kind of identify the bottlenecks.
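(A rough sketch of that kind of profiling with the built-in torch.profiler: capture a handful of training steps and dump a table or a trace to look at. The training objects are assumed to exist.)

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(dataloader):
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= 10:   # a few steps are enough for a trace
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto
```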
When we have done all of that, like sometimes researchers have some ideas that cannot be
expressed into the building blocks that we have. And like if they want to do something that doesn't
have a fast CUDA implementation already packaged in PyTorch, we need to dive deeper into the stack.
So like we mentioned Triton, and then writing in CUDA directly. So yeah, sometimes this is needed just because there is a specific layer researchers invented
and they want to either try it or put it into production and we need to make it as fast
as possible.
Right.
And then there's a couple of other interesting features from CUDA that I've heard you guys
talk about a bunch.
One of them is CUDA graphs and the other is CUDA streams.
Oh, yeah.
How do those fit in?
So CUDA graphs is something that CUDA released and that was used by PyTorch before Torch.compile.
It's been designed explicitly to remove
that kernel launch overhead we talked about earlier,
like when you're trying to launch a lot of small kernels
and you pay that overhead for each of those small launches.
And so CUDA graphs is the technology that lets you play that graph of kernels once, inefficiently, while it records all of the kernel launches.
And the next time you replay that graph,
it's going to remove all of those overheads
because it already knows what it has to dispatch.
It can do that more efficiently.
So that technology is really useful
to remove the overhead of launching
a series of small kernels.
So it gives you a lightweight form of fusion
where you're not really changing any of the operations.
You're just taking all of the scheduling work
and putting that on the hardware
so you never have to go back to the CPU to do it
and you don't do unnecessary copying of memory
back and forth. Exactly.
And you don't get the kernel fusion, which would give you
the additional benefit of avoiding memory transfers.
You're still doing those memory transfers.
If kernel 2 requires something in memory
from kernel 1, you still have
kernel 1 is going to write it, and then kernel 2
is going to read it. This is still there. The only way to
remove that memory
inefficiency is to fuse those two kernels
either by hand or using something like torch.compile.
But you remove all the overheads, which is already nice.
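A rough sketch of that capture-and-replay pattern using PyTorch's CUDA graphs API (simplified from the pattern in the PyTorch documentation; the model and shapes are made up):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).cuda().eval()

# CUDA graphs replay fixed memory addresses, so inputs go through a static
# buffer that you copy into instead of allocating fresh tensors each step.
static_input = torch.randn(64, 512, device="cuda")

# Warm-up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)  # capture: slow once, records all launches

# Replay: one cheap call instead of many individual kernel launches.
static_input.copy_(torch.randn(64, 512, device="cuda"))
graph.replay()
result = static_output.clone()
```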
When you think about fusion
in lots of programming languages,
you can get rid of memory operations,
but you can also sometimes get rid of other operations, right?
You can do other kind of optimizations
across the two kernels.
Where, like, if I know I'm going to do this set of things, maybe there's some things that can be merged together.
So are you also getting that computational benefit?
Sometimes, yeah. If your two kernels
had like some inefficiency in the middle with something you didn't really need,
you can remove that when you fuse them together. Usually like the benefits come more from avoiding
memory transfers. In some instances, maybe some intermediate state wasn't really needed, and you can avoid computing it. Got it. Okay. So that's CUDA graphs. What are
CUDA streams? It's a way of parallelizing stuff with CUDA. When you build these CUDA kernels
and then you ask CUDA to execute them,
it's going to execute them sequentially.
So like kernel 2 is only going to be executed
when kernel 1 is fully finished on the GPU.
And CUDA stream is a way to parallelize that.
If you have two kernels
and you know they can be run in parallel
because they don't touch the same data,
you can launch the two of them in different streams
and they will be executed in parallel,
at least up until a certain limit. You shouldn't use CUDA streams if you want to run 100 things in parallel. NVIDIA told us it's not a good idea, and it's true that CUDA streams do not really perform well at that scale. This API is exposed all the way up to PyTorch. So for instance,
if you want to do some stuff like I'm loading my data and I'm going to put it on the GPU,
and in parallel, I would also like to compute the prediction on my previous batch of data,
which is already on GPU.
I would like to do those two things in parallel
and you can use CUDA streams for that.
Like you have one stream that does a compute
and one stream where you transfer the data
from the CPU to the GPU.
If your model is written well,
like with no synchronization point,
your GPU is fully utilized all the time
without any break.
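A minimal sketch of that overlap with a side stream in PyTorch (illustrative; the model, shapes, and loop are made up, and real data loading is more involved):

```python
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
copy_stream = torch.cuda.Stream()

def step(cpu_batch, gpu_batch):
    # Issue the next host-to-device copy on a side stream while the default
    # stream computes on the batch that is already resident on the GPU.
    with torch.cuda.stream(copy_stream):
        next_gpu = cpu_batch.pin_memory().to("cuda", non_blocking=True)
    with torch.no_grad():
        out = model(gpu_batch)              # runs on the default stream
    # Before using next_gpu, make the default stream wait for the copy.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return out, next_gpu

gpu_batch = torch.randn(64, 512, device="cuda")
for _ in range(5):
    out, gpu_batch = step(torch.randn(64, 512), gpu_batch)
```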
So can I just think of CUDA streams as a coarse-grained threading protocol where each of the threads
themselves has lots of little mini threads on the inside?
Yeah, kind of.
It's more like a hint than a hard requirement.
Like it's hinting to the GPU, you can run those two things in parallel, it's safe.
The GPU might choose not to do it sometimes.
Okay, so a lot of the different optimizations you've talked about here have been very focused on the GPU programming itself
and the kind of connection between the CPU and GPU pieces.
What other parts of the process end up
needing to be thought about when you're
trying to get the maximum performance out
of a training run?
We talked about CPU-GPU transfers and GPU programming. Networking, we talked a little bit about that as well; that's another part that is really important.
If you're training on multiple GPUs,
they need to communicate efficiently
if you don't want to be bottlenecked by that.
I think I've seen us spend a bunch of time on this, thinking not just about making the fabric of the network efficient, but also about organizing the data loading in a way that you're not going to stall out.
I mean, this is kind of a general problem.
The GPUs are these kind of incredibly
productive compute machines,
and so they're very hungry, right?
They want data as fast as possible.
What do you need to do to make sure you can keep them fed?
Yeah, yeah.
Data loading is definitely an important subject,
especially when you have some data that is asymmetrical.
You have examples in your training set
that are really, really long,
and examples in your training set that are really, really short,
and you kind of need to batch them together
to do, like, one iteration of that stochastic gradient descent
we talked about before.
There are lots of ways you can do that.
For instance, you can just decide, I'm going to take
the long and the short together and I'm going to pad
everything. So I'll have a bunch of
zero in my tensors after the short sequence
has finished and until the end of the very long
sequence, which consumes
a lot of memory, so that's not super efficient.
Or you can use some kind of representation of tensors where you concatenate everything together, but you save the offsets at which each thing is, which is a bit of a more efficient memory layout.
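To make the two layouts concrete, here is a small sketch in plain PyTorch, padding versus concatenation with saved offsets (the shapes are arbitrary):

```python
import torch

seqs = [torch.randn(3, 8), torch.randn(11, 8), torch.randn(5, 8)]

# Option 1: pad everything to the longest sequence -- simple but wasteful.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)  # (3, 11, 8)

# Option 2: concatenate and keep offsets -- a denser memory layout.
flat = torch.cat(seqs, dim=0)                        # (3 + 11 + 5, 8)
lengths = torch.tensor([s.shape[0] for s in seqs])
offsets = torch.cumsum(
    torch.cat([torch.zeros(1, dtype=torch.long), lengths]), dim=0
)
# seqs[i] lives at flat[offsets[i] : offsets[i + 1]]
assert torch.equal(flat[offsets[1]:offsets[2]], seqs[1])
```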
But even then, like, when you do distributed training, we're kind of sad, because if you have one GPU
that has to load a very, very long sample
and the other GPUs have, like, shorter samples,
since they need to communicate together,
like, to agree on which gradient is the right gradient,
the GPUs with the very short samples
are going to wait for the GPU with the long samples
for a long time.
So you kind of need to organize your data loading
in a way where you think about your distributed training
so that each GPU has kind of a balanced load of data
so that they all take the same time to load the samples.
And so that at least like when you have the long samples,
everyone has a long sample to load,
otherwise it's pretty inefficient.
But then it might impact your training accuracy
because you're not fully shuffling your data set.
You're kind of doing a pseudo shuffle where
you still group things by size. So it's kind of a
trade-off between like performance and
accuracy by like
removing some degrees of randomness
in your shuffle of the data.
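A toy sketch of that trade-off: group samples of similar length into the same distributed step so that no GPU waits on a straggler, at the cost of a less random shuffle (illustrative only, not an actual data loader; any leftover samples that don't fill a full step are simply dropped here):

```python
import random

def length_balanced_batches(lengths, world_size, batch_size, seed=0):
    """Yield, per step, one batch of indices for each rank such that all
    ranks see samples of roughly similar length at that step."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    step = world_size * batch_size
    groups = [order[i:i + step] for i in range(0, len(order) - step + 1, step)]
    rng.shuffle(groups)          # shuffle across steps, not across lengths
    for group in groups:
        rng.shuffle(group)       # mild shuffle inside the length bucket
        yield [group[r * batch_size:(r + 1) * batch_size]
               for r in range(world_size)]

lengths = [3, 100, 7, 98, 5, 101, 9, 97]
for per_rank in length_balanced_batches(lengths, world_size=2, batch_size=2):
    print(per_rank)  # each rank's batch has comparable sequence lengths
```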
Yeah, one thing that's maybe not
obvious is that a lot of these algorithms
are essentially structured around barrier synchronizations.
You're going to do a bunch of stuff
and then you're going to wait until everyone's done
and then you're going to do a bunch of stuff
and wait till everyone's done.
Barrier synchronizations are like super terrible
if you have a lot of non-determinism
or just non-uniformity in the different pieces
that are going into meeting that barrier.
Because like some people are going to get to the barrier first,
and you're going to wait on the others.
And while you're waiting, you're just not doing anything.
Happily, GPUs are mostly pretty deterministic
in terms of the amount of time it takes to do a given computation.
But you have to feed it a computation of the same shape everywhere
in order to get it to really neatly line up.
And we were talking also before, when you were asking me,
why was PyTorch a winner?
I think one thing that really won people's hearts over to the side of PyTorch is that when you have asymmetrical data and different sizes at different batches, it's way easier to code that in PyTorch, which is more flexible, because compiling that kind of thing is really, really hard.
In TensorFlow or JAX, you kind of need to go to some extreme lengths in your code
to make your data the same shape again, and then send it to your model. Whereas in PyTorch, it's
really easy. You can just batch together small things, and then the next batch can be a very long thing, and PyTorch is still happy because it's eager and not compiled.
Right. I mean, I guess this is always the problem when you go to some simpler, more highly
structured domain-specific languages. There are some things they're good at expressing, and there are some things they're bad at expressing.
When you want to do the thing that's bad at expressing,
you can just be in a world of hurt.
Yeah, exactly.
You know, you've spent a bunch of time in your career
working on various kind of open source training,
machine learning ecosystems.
And you now spend a lot of time internally
working in our world.
I think it's fair to say, like,
we are in various ways more immature
than lots of other organizations. I think at the time where Google was already designing their own
custom hardware for efficiently evaluating neural net models, we weren't really using neural net
models at all. Like, all of this effort on our side has really spun up in the last few years.
We've been doing various kinds of statistically driven inference of trading strategies for as
long as I've been at Jane Street. The first job I had, like 21 years ago, was doing various kinds of optimizations and model
fitting stuff, but very different models and didn't have any of the same performance shape.
And so all of our tooling around this is relatively new. And I'm kind of curious for you as someone
who's like seen stuff in the outside world and seen our ecosystem here, what are the things that
you see as the big gaps? What are the kind of things that don't work as well here
as they should and that you want to see us improve
and that you want to work on?
One nice aspect of the fact that we are newer to this performance engineering stuff is that people are not necessarily aware of the things that are not performant, and make a lot of mistakes when writing their code.
So it's really easy for me to come in
and spend a couple of hours on a project
and be like, oh yeah, it's going to train four times faster, and you just have to change those five lines of code.
So it makes my job very easy in that regard.
Sometimes it's a little bit more difficult than that.
But yeah, there have been a couple of instances
where optimizing a given training
was really, really easy
just by profiling it once because of this.
We should improve our infrastructure
around training loops in general,
like making the training infrastructure work better for researchers
because we're kind of making the same mistakes
as other people already did in the open source world,
like these giant training loops with lots of spaghetti code
that researchers end up not willing to use
because they can't modify what's inside of them.
It feels like sometimes we have that same problem internally as well.
So do you think we need to sort of do
morally the same thing that Accelerate
did of trying to build a set of libraries that are highly
configurable and modular, instead of having like one
training loop that everyone uses,
make it easy for people to build their own training
loops for different use cases?
Yeah, especially
since people here are very smart and really like
to hack things together, it feels like a better solution
for them. The magic training loop where you press play and your model trains has its appeal, which I can understand for people who are less familiar with machine learning. But for people who are deeply familiar with all the internals of machine learning and want to do deep research into every part of a training loop, they need something that's akin to Accelerate, where you just have small, composable building blocks that are very easy to use, and not this giant black box with 150 arguments that you have to pass in the correct order.
Yeah, that's terrible.
I'm talking about other training APIs, not giving a bad time to any engineers. That does not exist internally.
Right.
Certainly, there are pieces of code
that we have, some of which I've written,
that have the property that a bunch of concrete behaviors have been hard-coded into them, and they've become ossified and hard to change. And it's certainly a problem that shows up.
Maybe it's worth just saying like a few words about the way in which the problems we solve
here are different than problems solved in the outside world. And maybe just to talk about like
what role machine learning actually has here. So just say like some very high level things.
We use machine learning for a bunch of stuff. We use it for some kind of general purposes in the
way that any organization might. We have a whole AI assistance team whose job is to try and leverage
various AI techniques for building various kinds of automations. A lot of it focused around
LLMs and coding assistance, but not just that. So that's like one kind of use case. And then we have
a bunch of use cases that are very focused on trading. Even inside of the trading world, I think there's kind of two major streams
of applications. There's, we are going off and trying to extract data from the outside world
in order to inform our trading. But we're using the same kind of data that has already been shown
to be a good target for standard machine learning techniques. So maybe we want to get data out of
images or geospatial data or text data. There are all sorts of published models out there and
published architectures that you can do for this. And we are like happy to leverage and fine tune
and exploit those existing things. And there, the work that we end up doing looks a lot like the
work that people do on machine learning in the outside world. And then in some sense, the magic
is more about how do we pick the data that we're going to apply it to and how do we
integrate that into the decisions we're making on the trading side. And then there are places where
we are applying machine learning techniques to trading data itself, like the data that you get
from exchanges, various alternative sources of data that can inform that. And I'm kind of curious,
like, how do you think of that set of data as being different from the kind of data
that you typically see
in the larger world of machine learning?
Data is much noisier.
So it's way harder to train good models on it
just because the signal you can extract from it
is actually way weaker
than in something very structured
like text or images.
Very often, you will never get
the same kind of accuracies
as what you get on computer vision and on text.
But even extracting a very small amount of signal can still lead to good trading strategies.
So you can still get valuable feedback from that kind of data.
It's maybe worth saying there are fundamental reasons why trading data sets are noisy,
because the fundamental process of trading is one where when there's a signal,
people trade that signal, and that signal kind of gets removed from the data, essentially.
And so to the first order of approximation,
the time series of the prices of a security
look kind of like a random walk.
And there's a little bit of signal in there,
but it really is mostly noise.
And so whatever your training technique is,
it has to be able to deal with the fact
that there's a ton of noise in the labels, essentially.
And so, yeah, so that's one aspect of it,
very noisy, and as you said,
it also changes and reacts to the way you use it.
So, unlike with text, for instance: a model that was released in 2018, if you use it right now, it's still as good as it was in 2018.
The same is definitely not true for a model
that you train on market data.
The kind of strategies that you could run
a couple of years ago won't necessarily work right now
just because the market has reacted to them
and adapted to them.
So you kind of need to come up
with new modeling strategies all the time
and reinvent yourself.
Another aspect of that data
that is different from the rest
of the outside world is that it's huge.
We have massive amounts of market data.
I think like it's a couple of terabytes per day.
So you multiply that by the number of days
or like a couple of years,
and yeah, you have a massive amount of data
to feed your model.
So that brings its own challenges in terms of data loading
and making sure it's efficient and that the GPU gets saturated.
Right, and in practice, the model sizes that we tend to use tend to be smaller than the sizes of the very largest language models.
And so the overall ratios of flops per byte are just very, very different.
And so the things that people are optimizing for
in the designs of the GPUs and the designs of the network
are often not exactly the thing that we're seeing.
We have to do a whole lot of our own research, basically.
We can't just rely on what's been done
for other kind of modalities like text or images.
We need to reinvent new models that are adapted to market data
and new ways of loading that data
and keeping the GPU fed as much as possible.
And sometimes we care about algorithms that are completely different
from the one NVIDIA or PyTorch care about,
because they're not necessarily used in LLMs,
and everyone is all about LLMs these days.
It gives us a good amount of programming to do in terms of GPU performance.
Yeah, I think it's actually an exciting part of the machine learning world here in general,
is that there's just a wide variety of work: coming up with and experimenting with new architectures, new models, and new ways of applying them to data sets where there just aren't a lot of papers telling you how to analyze financial time series, because the people who are good at that mostly don't publish papers.
I wonder why.
Another thing that I think comes up, which is interesting, is just inference times, right? We care about using these things in the context of trading, and the speeds involved there are extreme: sometimes a few hundred microseconds is slow enough,
sometimes milliseconds.
There are some kinds of machine learning problems
where like, oh, getting an inference once an hour
would be great.
That's like all we need, and sometimes even less than that.
So you just have like a very wide variety
of inference times you need.
And at the very low end of the scale,
it's nothing anyone else cares about.
Yeah, that's also why, as you were mentioning before, our models have various sizes.
Some are very small
because we want them to run very, very fast.
But even if they are small,
like there are some challenges to make sure
like that they can run in the timeframe we need
to make sure we are as low latency as possible.
Right, and just to keep up with the data rate.
Yeah, because like there can be a million events
in a single stock one day.
So if you're not fast enough to just process them, it might not be the case that we need any single prediction very fast, but if you want to keep up and not get behind too much, you just need to be a couple of mics per event and not much more than that.
So those are a whole bunch of differences between the kinds of problems we look at
and what you've seen in other places.
How do you think that influences the tooling we build?
Are there ways in which we need to build different kinds of machine learning tools in response
to the ways in which the shape of the problem is different?
Yeah, we talked about data loading, for instance, that comes with its own challenges.
So obviously, we have developed a ton of custom data loading utilities that we can
use to make this faster.
We also talked about models that are not necessarily the same ones as everyone else cares about for
other kind of well-studied modalities
like text or image. So we have a lot of
custom models written internally
that we have found work well, that we
keep trying to optimize for training and
inference. So this is a bunch of exciting
work. The rest of the training is mostly the
same as in any other machine learning
job, I guess. Stochastic gradient descent
has not changed.
It's the same algorithm.
And the same algorithm it was 40 years ago.
Yeah.
So another thing
that you've done a lot of in your career is education, right?
You were a math teacher for a bunch of years.
And then you did a lot of education work,
both at fast.ai and at Hugging Face.
So you're also involved some in the education story here.
Can you tell us a little bit more about what that looks like?
Yeah.
Jane Street is trying to up its machine learning game, both by hiring more people who do machine learning, and also by educating the existing people in machine learning.
And we talked a little bit before about fast.ai and how it was important to make radiologists, for instance, competent at machine learning so that they can do radiology better with machine learning.
It's the same here.
We need to educate traders about machine learning so that they can do better trading using machine learning, and can inform the choice of models that machine learning researchers pick, because they know the data very well and they are domain experts.
We have a boot camp that we run every couple of months with either traders or researchers that are not super familiar with machine learning, and we try to bring them up to speed with the latest techniques, both from the outside world and from inside Jane Street.
You mentioned this point of, in part, wanting to teach people who are not going to be machine learning modelers as their primary job, but who are experts in other aspects of the trading problem, and have them understand more about machine learning.
That's one goal.
I think it's also the case that you can teach people the machine learning stuff in some
ways, like it can't be that hard to learn modern machine learning because in some ways
modern machine learning is 10 years old.
I was saying like make them into domain experts so they can help better, but some of them
like actually end up training models, doing a lot of machine learning themselves.
It depends on whether they like it or not.
Because machine learning is a bit like cooking.
You throw in a bunch of stuff, you let your model training stir for a while, and then you see if it was good or not. It's not the same kind of work as just programming some training system.
Some people like it, some people really dislike it.
Yeah, that makes total sense.
So there's a lot of things you're trying to convey to people
when we're running these classes and these courses.
What are things that you find hard to get across?
The point that's most difficult to get across to people is that, yeah, no one knows anything about machine learning.
Like, it's really just a cooking science.
Like, we still don't know why neural net generalize so well.
We know there's a bit of theory explaining why, like, they are able to train on the training data.
But why are they any good
out of training samples?
We still don't know why they are
so good at generalizing. And in general,
you can try to build a little bit of intuition about what to do to fix a given kind of problem.
Like, I'm overfitting, so I'm going to try
this regularization technique. Maybe that
will help. But there's always that
maybe. It's not until you've tried
that you can know for sure that the thing is going to work.
So, like, this is really hard to convey.
And then, like, try to get people very disciplined
about reproducibility.
Like, one mistake that beginners in machine learning
do all the time is they train a model
and then they forget what they did to train that model.
And so, like, two months later, like,
oh, I did train that model.
It was good.
I should try to retrain it and maybe use it.
And they never managed to reproduce their initial results
just because they didn't write down all of the stuff
that was needed to train that model.
Those two points are really the ones that are difficult to get across, because I guess you can't fully understand them
until you have gone through the pain of the two of them.
You don't understand the importance of reproducibility
until you've gone through your first reproducibility crisis.
Yeah, exactly.
Then you fully understand why it's so important
to save absolutely everything down to the revision of the code
so that you can run the exact same thing at another time.
Yeah, and I think some of the reproducibility stuff
is about getting people on board with the discipline required.
We've talked a lot about technology
that meets the researchers where they are
and makes it easier for them to express what they want.
But there's some part of it,
if you want to be a good researcher,
you actually just need an enormous amount of discipline
because the tools are somewhat imperfect
and you just have to be really careful
to get that reproducibility story right.
And at the same time,
I think there's also a lot of work we can do
on the tool side to make reproducibility easier, right?
I think it's complicated in part because the overall ecosystem is complicated.
Just managing Python packages is shockingly complicated.
And making sure you didn't have an upgrade of a random package that broke everything.
Yeah, that's already difficult.
And then making sure your code is checked out and that you know the revision of the code that you are using.
That's another thing, because you can change a small
line of code in your model and think, oh, this is totally
harmless, but then it actually destroys
the performance of your model because it was a key
ingredient in your cooking recipe and you
hadn't realized that. So yeah, making sure
that your code is still there. And the last
thing is training involves a certain
number of hyperparameters. Usually people
write all of that in some kind of
config. So make sure that you save that config somewhere
so that when you want to reproduce your training,
you actually know that you had used this learning
rate, this batch size, et cetera, et cetera.
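A minimal sketch of that discipline: stamp every run with its hyperparameters, the exact code revision, and the package environment (the paths and config fields are made up for illustration, and it assumes the run happens inside a git checkout):

```python
import json, subprocess, sys, time
from pathlib import Path

config = {"learning_rate": 3e-4, "batch_size": 256, "seed": 42}

run_dir = Path("runs") / time.strftime("%Y%m%d-%H%M%S")
run_dir.mkdir(parents=True, exist_ok=True)

# Record exactly which code and environment produced this model.
config["git_revision"] = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()
config["git_dirty"] = bool(
    subprocess.check_output(["git", "status", "--porcelain"], text=True).strip()
)
config["python"] = sys.version

(run_dir / "config.json").write_text(json.dumps(config, indent=2))
(run_dir / "requirements.txt").write_text(
    subprocess.check_output([sys.executable, "-m", "pip", "freeze"], text=True)
)
```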
I guess another fun source
of non-determinism is to the degree
that you're doing your research in Python notebooks, the fact
that you can run Python cells in Python notebooks
in arbitrary orders.
And if you do this in the wrong way,
you can end up computing a result that you just like have no record
of exactly where that result came from.
Yeah, that's another kind of fun.
Fortunately, notebooks are still a bit difficult
to check in into any kind of repo.
So like usually people move away from notebook
by the time it's time to check the code
in some kind of infrastructure.
So like this issue kind of disappears.
But it's true, like while you are experimenting,
this is another fun source of irreproducibility.
And then you have GPUs being non-deterministic at a fundamental level because they are heavily parallel. So that's always fun, like when you're trying to debug exactly why the floating point result at the end is not the same thing as what you were expecting, just because floating point arithmetic is not associative and GPUs have many threads which may finish in any kind of random order.
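The floating-point point is easy to demonstrate even without a GPU: changing the reduction order typically changes the low-order digits of the result, which is exactly what happens when many GPU threads finish in an unpredictable order.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

a = x.sum()                                   # one reduction order
b = x.flip(0).sum()                           # same numbers, reversed order
c = x.reshape(1000, 1000).sum(dim=0).sum()    # yet another grouping

# The three results typically agree only to about 6-7 significant digits.
print(a.item(), b.item(), c.item())
```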
GPU training is in some sense non-deterministic
because it's in parallel,
but it's also in some sense non-deterministic
just because we can tolerate it.
You could do things highly in parallel
and make sure that you're always doing things
in the same order
and do stuff to conserve the determinism.
It's usually at a huge cost in performance.
But it's at a huge cost in performance.
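For reference, PyTorch does expose knobs for paying that cost and asking for determinism explicitly. A minimal sketch of the usual settings (some operations will simply error out if they have no deterministic implementation):

```python
import os
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Needed for deterministic cuBLAS reductions on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```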
And it doesn't totally matter, right?
That's actually one of the interesting things
about machine learning is because you're doing
this kind of soft numerical optimization process,
you can just take some error.
And actually, a lot of the early research in various places used a kind of hogwild concurrency, where you just had shared model weights
and things checking out and producing new gradients
and updating them.
And were there data races?
Yes, they were.
There were. And like, it races? Yes, they were.
There were.
And it was kind of OK.
But I think over time, that's fallen somewhat into disfavor
because it's just even harder to predict what's going on.
Yeah, it's completely unreproducible.
So you can end up with a model that's really good,
but you have no idea why.
And you'll never be able to reproduce it.
So that's a bit annoying.
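A rough sketch of the Hogwild pattern being described, with several processes updating shared model weights with no locking (simplified from the style of the public PyTorch Hogwild example; the model and data are placeholders):

```python
import torch
import torch.multiprocessing as mp

def worker(model):
    # Each process computes gradients on its own random data and applies them
    # to the *shared* weights with no locking; races are simply tolerated.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    model = torch.nn.Linear(10, 1)
    model.share_memory()   # put the weights in shared memory
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```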
Anyway, this was a lot of fun.
Thanks for joining me.
Thanks for having me.
You'll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com.
Thanks for joining us, and see you next time.