Signals and Threads - The Uncertain Art of Accelerating ML Models with Sylvain Gugger
Episode Date: October 14, 2024

Sylvain Gugger is a former math teacher who fell into machine learning via a MOOC and became an expert in the low-level performance details of neural networks. He's now on the ML infrastructure team... at Jane Street, where he helps traders speed up their models. In this episode, Sylvain and Ron go deep on learning rate schedules; the subtle performance bugs PyTorch lets you write; how to keep a hungry GPU well-fed; and lots more, including the foremost importance of reproducibility in training runs. They also discuss some of the unique challenges of doing ML in the world of trading, like the unusual size and shape of market data and the need to do inference at shockingly low latencies.

You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
- "Practical Deep Learning for Coders," a FastAI MOOC by Jeremy Howard, and the book, of which Sylvain is a co-author.
- The Stanford DAWNBench competition that Sylvain participated in.
- HuggingFace, and the Accelerate library that Sylvain wrote there.
- Some of the languages/systems for expressing ML models that were discussed: PyTorch, TensorFlow, Jax, Mojo, and Triton
- CUDA graphs and streams
- Hogwild concurrency
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky.
It's my pleasure to introduce Sylvain Gugger. Sylvain is a machine learning engineer here at Jane Street, and he's done a bunch of interesting stuff in the outside world as well. He was a core maintainer of Hugging Face's Transformers library. He wrote Hugging Face Accelerate,
which is a nice library from them
that helps you run your models performantly
on a lot of different kinds of hardware.
And he also wrote a lovely book
along with Jeremy Howard
called Deep Learning for Coders
with Fast AI and PyTorch.
So he's done a bunch of interesting stuff
in the outside world.
He's also doing a lot of interesting
machine learning work here at Jane Street.
So thanks for joining me.
Thanks. I'm very honored to be here.
And just to kick things off, I'd love to hear a little bit more about your background. And in
particular, how did you get to work on machine learning in the first place?
So that's a good question. I was originally a math teacher, like 10 years ago, teaching at the first year of university level. And yeah, I moved to the US in 2015. I had kids, so I took some small projects at home and mostly took care of my kids.
In 2017, AI was kind of becoming more mainstream.
I actually read an article in the New York Times about it.
It was going to steal, like, everyone's jobs
in the next two or three years.
That didn't happen, but it's still something
that became more mainstream.
And at the end of the article,
they were mentioning a couple of online courses
for people interested in diving in deeper.
And I was interested, so I dived into it.
So one of the courses mentioned
was the Fast.ai course by Jeremy Howard,
which I followed.
It was very interesting.
And I started commenting a little bit
more and more on the forums
and making a couple of contributions
to the Fast.ai library,
which is used throughout the course
to make training models a little bit faster
and a little bit easier.
And then towards the end of the course,
Jeremy led a FastAI team to this competition
called the DAWNBench competition,
which is the ancestor of the MLPerf benchmark.
It was organized by Stanford,
and the goal was to train a computer vision model
as fast as possible to a given accuracy.
And so we entered the competition
and I helped out the team.
And we were like positioned first for the longest time.
And yeah, at the very end, Google kind of toppled us by publicly releasing TPUs for the
first time.
And yeah, those massive computers that no one else had access to trashed our best entry
and our best time.
So I want to hear more about the competition.
But before that, can you tell me a little bit about like, what is Fast.ai?
What's the basic program there?
What's the mission behind that organization?
So Fast.ai is a non-profit whose goal is to educate people about deep learning.
Especially in those early years, it was starting to become more mainstream,
but not necessarily as mainstream as it is today.
And the idea behind it, which I believe too,
is that to get the best model,
you need good machine learning engineers,
but you also need people
who really understand the data
that those models are going to consume. So if you want
a good model in radiology, you need
very good radiologists to kind of
understand how machine learning works, so that they're
going to be able to help you build those best models.
The Fast AI course is aimed both at coders who want to really dive deep into machine learning, but also at beginners. It's more of an introduction that anyone who is interested can take to learn more about what machine learning is, what those deep learning models are, and what they can do. So the
basic idea is to democratize machine learning so all sorts of domain experts can actually know
enough about it to really leverage it in a meaningful way. Exactly. You said it way better
than I did. Let's get back to the competition. So like the end of the competition is like a great dramatic story
of Google gorilla-stomping on everything
by dropping TPUs at the last moment.
But what were you actually doing
in order to get into the first place
before Google kind of jumped in there?
Yeah, so a couple of things.
The main thing is related to the way
we are training the model
and in particular,
like the learning rate schedule.
So to take a little step back,
when you train those machine learning models,
initially your model is random,
so it's outputting crappy predictions.
But then you compute a loss,
and from that loss, some gradients
that are going to make your model a little bit better
if you adjust the weight following those gradients.
So the whole process is called stochastic gradient descent.
Right, and just to say a higher-level thing about all this,
this is just an example of a more general thing
called function optimization, right?
You have some function that you want to optimize. In this case, the function is
given the set of model weights and the input you want to run on. You want to find the set
of model weights that give you the best and most accurate answer. And we just approach this like
we do in some sense, almost any other kind of optimization problem with techniques that actually
go back 50 years or something of, we're just going to compute a derivative,
and we're going to walk the model weight
in the direction of the derivative
and just do that over and over until we
get to a more optimal result.
Yes, exactly.
Like, the whole process and the whole math behind it
existed for like 50, 60 years.
It's just that with GPUs becoming more and more powerful,
we actually have the compute to apply that process
to complex problems like deep learning.
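(To make the process being described concrete, here is a rough sketch of one stochastic gradient descent step in PyTorch. The names model, loss_fn, and batch are hypothetical stand-ins, not anything from the episode.)

```python
# A minimal sketch of one step of stochastic gradient descent in PyTorch.
import torch

def sgd_step(model, loss_fn, batch, lr=0.01):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)   # forward pass: compute the loss
    loss.backward()                          # backward pass: compute gradients
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                 # walk the weights along the gradient
            p.grad = None                    # clear gradients for the next step
    return loss
```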
So that very important hyperparameter, the learning rate, is kind of the size of the
step we take, like following those gradients.
At the time of the competition, the most popular learning rate schedules were very inefficient,
just training at a low learning rate for a very long time.
And then we divide that low learning rate by 10 and we train for even a longer time.
That did converge to a good accuracy, but
it was very inefficient. And one of the things
we had in our competition entry was
to follow a learning rate schedule
that is more like a warm-up from a low learning rate to a
high learning rate. So not start at a high learning rate because
otherwise the model immediately explodes,
but by warming up, so starting from
something low and gradually increasing it
to the maximum, we can have the model
learn a little bit of something for those
and then have high learning rate for a little bit of time
so that we can explore the loss landscape efficiently
and then decrease it towards the end.
And this kind of schedule made it possible
to train the model to the same accuracy, but way faster.
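(A rough sketch of a warm-up-then-decay schedule of the kind described here: a linear ramp to a peak learning rate followed by a cosine decay. The peak value and step counts are invented for illustration, not the schedule used in the actual competition entry.)

```python
# Warm up the learning rate linearly, then decay it with a cosine curve.
import math

def lr_at(step, total_steps, peak_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```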
Why is that a better schedule?
If you kind of just think about this
without knowing a lot of the details,
the idea that when you're very far from the answer,
you want to take large steps.
And when you get closer to the answer, you want to take smaller steps.
Seems intuitive.
But here you say instead, you start with small steps and then go up to big steps and then go down to small steps.
So what about the structure of the problem makes that the right approach?
I mean, at the beginning of the problem, like since your model is randomly initialized,
in the landscape of your loss function, you are very, very high and you actually have very steep canyons.
So if you take small steps at the beginning, you can at least begin to descend into one of those canyons of the loss function, and then increase that learning rate to dive through there fast. And you will skip over a lot of local minima because your learning rate is large. So towards the end, you need that decrease to step down further into one of those smaller parts of the loss landscape that have those local minima. So is the intuition here that
when you start at a randomly initialized point, the terrain around which you're trying to optimize
is just more wild. And if you take big steps, the derivatives are very high and you're kind of
jumping all over the place. But even with a little bit of optimization away from that initial
randomness, you end up into something that feels like a more regular space. And now you can go back
to what makes more intuitive sense.
Yes.
It depends also, like,
we're talking about just a 3D problem,
but we have millions of dimensions
because your model has millions of parameters.
So it's the idea that, yeah,
on some of those dimensions,
the landscape is very, very spiky.
So at least taking care of that
at the beginning with a low learning rate
is going to make the whole optimization problem easier.
And then you can have larger steps.
Yeah, I do think this is kind of terrible intuition
when one thinks about a problem like this,
of like, I'll try and visualize it in two or three dimensions.
And you're like, you have just lost all of the important structure,
and you really need to think about this high-dimensional problem
to really know what's going on.
That was one of the optimizations.
The other optimization we did was related to it being a computer vision problem.
And so the kind of models we applied to them,
which are called CNNs for Convolutional Neural Networks,
they can work on any size of images
because it's just some kind of filter
that you apply over all of your image.
And so the idea is at the beginning of the model,
like when you train the model, it's random, it's crappy,
so it doesn't really need to see the whole picture.
We kind of gave it a more blurry version of the picture,
like just 128 by 128.
And then gradually as training goes on,
we increase the size of those images
to make them more of the standard size
people doing that problem were using.
And you have that gradual resizing as well
because if at the beginning your image is smaller
and you have your filter to apply all around that image,
it's going to be more efficient
if you have fewer pixels compared to doing the training
with always the high-resolution images.
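(A purely illustrative sketch of gradual resizing in PyTorch, using torchvision transforms: small images in the early epochs, full-size images later. The sizes and epoch boundary are made up.)

```python
# Train on smaller, blurrier images early on, then switch to full resolution.
from torchvision import transforms

def transforms_for_epoch(epoch):
    size = 128 if epoch < 10 else 224   # small images early, standard size later
    return transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
    ])
```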
Right, so there's two neat properties of convolutional neural networks that are coming into play here. The first is that convolutional networks are in general a dimensionality reduction
trick. You could imagine like a big network that's applying to all of the different inputs and all
the different parts of the image. And then you could just have weights that are individual to
all the neurons that are associated with all the different parts of the image. But that's enormously
wasteful because in the early parts of the network, you actually want sort of the
same regular structure over and over. And so the basic idea of a CNN is you kind of lock those
weights together. So you, in some sense, just have one copy of this neuron, which is activated
multiple times in multiple places. But then once you've done that trick, you also have this
resolution independence where you can run it at multiple different resolutions and you're saying,
well, we're just going to like train this thing at low resolution. And then again, after it gets
into the ballpark and we need to more precisely fine-tune it, then we'll increase the resolution
and do the rest of it. And were these essentially kind of new techniques at the time of this puzzle?
Yeah, both of them were new techniques. The gradual resizing is still not that widely used.
The new kind of learning rate schedule, though, everyone uses that now. Like all Transformers models, GPT-3.5, I think GPT-5 now, not for sure because OpenAI doesn't publish its research, but the open source versions of that are trained using that kind of schedule. And since BERT, we have seen that kind of learning rate schedule all the time.
So that's how you got
into Fast.ai
and you got into
this competition space.
How did you end up
being co-author of this book?
So after like collaborating on the Fast.ai library
and participating in the forums and that competition,
Jeremy Howard kindly offered me a job at Fast.ai,
which I accepted.
So I worked there for two years,
built a couple versions of the Fast.ai library
and two iterations of the online course.
And it was very natural going from the course
to publish a reference book
with kind of the same content,
just in a different format for people who prefer to learn from books
instead of YouTube videos.
Got it.
And then what brought you to Hugging Face?
And what is Hugging Face?
So Hugging Face is kind of the GitHub of machine learning.
The idea is that we have a website
that looks kind of like GitHub,
except of having repos with code,
you have repos with model weights.
So like LMA1, LMA2, LMA3, and 3.2, it was released a couple of like GitHub, except of having repos with code, you have repos with model weights. So like LMA1, LMA2, LMA3, and 3.2,
that was released a couple of days ago,
are all on Hugging Face,
along with, I think now is a million public model
for more kind of applications of machine learning,
like computer vision, text, speech, et cetera, et cetera.
And yes, the idea is that they are kind of at the forefront of open source AI by allowing people to share those models. They accompany that with a couple of libraries, because model weights are all very good,
but if you don't have the code to actually instantiate those models,
they're kind of useless.
To complement that, they have libraries like the Transformers library,
which actually contains the code of those models.
And how did you end up at Hugging Face?
In 2020, there was this thing that happened worldwide.
Oh, yeah. I vaguely remember that.
Yeah.
And so that kind of disrupted some plans at Fast AI.
So I looked for another job.
And there was this startup from French people,
which was based in New York City.
And so I knew them from the French tech community
in New York City.
And I'd met them a couple of times before.
They were looking to expand.
So I applied to Hugging Face,
and I joined them randomly in June of 2020.
It was a continuation of my work in open source from Fast AI to democratize machine learning, helping people use the Transformers library or their website with all the public weights on it.
So what kind of technical work
did you end up doing at Hugging Face?
A couple of things.
It was about the maintenance
of the open source libraries
because, yeah, there are people
doing pull requests,
having issues kind of all the time.
So that is already a huge amount of work.
Then I developed like new tutorials and new examples
to help people use those libraries.
And that kind of ended with an online course
that was meant to be taken after the Fast AI course,
like for people who wanted to specialize
a little bit more into transformers.
So there are those two aspects.
And then, yeah, at some point,
like all the researchers at Hugging Face
were kind of annoyed by like our big black box trainer,
which contained like all the stuff of the training loop.
And it becomes with time,
like this huge amount of spaghetti code
because you have like new flags
that appear to kind of control everything
that people want to do with their trainings.
And so I created a new open source library
to make it much more lightweight, to allow people to write their own trainings so that they can have more flexibility.
The idea is that usually those APIs that train models, you have a trainer API and you give it some things like your model and your data and you click the train and it trains and it's marvelous for people who just want that.
But yeah, researchers who wanted to change or tweak the training loop a little bit were struggling a bit more.
So there are various techniques that have been applied for that in the past.
Like in Fast.ai, we had some callback-based systems.
So we had callbacks that the researcher could implement
to change a little bit the behavior of the training loop
at this particular point or another.
The Hugging Face trainer was less extensible.
But for that library called Accelerate,
I went back to, yeah, if the researcher is just going to write their training loop
and there's not going to be like a black box trainer,
and they just need to change like a couple of lines here and there
to make it run on any kind of systems.
At first, it was like six lines, then five lines.
We tried to reduce that number of lines to the absolute minimum
so that there was as little intrusion as possible
and that kind of gave us the Accelerate API.
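(A minimal sketch of what using Accelerate looks like: a plain PyTorch training loop with a couple of lines changed so the same code can run on one GPU, several GPUs, or a TPU. The model, optimizer, dataloader, and loss_fn are assumed to be defined elsewhere.)

```python
from accelerate import Accelerator

accelerator = Accelerator()
# Wrap the usual PyTorch objects; Accelerate adds the right distributed setup.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```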
And when you say you want to make it possible for people to do their training on multiple different kinds of systems, what is the diversity of systems underneath that you're thinking about?
What are the kinds of systems
and different variations on the training
that you are trying to enable with Accelerate?
Training requires a lot of data, usually,
when you train those large language models
or even other kind of models.
And to make it more efficient,
usually you kind of divide and conquer. And if
you have multiple GPUs, you give a slice of the data set to each of your GPUs. And so let's say
you have N GPUs, then your training time should be divided by N at the end of the day, because you fully parallelize the things that you care about just by splitting your data this way. So
this is called data parallelism. And it's kind of the first level of parallelism
we can use
when we have
multiple GPUs
and we want to
run a training on them.
And so you can do that in PyTorch, except it requires some kind of boilerplate code that is a bit annoying. So the idea of Accelerate was to remove that boilerplate code by just having to change a couple of lines in your training loop, and poof,
your model can now run training on multiple GPUs, also on TPUs, because of course, like
the code to run them with the same kind of distributed data parallelism on TPUs is different
from the one of GPUs.
That would be too simple otherwise.
And then once you have done the modification, it still runs on CPU as well.
So like the idea is that it kind of deals with all of that crap for you of detecting
which kind of environment you're on
and then adding the boilerplate code that is needed
for your training to run successfully on all of those kinds of systems.
And then also, if you want to train in a mixed precision setting because you want to use lower precision types, and we can talk about that later, it dealt with the additional code that was required to properly do that, kind of automatically.
Yeah, I mean, I think this whole discussion kind of underlines just the diversity of different
hardware and setups that you can do when you're doing training. There's the kind of, in some sense,
simplest thing of like you can run your training on a CPU, which is a thing that people did for a
long time. And then there are multiple different parallel architectures, GPUs, which are like
literally descendants of graphics programming chips, and TPUs, which are these tensor processors that Google came up with. And the main game here,
going from the CPUs to the GPUs and TPUs, is about parallelism. It turns out CPUs are these
kind of funny machines that have lots of parallel circuits, but they're interpreters for a brutally
sequential programming language, right? And so they're not that good at doing lots of things
in parallel. And in fact, there's all the complexities of like multi-core architectures and
stuff on that side, which is how you try and take advantage of parallelism there. But then GPUs and
TPUs are machines that are much more directly parallel in their structure and built for
large scale, highly regular parallel computations. And then at some point, those things aren't enough
either. And now you start getting to various forms of distributed.
You want so much parallelism, you want multiple GPUs.
And then the first thing you were talking about was this data parallel training,
where what we're doing is we're running this like stochastic gradient descent,
where we're like picking random subsets of data, breaking it up into batches,
and then training on individual GPUs and computing like a net gradient,
which we then use for updating the model.
And then there's also pipeline style parallelism,
which you might need when your model itself is too big to fit.
In fact, not just pipeline parallelism,
but various kinds of model-level parallelism,
where you actually take the model and break it up and split it among multiple GPUs,
because even the model weights themselves are too big to fit there.
And then Accelerate is trying to help you write your model once
and your training loop once
and do a modest amount of modifications
to be able to access this whole sweep of different ways
of doing the training.
Yeah, exactly.
If your model does not fit anymore on one GPU,
you can split it different ways.
You can split the layers.
You can say, if it's a deep learning model, usually those are bigger because you have stacked more layers. You have layer one on GPU one, layer two on GPU two,
layer three on GPU three, et cetera, et cetera,
which is like a good idea because then your model fits.
But then there is this inefficiency in the sense of GPU two has to wait for GPU one
to be finished to be able to process its result
and pass it along to GPU three.
And so that's where pipeline parallelism comes into play
where you're trying to pipeline things efficiently.
So like give a little bit of your data to GPU1,
which is going to send it to GPU2.
And then GPU1 will process the second little bit of data
while GPU2 is busy computing the first part.
And there's this ping pong between a forward
when you run through your model
and the backward path
where you compute all of your gradients.
So you can also efficiently interleave
like some part of the forward
and some part of the backward computation
in that pipeline parallelism.
And then there is like tensor parallelism where instead of splitting your model by layers,
you actually split the weights of your model into chunks, and each GPU only sees one part of the weights. And so then the GPUs need to come together and agree on the results of all the
matrix multiplies that you compute. So this kind of parallelism requires way more, I mean, a very
efficient way to communicate between GPUs
to be accessible.
That's right.
Maybe the other interesting thing
about the hardware around this kind of stuff
is the criticality of the network.
You need these very fast network transfers
to do the tensor exchanges.
And yeah, there are some contexts
where it can be a little less critical
because you can overlap compute and data.
But for some things, like this tensor parallelism,
the GPUs are just going to be sitting idle
while you're waiting.
So we nowadays have these kind of wild new networks which have much, much higher capacity and are very focused on these very low latency and high determinism data transfers.
One of the things I think is interesting about this is the way in which the networking stack has changed, right? I think when I started learning about how do you do high-performance trading systems,
I learned about, well, the operating system kernel is obviously too slow.
So if you want to be reasonably fast, you have to do kernel bypass.
You have to have a user-level networking stack that's doing the communication.
And these systems use a technology called RDMA, remote direct memory access, which I
think an easier way of understanding what's going on here is it's CPU bypass, right?
Basically, network comes in on the NIC
and then without going through any CPU at all,
just gets copied directly to the place in memory
that it needs to go,
maybe directly into the GPU memory.
So you're really cutting away all of the fat
from the bones that you can
to make this stuff go as fast as possible.
Yes, and even in the more recent hardware that NVIDIA announced at the last GTC, you kind of stack your GPUs as close as possible and you try to put as many as you can in a single cabinet. So there are 72 GPUs in the same cabinet,
very close to each other.
So that you can have even faster network between those
because you have like you stack some,
the network is in the middle,
some GPUs above, some GPUs below,
and like they have this big NVLink in the back that links everything together very fast
just because they sit very close together.
Yeah, you start caring an enormous amount about the physical layer at this point.
Today, we can get these NVLink setups where inside of a single box with, say, eight GPUs in it,
you get this fast network.
And yeah, what you're describing is doing this at the cabinet level.
Yes.
Which is funny.
Yeah, I mean, I remember hearing people
talk about like earlier hacks,
not for machine learning,
but for other purposes
where people would like, you know,
basically try and make little supercomputers
where you unroll your PCI Express network
and basically spread it over an entire cabinet.
And in some sense, InfiniBand
sort of grew out of the similar
supercomputer networking fabric.
And indeed, InfiniBand plays a real role
in how these GPU networks work as well.
Okay, that was the stuff you did at Hugging Face.
So more recently, you joined Jane Street.
Tell me a little bit about what your role here entails.
Sure. So at Jane Street, I mostly work on the engineering performance
around machine learning.
The day-to-day life is a researcher
will come to me with a model they've trained. And they're like, oh, my training is going really slowly. Could you help me with that?
And yeah, we'll profile it together, try to identify the bottlenecks and yeah, make it faster.
To take a step back, like most of the researchers here at Jane Street use PyTorch, which is the software
to write neural nets and train models, which has the particularity of being really accessible
because it's eager. The counterparts from Google, TensorFlow, and JAX are more like compiled languages.
So it's kind of harder to get started because you write your model, but then it does not
compile.
And so you need to fix some of the operations that seem like a valid Python operation, but
you need to kind of modify them so that TensorFlow or JAX recognize them and see, oh, this is
what you are trying to do.
Whereas in PyTorch, you can do anything you want,
but then your code can be inefficient in surprising ways
because that particular operation, for instance,
has no implementation on the GPU.
And so the computer needs to transfer data back to the CPU
just to be able to execute it
and then send it back the other way.
And in general, especially on modern GPUs, the way PyTorch works is that when you want to execute a model, the CPU dispatches the operation on the GPU asynchronously, so that the CPU immediately runs to the next instruction. Your hardware is in a good state if your CPU is always ahead of your GPU, so that the GPU has lots of stuff to process. But as soon as your code requires some synchronization
because you need some data back from the GPU to the CPU,
it can become pretty inefficient just because you're kind of stalling the GPU as the CPU will
wait to get the data back. And then it will take time for the CPU to send back some new operations
to execute to the GPU. Right. And it's that waiting where the GPU is waiting on the CPU.
It's slow for a lot of reasons. It's slow because the memory transfers are slow. It's slow because
CPUs are inherently slow.
And then, oh my God,
it's slow because
the code that's running
is written in Python,
which is maybe like
60 times slower
than what the
corresponding thing
written in C
might have looked like.
Exactly.
Even if you don't
care about GPU,
most of your Python code,
you will always try
to have it vectorized.
We're trying to write
as few for loops
in Python as possible
because those will
be very slow.
Whereas if you can execute like an operation from like NumPy, which will be backed by C or C++, it will be much faster.
And it's kind of the same idea for the GPU,
except on top of that, you have that complexity
to avoid synchronization point between the CPU
and the GPU as much as possible.
And notably, when a C programmer says,
oh, I want to make sure this is vectorized,
what they mean is I want to make sure I'm using
like the SSE, AVX, whatever
instructions they're vectorizing that are using
fundamental parallelism technologies baked into the CPU
to be able to do four or eight or whatever computations
in parallel.
And when a Python programmer says vectorize, what they mean
is the inner loop is in C. And maybe it's also
vectorized with AVX or whatever at the bottom.
But the fundamental thing is getting away from the Python interpreter loop.
Exactly.
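(A small illustration of the two meanings of "vectorize" in Python land: the slow version loops in the interpreter, the fast version pushes the loop down into NumPy's C implementation, which may itself use SIMD instructions underneath.)

```python
import numpy as np

x = np.random.rand(1_000_000)

# Slow: a Python-level for loop, one interpreter iteration per element.
total_slow = 0.0
for v in x:
    total_slow += v * v

# Fast: the same reduction expressed as a single whole-array operation.
total_fast = np.dot(x, x)
```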
Sometimes you can have code that looks very innocuous,
but you're actually executing a for loop,
which is going to, at every iteration,
trigger a synchronization between the CPU and the GPU,
which is extremely bad,
because you'll launch a tiny operation on the GPU
and then have to wait for the GPU to finish it
and get back the result to the CPU,
and then launch a new tiny operation on the GPU, et cetera, et cetera.
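(A hedged illustration of that innocuous-looking pattern: every .item() call forces the CPU to wait for the GPU, once per loop iteration. The tensor here is a made-up example.)

```python
import torch

xs = torch.randn(1000, 1000, device="cuda")

# Bad: synchronizes CPU and GPU on every iteration of the Python loop.
total = 0.0
for i in range(xs.shape[0]):
    total += xs[i].sum().item()

# Better: keep the accumulation on the GPU and synchronize only once.
total = xs.sum().item()
```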
And this is also really bad because one thing we forgot to mention is starting something on the GPU is also very slow. It takes some time for the CPU to send the code of the kernel, all the inputs and the outputs. That takes a couple of microseconds or even sometimes a millisecond to get started and actually have your GPU start to do the work.
It's maybe worth saying
we're throwing this word
around kernel a lot
which is kind of a funny GPU-specific word.
And basically, the kernel is the small computational program
that you are typically running on the GPU.
And writing these GPU kernels is actually really hard,
because they're highly parallel, and they're
hard to reason about.
And so the programs, in fact, tend to be numerically very
intense, but in terms of lines of code, pretty small.
You're not creating like million line code bases
that are running on the GPU. They're a lot tighter than that.
Yeah. You call those individual small kernels, a kernel to do matmul and then a kernel to do some activation function in a neural net. Each is just one Python line, which is then dispatched on the GPU to be executed in parallel. So the thing that always kills me about this
whole PyTorch story is that if you asked me to design something, I would definitely design something
like TensorFlow or JAX. Just like the basic idea of TensorFlow and JAX is that you're more or less
like hijacking Python as like a metaprogramming system. You kind of write what looks like Python,
but what really you're doing is you're writing in some domain specific language for expressing the
computational graph that represents the program that you're going to run on the GPU. And the
reason I would have wanted to do it that way is because it seems just dramatically easier to make
sure that thing is going to run fast. You can't take every arbitrary Python thing and make it run
fast on the GPU. So you restrict yourself to some DSL where you can guarantee that things are running
fast. And it just seems like the whole thing is going to be much easier to reason about whether
I'm staying inside of the envelope of reasonable, fast programs and all of that.
PyTorch has kind of clearly won.
JAX is new and exciting,
and maybe that will get more mindshare over time.
But like TensorFlow was the big thing,
and then PyTorch has been much more successful.
And it just kind of like frustrates my intuitions
as a person who designs APIs.
Do you have a view as to like,
why is it that PyTorch kind of won
and things like TensorFlow and JAX are more niche? is it that PyTorch kind of won and things like
TensorFlow and JAX
are more niche?
So, yeah, PyTorch won for the flexibility. Like, it allows ML researchers to easily fool around with various ideas, and maybe at first it will be very inefficient, but they want to be able to iterate really fast through their ideas and test quickly if they're going to be worth it or not. And even if the first running version is inefficient, if the idea turns out to be a good idea, then we can spend some time optimizing it and making it as fast as possible. PyTorch kind of represents that model well. You can fool around very easily. And yeah, also, with that model of execution that is asynchronous, you still get the performance. Unless your code triggers some of those hidden CPU-GPU synchronizations,
your code is still performant when you run it from PyTorch.
There is this flexibility, this idea you can easily throw around.
And they did come around having a compile thing,
like PyTorch 2.0 introduced Torch.compile,
which is kind of what people didn't like about TensorFlow.
But they kind of had to implement it in the end.
Modern GPUs are really, really fast.
And that programming model of,
I'm just going to dispatch the operation asynchronously from Python
was starting to lose just because the GPU was so fast
that by the time your CPU had scheduled the kernel,
the GPU was already finished, basically.
And even if you kept telling the CPU,
schedule this kernel, this kernel, this kernel in a row,
it would just not be fast enough for the GPU.
And the idea behind Torch.compile is, again,
to get the whole computational graph
from your model, and then try
to identify in that graph, maybe
there are places where you're doing something that's very inefficient
and we can simplify the instructions.
But more importantly, try to take two
consecutive instructions and fuse
them together on the GPU, so that instead
of launching a lot of small kernels,
on the GPU, you launch one big kernel,
which does a lot of work.
And this is very efficient because, first,
you don't pay the overhead, and the second thing
that's very efficient is that very often,
kernels that are in a row, they read the data
that the previous kernel has already
written. So you have this inefficiency,
I'm going to write something in GPU memory,
and then immediately in the next kernel, oh,
I'm going to read that GPU memory that I just wrote.
And there are some cache systems in the GPU, but still, you have some bit of overhead by doing that.
Whereas in a FUSE kernel, you can just keep that data in registers in the GPU,
and you don't have to move it around if it wasn't necessary.
Right, so you get to skip both the memory transfers and the kernel launch time.
Yeah, kernel launch overhead.
And so they do this, which is kind of a crazy hack, by using another Python DSL,
which is called Triton,
which is kind of a subset of Python
where you can directly write efficient CUDA kernels,
which works well.
If you want to write a fast matrix multiplication
in Triton, it's relatively easy.
And they have some crazy templates, basically,
for all of the operations you can do in a PyTorch model.
And they fuse these templates from the graph
that they extracted during Torch.compile
to create big Triton kernels that are going to execute
big chunks of the model on the GPU at once.
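(A minimal sketch of opting into this with torch.compile; model, loss_fn, optimizer, and dataloader are assumed to exist already.)

```python
import torch

compiled_model = torch.compile(model)   # traces the graph and fuses kernels where it can

for inputs, targets in dataloader:
    loss = loss_fn(compiled_model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```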
Right.
So yeah, maybe we should talk for a second
about the programming language ecosystem around GPUs.
GPUs have a really interesting underlying computational model.
And then there's a big collection
of programming tools for it.
Maybe you can walk us through what some of the major pieces
of this ecosystem are.
Yeah. So if we start at the bottom level, like the equivalent of C for GPU is CUDA,
which is a proprietary language from NVIDIA that they developed to program their GPUs.
AMD supports most of CUDA as well because they kind of have to, they are a bit late in the game
and if they want people to adopt their product, they kind of need to make sure the software is
what people are used to. It's basically C, except you have those kernels
that you write, which are executed in parallel
on the GPU.
And it comes with everything that's kind of a pain in C.
Like, you have to do lots of pointer arithmetic
to make sure that you are looking at the right data.
You have undefined behaviors every time
you're not super careful.
And it's pretty hard to debug.
So it's a very low-level system.
And it also exposes you directly to the performance
characteristics of the GPU.
Or not exactly directly, because it gives you some layer of abstraction, but
you get to see a lot of the underlying details. And I guess one of the things that struck me, as someone
who's mostly used to thinking about performance in the CPU context, is how different the concept
of threads is on a GPU versus a CPU. I wonder if you can say a little bit
of how should someone who's coming to GPUs for the first time think about threads?
So, you will have lots and lots of them for one.
The GPU can launch a million threads pretty easily and execute them all in parallel.
The idea is that you have those blocks that correspond to physical blocks on the hardware,
where like a bunch of threads is executed.
And even like those threads are like they're executed in a group, which is called a warp.
When you write a kernel, each instruction is actually going to be seen at exactly the same time by 32 threads, which together form a warp.
And one block has
a number of warps.
I mean, any number of warps
that you want
that's not too large,
like one block can accommodate
like 1,024 threads maximum.
And then you can launch
several of those blocks
in parallel.
Yes, the idea of that
block layer is that
it's physically
on the GPU chip
at one location.
So you can have like some memory that is shared between those threads,
which is useful.
For instance, if you're doing a matrix multiply,
you're going to load some of the weights into that shared memory
and then use it with those threads to compute something repeatedly
instead of like accessing the same region in global memory several times.
Right. So there's some more expensive, smaller, closer to the thread memory
that sits there to be shared among these threads that are on the same SM, right?
Streaming multiprocessor.
Right, and then maybe the other thing that's perhaps not obvious to someone
who hasn't thought much about GPUs is you also have dedicated registers.
Yeah, up to a certain amount of registers, like 65K for the whole SM.
You can have a program with lots of threads that use fewer registers
or maybe a program that has less threads, but each thread can use more registers.
Right, and the critical difference here between CPUs and GPUs
is that on a CPU, you have a really small number of registers,
and then when there's a thread, there's just one thread running on the CPU
and using all of those registers.
And then when you want a different thread to run,
you have to swap it out in all of the registers and swap the new thread in.
And so you have this fairly large context switch time and context switch times on
GPUs are incredibly small. And so this is part of what enables you to do this kind of massive
multi-threading. You have all of these different threads and the threads are both able to execute
in these warp groups. So they can do stuff in parallel and groups of 32, but also they often
end up being blocked, not typically blocked on I/O because the GPU is not doing I/O, but just blocked on memory, right? They need to do a thing. They need to wait for
memory to be shuffled in. And so you can immediately grab some other group of threads that's running
and get them started. And you can hide a lot of the memory latency by having all of these threads
that are consuming different pieces of memory concurrently. Yeah, that's the job of the SM. So the warp scheduler is going to schedule warps either on a unit that's going to do some float arithmetic, or on units specifically dedicated to matrix multiplies, which can do like a small 16 by 16 matrix multiply for those 32 threads that we just mentioned, or on units that are going to load something from global memory or from shared memory. Any instruction is dispatched on one of those cores, and then immediately after it's finished, another warp is going to take its place. And this way most of the latency is hidden from the user, as long as you can express your program in a way that you always have a warp computing something.
And CUDA gives you
direct explicit
low-level access
to more or less
this computation model
in an unsafe
programming model
which is not especially
clearly documented
and can be hard to figure out and hard to understand.
And when you get it wrong, you just
get weird undefined behavior, and your program breaks
in hard-to-understand ways.
OK, so that's CUDA.
It's great and terrible.
What else is there in the kind of programming language space?
So we mentioned PyTorch and TensorFlow and JAX, which are kind of at the exact other end. So it's something that's in Python, with all the good and the bad of Python, that then is going to either compile the computational graph, on the side of JAX and TensorFlow, or directly send instructions to the GPU, on the side of PyTorch, which is going to dispatch those CUDA kernels that we just talked about. And in the middle there is a flurry of new languages, because as it turns out, researchers, they love to hack and test new ideas, but they also don't love to code in CUDA for some reason.
And in the middle, there are several languages like Triton,
which kind of sit in Python land in the sense that it's a syntax that looks like Python
and you have some subset of Python operations that are supported,
but are actually just DSLs to generate efficient CUDA kernels.
So we mentioned Triton is one of them.
And I guess one thing about Triton is it's in some ways
not quite as general purpose.
It's really good for doing things that kind of vaguely look like matrix multiplies.
Yeah, I mean, in general, modern GPUs are really, really good
at matrix multiplies.
They have those special cores, one of them called tensor cores,
which are really efficient.
And any way you can make your program look like a matrix multiply,
you're going to get way more flops
than if it's just regular floating point operations.
Triton is really good at programming those styles of arrays
and matrix multiplying them, or then reducing them if you want.
If your model computation is slightly different than that, sadly, very often Triton will not compile your code and won't necessarily tell you why in its error message.
It's not always super clear.
And the debugging experience is also not always super nice
because you're not in Python anymore.
Like it's a generated CUDA kernel.
And so you can't really, in the middle of it,
inspect the states of everything.
Or like you can try to print a bit of the stuff,
but it kind of stops there.
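(A rough sketch of what a tiny Triton kernel looks like: a fused elementwise add plus ReLU written in the Python-embedded DSL being discussed. The block size and shapes are arbitrary illustrations.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)  # fused add + ReLU

def add_relu(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)   # one program instance per block of 1024 elements
    add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```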
There's also this weird decision
that the whole machine learning world has made
that we can have all the innovation we want
on the programming language side,
but the syntax always has to be Python.
Yeah, people are used to Python.
So you can try to move them all to another language.
At various times, like Google tried Swift for TensorFlow
to try to get Swift programmers into machine learning
or to move away from Python programmer
to another language that is more efficient,
but that didn't go so well.
It's also that there is a whole ecosystem in Python
with all the rest of the libraries you need
to process your data or inspect your results
and stuff like that.
You can try moving researchers away from what they like,
but very usually they don't really follow you.
So another interesting language in the space,
which I actually don't know a ton about, is Mojo.
I'm kind of curious what your thoughts on that are.
So Mojo, I think, I don't know a lot about it,
so I hope people will excuse me if I make some mistakes,
but it's kind of the same as Triton,
except instead of wanting to be
a new DSL in Python, it's kind of this new language, which looks a bit like Python, but it
is its own language. The support for GPU in Mojo is going to be released in a couple of months,
from what I heard, but it's not there yet. But the idea is that you will be able to write those
efficient CUDA kernels like you do in Triton and in that language Mojo. But since you are not trying
to do a DSL in Python,
like, there is going to be support for debugging or maybe
better error handling, just because you're
writing in a language that was specifically designed for that
instead of trying to add that functionality in Python.
Right.
And I think unlike writing stuff directly in CUDA,
it's a safe language, right?
I think it's got enough type system support
that if you do something crazy, it will actually
try and catch it for you.
The way I understand it, it's a little bit Rust-inspired.
I think it has some of the same Rust-like mechanisms, lifetimes,
and things like that.
And so if it's following that kind of approach,
I would expect them to try and make it actually safe.
Yeah.
And then you have a lot of other projects in kind of the same space, some for GPUs, and for TPUs there are some Google projects that kind of do the same thing of giving you some Python interface to create efficient kernels.
And if you want to write CUDA kernels,
but in Python because you really love Python,
there are some languages like Numba.
You're doing exactly the same thing
as you would do in CUDA, just the syntax is Python.
Got it.
Stepping away from all this panoply of languages out there,
how does this all play into the work you do here?
Researchers working on a model, they've put together
their model in PyTorch.
It's not running as fast as they think it should or they hope it would.
What do you do?
How do you approach these performance questions?
First things first is profiling, multiple times, to identify the bottlenecks.
We talked about both CPU and GPU synchronization points,
which are inefficient.
So a profile will show you that very easily.
And you can track, oh, this instruction created a choke point by synchronizing
GPU and CPU, so let's remove that, or let's try
to find a way to remove it. Some of them are easy
to remove because you can express them in different ways.
Others can be a bit trickier.
For instance, if you want your training to stop because your loss is NaN. So if you have a loss, after computing your loss from your data on your randomly initialized weights, that is very large or NaN, all your gradients are going to be NaN, and then all your model weights are going to be NaN.
So basically your training is finished and completely borked.
So you might as well want to stop and stop wasting GPU hours on it.
So even that tiny thing is kind of difficult
because when you type if loss.isnan() in Python, to be able to know which branch of that if statement the CPU should execute, it needs to know if the loss is NaN or not. So it needs to wait on the GPU to have finished computing
to be able to inspect the value.
You have kind of a synchronization point here
that looks difficult to remove.
One of the solutions is to do that check,
but in another thread,
like launch another thread that's going to do that check,
where the CPU is going to be blocked,
but that's okay because the main thread
will continue executing the model.
And maybe you will do a couple of wasted iterations on the GPU with your weights that will ultimately be NaNs, but that's okay
because your program will be stopped by that other thread.
This is one example of
something that's a bit trickier to remove.
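(A hedged sketch of that trick: do the NaN check in a helper thread, so the blocking .item() call stalls that thread instead of the main training thread. The Event flag and daemon thread are illustrative choices, not the exact implementation discussed.)

```python
import threading
import torch

stop_training = threading.Event()

def check_loss_async(loss):
    def _check(l):
        # .item() blocks *this* thread until the GPU has produced the value.
        if torch.isnan(l).item():
            stop_training.set()
    threading.Thread(target=_check, args=(loss.detach(),), daemon=True).start()

# In the training loop:
#   check_loss_async(loss)
#   if stop_training.is_set():
#       break
```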
The idea is that once you remove all those GPU-CPU
synchronizations, your GPU is fed
as fast as possible. And then the next step
is you can try to compile your model
to access this world of kernel
fusion we talked about just before
to make it even faster.
In the process, you might also want to use
different types for your floating point operations.
Most of the models have been trained
for a long time in float32s,
but we discovered that for deep neural networks,
float16 is actually kind of enough
for the precision in the layers in the middle.
As long as you do your sum, for instance,
like when you do a matrix multiply,
you can have the weights of both matrices being float 16s
and still have a result that's kind of
correct as long as you do that accumulation
of the sum in float 32s. And that has led
to NVIDIA introducing on
these GPUs like very efficient matrix
multiplies for like float 16s.
Now it's like float 8 or even
like FP4 for the new generation of
Blackwell GPUs that are coming soon.
Is the Float 4 a real thing or is that just a joke?
I have no idea.
It's on the slides.
So I do not know.
I'm looking forward to Float 1.
That sounds interesting.
It's either 0 or 1 or something.
I don't even know.
But yeah, without going as deep as that,
like Float 16s are really great
because you can train as much as 2x or 4x faster
depending on the shapes of your program, for free, by doing these mixed precision things where, like, some operations are computed in float 16, some of them are computed in float 32.
Just because you access those tensor cores that are, like, really specialized matrix multiplier units, and they do it really fast if the two matrices are in float 16 or this variant called bfloat16 that was invented at Google, the B standing for brain.
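(A minimal sketch of mixed-precision training with PyTorch's autocast: matrix multiplies run in float16 on the tensor cores while sensitive ops and accumulations stay in float32. The GradScaler matters mainly for float16; with bfloat16 it is often unnecessary. The model, loss_fn, optimizer, and dataloader are assumed.)

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid underflow in fp16 gradients
    scaler.step(optimizer)
    scaler.update()
```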
So how do you think about like the programming language setup
influencing the problem that you have
of helping people build fast models?
Like one thing you might imagine wanting
is having this split between
I am doing the inefficient thing
or I'm doing the efficient thing
be really explicit
so that instead of having to come
to someone who knows a lot about performance,
they could just look at their code and be like, huh, let me press the is it fast button
and be like, oh yeah, it's not fast here and it's not fast here. And then they could move
things around until it was fast. But it sounds like that's kind of not what's going on. What's
going on is everything just kind of looks okay and you can run it and stuff, but it's a little
harder to figure out whether or not you're doing the bad thing. So is there like something to
improve there to make it easier for the end users of the system
to understand when they are doing the slow thing?
It's kind of hard.
And in that regard, PyTorch is actually better than TensorFlow,
for instance, because it lets you explicitly
manage what data is on the GPU and what data is on the CPU.
You choose when you do the transfers,
unless there is an instruction, like if loss.isnan(), as we talked about, that kind of creates a transfer of the data.
TensorFlow, for instance, does not even let you handle what is on the GPU and what is
on CPU.
It's going to take care of everything for you because it has compiled everything, and
it decides for you where is your data and how it moves.
And so sometimes it can also result in inefficient code just because the compiler decided that
this line should be executed on CPU and this line should be executed on GPU, but sadly
the compiler was wrong.
So at least in PyTorch you can fix
things because you get more fine-grained
control into stuff like that.
And an employee who was
in love with Keras, for instance, who joined recently
was like, oh, this thing in PyTorch is
really great. I can choose where my data is and
on which device and move it when
I want it to move and it's not going to move back
unless I ask for it. So there's like two
different dimensions
along which you might want explicit control.
One is about, am I doing a thing that can be put on the GPU
or not?
And the other is, even if I am doing
a thing that could be put on the GPU
or could be put on the CPU, I can explicitly
control where it goes.
And it sounds like it's more explicit on one side, in that it actually just forces everything into the completely understood domain-specific language.
But then the actual execution of that language
has a bunch of compiler magic
that you don't get control over in an explicit way.
And this echoes actually a bunch of stuff
that we're doing on the OCaml compiler side
where we are trying to do a lot of stuff
to try and make OCaml faster,
mostly by giving end users more explicit control
as opposed to making
the compiler magically faster.
Exactly for this reason of, when you're trying to enable performance engineering, the key
to the realm is control.
Which also is the idea behind Accelerate in some way.
Like, the idea was to give back to researchers the full control of their training loop, because they wanted to mess around with it. Then there is this issue of synchronizations making the code bad, which we only see by profiling.
We're trying to do a better job here at Jane Street
at least automatically profiling all the jobs
to identify and let researchers know,
oh, by the way, this particular model
took very, very long to do this particular step.
Are you sure that it's implemented efficiently?
Maybe you should profile it.
And we're trying to give everyone easy ways to profile and look at traces
to kind of identify the bottlenecks.
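(A rough sketch of that kind of profiling with the built-in torch.profiler: capture a handful of training steps and dump a table or a trace to look at. The training objects are assumed to exist.)

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(dataloader):
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= 10:   # a few steps are enough for a trace
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto
```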
When we have done all of that, like sometimes researchers have some ideas that cannot be
expressed into the building blocks that we have. And like if they want to do something that doesn't
have a fast CUDA implementation already packaged in PyTorch, we need to dive deeper into the stack.
So like we mentioned Triton, and then writing in CUDA directly. So yeah, sometimes this is needed just because there is a specific layer researchers invented
and they want to either try it or put it into production and we need to make it as fast
as possible.
Right.
And then there's a couple of other interesting features from CUDA that I've heard you guys
talk about a bunch.
One of them is CUDA graphs and the other is CUDA streams.
Oh, yeah.
How do those fit in?
So CUDA graphs is something that CUDA released and that was used by PyTorch before Torch.compile.
It's been designed explicitly to remove
that kernel launch overhead we talked about earlier,
like when you're trying to launch a lot of small kernels
and you pay that overhead for each of those small launches.
And so CUDA graphs is the technology that lets you play that graph of kernels once, inefficiently, while it records all of the kernel launches.
And the next time you replay that graph,
it's going to remove all of those overheads
because it already knows what it has to dispatch.
It can do that more efficiently.
So that technology is really useful
to remove the overhead of launching
a series of small kernels.
So it gives you a lightweight form of fusion
where you're not really changing any of the operations.
You're just taking all of the scheduling work
and putting that on the hardware
so you never have to go back to the CPU to do it
and you don't do unnecessary copying of memory
back and forth. Exactly.
And you don't get the kernel fusion, which would give you
the additional benefit of avoiding memory transfers.
You're still doing those memory transfers.
If kernel 2 requires something in memory
from kernel 1, you still have
kernel 1 is going to write it, and then kernel 2
is going to read it. This is still there. The only way to
remove that memory
inefficiency is to fuse those two kernels
either by hand or using something like torch.compile.
But you remove all the overheads, which is already nice.
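A rough sketch of that capture-and-replay pattern using PyTorch's CUDA graphs API (simplified from the pattern in the PyTorch documentation; the model and shapes are made up):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).cuda().eval()

# CUDA graphs replay fixed memory addresses, so inputs go through a static
# buffer that you copy into instead of allocating fresh tensors each step.
static_input = torch.randn(64, 512, device="cuda")

# Warm-up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)  # capture: slow once, records all launches

# Replay: one cheap call instead of many individual kernel launches.
static_input.copy_(torch.randn(64, 512, device="cuda"))
graph.replay()
result = static_output.clone()
```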
When you think about fusion
in lots of programming languages,
you can get rid of memory operations,
but you can also sometimes get rid of other operations, right?
You can do other kind of optimizations
across the two kernels.
Where, like, if I know I'm going to do this set of things, maybe there's some things that can be merged together.
So are you also getting that computational benefit?
Sometimes, yeah. If your two kernels
had like some inefficiency in the middle with something you didn't really need,
you can remove that when you fuse them together. Usually like the benefits come more from avoiding
memory transfers. In some instances, maybe some intermediate state wasn't really needed, and you can avoid computing it. Got it. Okay. So that's CUDA graphs. What are
CUDA streams? It's a way of parallelizing stuff with CUDA. When you build these CUDA kernels
and then you ask CUDA to execute them,
it's going to execute them sequentially.
So like kernel 2 is only going to be executed
when kernel 1 is fully finished on the GPU.
And CUDA stream is a way to parallelize that.
If you have two kernels
and you know they can be run in parallel
because they don't touch the same data,
you can launch the two of them in different streams
and they will be executed in parallel,
at least up until a certain limit. You shouldn't use CUDA streams if you want to run 100 things in parallel. NVIDIA told us it's not a good idea, and it's true that CUDA streams do not really perform well at that scale. This API is exposed all the way up to PyTorch. So for instance,
if you want to do some stuff like I'm loading my data and I'm going to put it on the GPU,
and in parallel, I would also like to compute the prediction on my previous batch of data,
which is already on GPU.
I would like to do those two things in parallel
and you can use CUDA streams for that.
Like you have one stream that does a compute
and one stream where you transfer the data
from the CPU to the GPU.
If your model is written well,
like with no synchronization point,
your GPU is fully utilized all the time
without any break.
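A minimal sketch of that overlap with a side stream in PyTorch (illustrative; the model, shapes, and loop are made up, and real data loading is more involved):

```python
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
copy_stream = torch.cuda.Stream()

def step(cpu_batch, gpu_batch):
    # Issue the next host-to-device copy on a side stream while the default
    # stream computes on the batch that is already resident on the GPU.
    with torch.cuda.stream(copy_stream):
        next_gpu = cpu_batch.pin_memory().to("cuda", non_blocking=True)
    with torch.no_grad():
        out = model(gpu_batch)              # runs on the default stream
    # Before using next_gpu, make the default stream wait for the copy.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return out, next_gpu

gpu_batch = torch.randn(64, 512, device="cuda")
for _ in range(5):
    out, gpu_batch = step(torch.randn(64, 512), gpu_batch)
```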
So can I just think of CUDA streams as a coarse-grained threading protocol where each of the threads
themselves has lots of little mini threads on the inside?
Yeah, kind of.
It's more like a hint than a hard requirement.
Like it's hinting to the GPU, you can run those two things in parallel, it's safe.
The GPU might choose not to do it sometimes.
Okay, so a lot of the different optimizations you've talked about here have been very focused on the GPU programming itself
and the kind of connection between the CPU and GPU pieces.
What other parts of the process end up
needing to be thought about when you're
trying to get the maximum performance out
of a training run?
We talked about CPU-GPU transfers and GPU programming. Networking, we talked a little bit about that as well; that's another part that is really important.
If you're training on multiple GPUs,
they need to communicate efficiently
if you don't want to be bottlenecked by that.
I think I've seen us spend a bunch of time on this, thinking not just about making the fabric of the network efficient, but also about organizing the data loading in a way that you're not going to stall out.
I mean, this is kind of a general problem.
The GPUs are these kind of incredibly
productive compute machines,
and so they're very hungry, right?
They want data as fast as possible.
What do you need to do to make sure you can keep them fed?
Yeah, yeah.
Data loading is definitely an important subject,
especially when you have some data that is asymmetrical.
You have examples in your training set
that are really, really long,
and examples in your training set that are really, really short,
and you kind of need to batch them together
to do, like, one iteration of that stochastic gradient descent
we talked about before.
There are lots of ways you can do that.
For instance, you can just decide, I'm going to take
the long and the short together and I'm going to pad
everything. So I'll have a bunch of
zero in my tensors after the short sequence
has finished and until the end of the very long
sequence, which consumes
a lot of memory, so that's not super efficient.
Or you can use some kind of representation of tensors where you concatenate everything together, but you save the offsets at which each thing is, which is a bit of a more efficient memory layout.
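To make the two layouts concrete, here is a small sketch in plain PyTorch, padding versus concatenation with saved offsets (the shapes are arbitrary):

```python
import torch

seqs = [torch.randn(3, 8), torch.randn(11, 8), torch.randn(5, 8)]

# Option 1: pad everything to the longest sequence -- simple but wasteful.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)  # (3, 11, 8)

# Option 2: concatenate and keep offsets -- a denser memory layout.
flat = torch.cat(seqs, dim=0)                        # (3 + 11 + 5, 8)
lengths = torch.tensor([s.shape[0] for s in seqs])
offsets = torch.cumsum(
    torch.cat([torch.zeros(1, dtype=torch.long), lengths]), dim=0
)
# seqs[i] lives at flat[offsets[i] : offsets[i + 1]]
assert torch.equal(flat[offsets[1]:offsets[2]], seqs[1])
```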
But even then, like, when you do distributed training, we're kind of sad, because if you have one GPU
that has to load a very, very long sample
and the other GPUs have, like, shorter samples,
since they need to communicate together,
like, to agree on which gradient is the right gradient,
the GPUs with the very short samples
are going to wait for the GPU with the long samples
for a long time.
So you kind of need to organize your data loading
in a way where you think about your distributed training
so that each GPU has kind of a balanced load of data
so that they all take the same time to load the samples.
And so that at least like when you have the long samples,
everyone has a long sample to load,
otherwise it's pretty inefficient.
But then it might impact your training accuracy
because you're not fully shuffling your data set.
You're kind of doing a pseudo shuffle where
you still group things by size. So it's kind of a
trade-off between like performance and
accuracy by like
removing some degrees of randomness
in your shuffle of the data.
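A toy sketch of that trade-off: group samples of similar length into the same distributed step so that no GPU waits on a straggler, at the cost of a less random shuffle (illustrative only, not an actual data loader; any leftover samples that don't fill a full step are simply dropped here):

```python
import random

def length_balanced_batches(lengths, world_size, batch_size, seed=0):
    """Yield, per step, one batch of indices for each rank such that all
    ranks see samples of roughly similar length at that step."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    step = world_size * batch_size
    groups = [order[i:i + step] for i in range(0, len(order) - step + 1, step)]
    rng.shuffle(groups)          # shuffle across steps, not across lengths
    for group in groups:
        rng.shuffle(group)       # mild shuffle inside the length bucket
        yield [group[r * batch_size:(r + 1) * batch_size]
               for r in range(world_size)]

lengths = [3, 100, 7, 98, 5, 101, 9, 97]
for per_rank in length_balanced_batches(lengths, world_size=2, batch_size=2):
    print(per_rank)  # each rank's batch has comparable sequence lengths
```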
Yeah, one thing that's maybe not
obvious is that a lot of these algorithms
are essentially structured around barrier synchronizations.
You're going to do a bunch of stuff
and then you're going to wait until everyone's done
and then you're going to do a bunch of stuff
and wait till everyone's done.
Barrier synchronizations are like super terrible
if you have a lot of non-determinism
or just non-uniformity in the different pieces
that are going into meeting that barrier.
Because like some people are going to get to the barrier first,
and you're going to wait on the others.
And while you're waiting, you're just not doing anything.
Happily, GPUs are mostly pretty deterministic
in terms of the amount of time it takes to do a given computation.
But you have to feed it a computation of the same shape everywhere
in order to get it to really neatly line up.
And we were talking also before, when you were asking me,
why was PyTorch a winner?
I think one thing that really won people's hearts over to the side of PyTorch is that when you have asymmetrical data and different sizes at different batches, it's way easier to code that in PyTorch, which is more flexible, because compiling that kind of thing is really, really hard.
In TensorFlow or JAX, you kind of need to go to some extreme lengths in your code
to make your data the same shape again, and then send it to your model. Whereas in PyTorch, it's
really easy. You can just batch together small things, and then the next batch can be a very long thing, and PyTorch is still happy because it's eager and not compiled.
Right. I mean, I guess this is always the problem when you go to some simpler, more highly
structured domain-specific languages. There are some things they're good at expressing, and there are some things they're bad at expressing.
When you want to do the thing that's bad at expressing,
you can just be in a world of hurt.
Yeah, exactly.
You know, you've spent a bunch of time in your career
working on various kind of open source training,
machine learning ecosystems.
And you now spend a lot of time internally
working in our world.
I think it's fair to say, like,
we are in various ways more immature
than lots of other organizations. I think at the time where Google was already designing their own
custom hardware for efficiently evaluating neural net models, we weren't really using neural net
models at all. Like, all of this effort on our side has really spun up in the last few years.
We've been doing various kinds of statistically driven inference of trading strategies for as
long as I've been at Jane Street. The first job I had, like 21 years ago, was doing various kinds of optimizations and model
fitting stuff, but very different models and didn't have any of the same performance shape.
And so all of our tooling around this is relatively new. And I'm kind of curious for you as someone
who's like seen stuff in the outside world and seen our ecosystem here, what are the things that
you see as the big gaps? What are the kind of things that don't work as well here
as they should and that you want to see us improve
and that you want to work on?
One nice aspect of the fact that we are newer to this performance engineering stuff is that people are not necessarily aware of the things that are not performant, and make a lot of mistakes when writing their code.
So it's really easy for me to come in
and spend a couple of hours on a project
and be like, oh yeah, it's going to train four times faster, and you just have to change those five lines of code.
So it makes my job very easy in that regard.
Sometimes it's a little bit more difficult than that.
But yeah, there have been a couple of instances
where optimizing a given training
was really, really easy
just by profiling it once because of this.
We should improve our infrastructure
around training loops in general,
like making the training infrastructure work better for researchers
because we're kind of making the same mistakes
as other people already did in the open source world,
like these giant training loops with lots of spaghetti code
that researchers end up not willing to use
because they can't modify what's inside of them.
It feels like sometimes we have that same problem internally as well.
So do you think we need to sort of do
morally the same thing that Accelerate
did of trying to build a set of libraries that are highly
configurable and modular, instead of having like one
training loop that everyone uses,
make it easy for people to build their own training
loops for different use cases?
Yeah, especially
since people here are very smart and really like
to hack things together, it feels like a better solution
for them. The magic training loop where you press play and your model trains has its appeal, which I can understand for people who are less familiar with machine learning. But for people who are deeply familiar with all the internals of machine learning and want to do deep research into every part of a training loop, they need something that's akin to Accelerate, where you just have small, composable building blocks that are very easy to use, and not this giant black box with 150 arguments that you have to pass in the correct order.
Yeah, that's terrible.
I'm talking about other training APIs, not giving a bad time to any engineers. That does not exist internally.
Right.
Certainly, there are pieces of code
that we have, some of which I've written,
that have the property that a bunch of concrete behaviors have been hard-coded into them, and they've become ossified and hard to change. And it's certainly a problem that shows up.
Maybe it's worth just saying like a few words about the way in which the problems we solve
here are different than problems solved in the outside world. And maybe just to talk about like
what role machine learning actually has here. So just say like some very high level things.
We use machine learning for a bunch of stuff. We use it for some kind of general purposes in the
way that any organization might. We have a whole AI assistance team whose job is to try and leverage
various AI techniques for building various kinds of automations. A lot of it focused around
LLMs and coding assistance, but not just that. So that's like one kind of use case. And then we have
a bunch of use cases that are very focused on trading. Even inside of the trading world, I think there's kind of two major streams
of applications. There's, we are going off and trying to extract data from the outside world
in order to inform our trading. But we're using the same kind of data that has already been shown
to be a good target for standard machine learning techniques. So maybe we want to get data out of
images or geospatial data or text data. There are all sorts of published models out there and
published architectures that you can do for this. And we are like happy to leverage and fine tune
and exploit those existing things. And there, the work that we end up doing looks a lot like the
work that people do on machine learning in the outside world. And then in some sense, the magic
is more about how do we pick the data that we're going to apply it to and how do we
integrate that into the decisions we're making on the trading side. And then there are places where
we are applying machine learning techniques to trading data itself, like the data that you get
from exchanges, various alternative sources of data that can inform that. And I'm kind of curious,
like, how do you think of that set of data as being different from the kind of data
that you typically see
in the larger world of machine learning?
Data is much noisier.
So it's way harder to train good models on it
just because the signal you can extract from it
is actually way weaker
than in something very structured
like text or images.
Very often, you will never get
the same kind of accuracies
as what you get on computer vision and on text.
But even extracting a very small amount of signal can still lead to good trading strategies.
So you can still get valuable feedback from that kind of data.
It's maybe worth saying there are fundamental reasons why trading data sets are noisy,
because the fundamental process of trading is one where when there's a signal,
people trade that signal, and that signal kind of gets removed from the data, essentially.
And so to the first order of approximation,
the time series of the prices of a security
look kind of like a random walk.
And there's a little bit of signal in there,
but it really is mostly noise.
And so whatever your training technique is,
it has to be able to deal with the fact
that there's a ton of noise in the labels, essentially.
And so, yeah, so that's one aspect of it,
very noisy, and as you said,
it also changes and reacts to the way you use it.
So, unlike with text, for instance: a model that was released in 2018, if you use it right now, it's still as good as it was in 2018.
The same is definitely not true for a model
that you train on market data.
The kind of strategies that you could run
a couple of years ago won't necessarily work right now
just because the market has reacted to them
and adapted to them.
So you kind of need to come up
with new modeling strategies all the time
and reinvent yourself.
Another aspect of that data
that is different from the rest
of the outside world is that it's huge.
We have massive amounts of market data.
I think like it's a couple of terabytes per day.
So you multiply that by the number of days
or like a couple of years,
and yeah, you have a massive amount of data
to feed your model.
So that brings its own challenges in terms of data loading
and making sure it's efficient and that the GPU gets saturated.
Right, and in practice, the model sizes that we tend to use tend to be smaller than the sizes of the very largest language models.
And so the overall ratios of flops per byte are just very, very different.
And so the things that people are optimizing for
in the designs of the GPUs and the designs of the network
are often not exactly the thing that we're seeing.
We have to do a whole lot of our own research, basically.
We can't just rely on what's been done
for other kind of modalities like text or images.
We need to reinvent new models that are adapted to market data
and new ways of loading that data
and keeping the GPU fed as much as possible.
And sometimes we care about algorithms that are completely different
from the one NVIDIA or PyTorch care about,
because they're not necessarily used in LLMs,
and everyone is all about LLMs these days.
It gives us a good amount of programming to do in terms of GPU performance.
Yeah, I think it's actually an exciting part of the machine learning world here in general,
is that there's just a wide variety of work: coming up with and experimenting with new architectures, new models, and new ways of applying them to data sets where there just aren't a lot of papers telling you how to analyze financial time series, because the people who are good at that mostly don't publish papers.
I wonder why.
Another thing that I think comes up, which is interesting, is just inference times, right? We care about using these things in the context of trading, and the speeds involved there are extreme: sometimes a few hundred microseconds is slow enough,
sometimes milliseconds.
There are some kinds of machine learning problems
where like, oh, getting an inference once an hour
would be great.
That's like all we need, and sometimes even less than that.
So you just have like a very wide variety
of inference times you need.
And at the very low end of the scale,
it's nothing anyone else cares about.
Yeah, that's also why, as you were mentioning before, our models have various sizes.
Some are very small
because we want them to run very, very fast.
But even if they are small,
like there are some challenges to make sure
like that they can run in the timeframe we need
to make sure we are as low latency as possible.
Right, and just to keep up with the data rate.
Yeah, because like there can be a million events
in a single stock one day.
So if you're not fast enough to just process them, it might not be the case that we need any single prediction very fast, but if you want to keep up and not get behind too much, you just need to be a couple of mics per event and not much more than that.
So those are a whole bunch of differences between the kinds of problems we look at
and what you've seen in other places.
How do you think that influences the tooling we build?
Are there ways in which we need to build different kinds of machine learning tools in response
to the ways in which the shape of the problem is different?
Yeah, we talked about data loading, for instance, that comes with its own challenges.
So obviously, we have developed a ton of custom data loading utilities that we can
use to make this faster.
We also talked about models that are not necessarily the same ones as everyone else cares about for
other kind of well-studied modalities
like text or image. So we have a lot of
custom models written internally
that we have found work well, that we
keep trying to optimize for training and
inference. So this is a bunch of exciting
work. The rest of the training is mostly the
same as in any other machine learning
job, I guess. Stochastic gradient descent
has not changed.
It's the same algorithm.
And the same algorithm it was 40 years ago.
Yeah.
So another thing
that you've done a lot of in your career is education, right?
You were a math teacher for a bunch of years.
And then you did a lot of education work,
both at fast.ai and at Hugging Face.
So you're also involved some in the education story here.
Can you tell us a little bit more about what that looks like?
Yeah.
Jane Street is trying to up its machine learning game, both by hiring more people who do machine learning, and also by educating the existing people in machine learning.
And we talked a little bit before about fast.ai and how it was important to make radiologists, for instance, competent at machine learning so that they can do radiology better with machine learning.
It's the same here.
We need to educate traders about machine learning so that they can do better trading using machine learning, and can inform the choice of models that machine learning researchers pick, because they know the data very well and they are domain experts.
We have a boot camp that we run every couple of months with either traders or researchers that are not super familiar with machine learning, and we try to bring them up to speed with the latest techniques, both from the outside world and from inside Jane Street.
You mentioned this point of, in part, wanting to teach people who are not going to be machine learning modelers as their primary job, but who are experts in other aspects of the trading problem, and have them understand more about machine learning.
That's one goal.
I think it's also the case that you can teach people the machine learning stuff in some
ways, like it can't be that hard to learn modern machine learning because in some ways
modern machine learning is 10 years old.
I was saying like make them into domain experts so they can help better, but some of them
like actually end up training models, doing a lot of machine learning themselves.
It depends on whether they like it or not.
Because machine learning is a bit like cooking.
You throw in a bunch of stuff, you let your model training stir for a while, and then you see if it was good or not. It's not the same kind of work as just programming some training system.
Some people like it, some people really dislike it.
Yeah, that makes total sense.
So there's a lot of things you're trying to convey to people
when we're running these classes and these courses.
What are things that you find hard to get across?
The point that's most difficult to get across to people is that, yeah, no one knows anything about machine learning.
Like, it's really just a cooking science.
Like, we still don't know why neural net generalize so well.
We know there's a bit of theory explaining why, like, they are able to train on the training data.
But why are they any good
out of training samples?
We still don't know why they are
so good at generalizing. And in general,
you can try to build a little bit of intuition about what to do to fix a given kind of problem.
Like, I'm overfitting, so I'm going to try
this regularization technique. Maybe that
will help. But there's always that
maybe. It's not until you've tried
that you can know for sure that the thing is going to work.
So, like, this is really hard to convey.
And then, like, try to get people very disciplined
about reproducibility.
Like, one mistake that beginners in machine learning
do all the time is they train a model
and then they forget what they did to train that model.
And so, like, two months later, like,
oh, I did train that model.
It was good.
I should try to retrain it and maybe use it.
And they never managed to reproduce their initial results
just because they didn't write down all of the stuff
that was needed to train that model.
Those two points are really the ones that are difficult to get across, because I guess you can't fully understand them
until you have gone through the pain of the two of them.
You don't understand the importance of reproducibility
until you've gone through your first reproducibility crisis.
Yeah, exactly.
Then you fully understand why it's so important
to save absolutely everything down to the revision of the code
so that you can run the exact same thing at another time.
Yeah, and I think some of the reproducibility stuff
is about getting people on board with the discipline required.
We've talked a lot about technology
that meets the researchers where they are
and makes it easier for them to express what they want.
But there's some part of it,
if you want to be a good researcher,
you actually just need an enormous amount of discipline
because the tools are somewhat imperfect
and you just have to be really careful
to get that reproducibility story right.
And at the same time,
I think there's also a lot of work we can do
on the tool side to make reproducibility easier, right?
I think it's complicated in part because the overall ecosystem is complicated.
Just managing Python packages is shockingly complicated.
And making sure you didn't have an upgrade of a random package that broke everything.
Yeah, that's already difficult.
And then making sure your code is checked out and that you know the revision of the code that you are using.
That's another thing, because you can change a small
line of code in your model and think, oh, this is totally
harmless, but then it actually destroys
the performance of your model because it was a key
ingredient in your cooking recipe and you
hadn't realized that. So yeah, making sure
that your code is still there. And the last
thing is training involves a certain
number of hyperparameters. Usually people
write all of that in some kind of
config. So make sure that you save that config somewhere
so that when you want to reproduce your training,
you actually know that you had used this learning
rate, this batch size, et cetera, et cetera.
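A minimal sketch of that discipline: stamp every run with its hyperparameters, the exact code revision, and the package environment (the paths and config fields are made up for illustration, and it assumes the run happens inside a git checkout):

```python
import json, subprocess, sys, time
from pathlib import Path

config = {"learning_rate": 3e-4, "batch_size": 256, "seed": 42}

run_dir = Path("runs") / time.strftime("%Y%m%d-%H%M%S")
run_dir.mkdir(parents=True, exist_ok=True)

# Record exactly which code and environment produced this model.
config["git_revision"] = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()
config["git_dirty"] = bool(
    subprocess.check_output(["git", "status", "--porcelain"], text=True).strip()
)
config["python"] = sys.version

(run_dir / "config.json").write_text(json.dumps(config, indent=2))
(run_dir / "requirements.txt").write_text(
    subprocess.check_output([sys.executable, "-m", "pip", "freeze"], text=True)
)
```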
I guess another fun source
of non-determinism is to the degree
that you're doing your research in Python notebooks, the fact
that you can run Python cells in Python notebooks
in arbitrary orders.
And if you do this in the wrong way,
you can end up computing a result that you just like have no record
of exactly where that result came from.
Yeah, that's another kind of fun.
Fortunately, notebooks are still a bit difficult
to check in into any kind of repo.
So like usually people move away from notebook
by the time it's time to check the code
in some kind of infrastructure.
So like this issue kind of disappears.
But it's true, like while you are experimenting,
this is another fun source of irreproducibility.
And then you have GPUs being non-deterministic at a fundamental level because they are heavily parallel. So that's always fun, like when you're trying to debug exactly why the floating point result at the end is not the same thing as what you were expecting, just because floating point arithmetic is not associative and GPUs have many threads which may finish in any kind of random order.
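The floating-point point is easy to demonstrate even without a GPU: changing the reduction order typically changes the low-order digits of the result, which is exactly what happens when many GPU threads finish in an unpredictable order.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

a = x.sum()                                   # one reduction order
b = x.flip(0).sum()                           # same numbers, reversed order
c = x.reshape(1000, 1000).sum(dim=0).sum()    # yet another grouping

# The three results typically agree only to about 6-7 significant digits.
print(a.item(), b.item(), c.item())
```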
GPU training is in some sense non-deterministic
because it's in parallel,
but it's also in some sense non-deterministic
just because we can tolerate it.
You could do things highly in parallel
and make sure that you're always doing things
in the same order
and do stuff to conserve the determinism.
It's usually at a huge cost in performance.
But it's at a huge cost in performance.
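For reference, PyTorch does expose knobs for paying that cost and asking for determinism explicitly. A minimal sketch of the usual settings (some operations will simply error out if they have no deterministic implementation):

```python
import os
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Needed for deterministic cuBLAS reductions on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```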
And it doesn't totally matter, right?
That's actually one of the interesting things
about machine learning is because you're doing
this kind of soft numerical optimization process,
you can just take some error.
And actually, a lot of the early research in various places used a kind of hogwild concurrency, where you just had shared model weights
and things checking out and producing new gradients
and updating them.
And were there data races?
Yes, they were.
There were. And like, it races? Yes, they were.
There were.
And it was kind of OK.
But I think over time, that's fallen somewhat into disfavor
because it's just even harder to predict what's going on.
Yeah, it's completely unreproducible.
So you can end up with a model that's really good,
but you have no idea why.
And you'll never be able to reproduce it.
So that's a bit annoying.
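A rough sketch of the Hogwild pattern being described, with several processes updating shared model weights with no locking (simplified from the style of the public PyTorch Hogwild example; the model and data are placeholders):

```python
import torch
import torch.multiprocessing as mp

def worker(model):
    # Each process computes gradients on its own random data and applies them
    # to the *shared* weights with no locking; races are simply tolerated.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    model = torch.nn.Linear(10, 1)
    model.share_memory()   # put the weights in shared memory
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```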
Anyway, this was a lot of fun.
Thanks for joining me.
Thanks for having me.
You'll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com.
Thanks for joining us, and see you next time.