Y Combinator Startup Podcast - Anthropic Head of Pretraining on Scaling Laws, Compute, and the Future of AI

Starting point is 00:00:05 Hey guys, I'm thrilled to be joined today by Nick Joseph, the head of pre-training at Anthropic. To give viewers a high-level sense of what we'll be covering, we're going to start with the basics of what pre-training is and then dig into how Nick thinks about strategy, data, alignment, and infrastructure at Anthropic. And by the end, you'll hopefully have a sense for how progress in AI comes directly from advances in pre-training. I would love to talk a little bit about your backstory and kind of how you got to this point. Where did you work before Anthropic and what were your takeaways from those places? Yeah, so let's see, I was at Vicarious and then at OpenAI before Anthropic. So Vicarious was originally an AGI lab.

Starting point is 00:00:36 And sort of, when I joined, they were sort of making a shift to product, particularly working on robotics products. And the thing I worked on was, like, training computer vision models for their robotics products. It was my first job. So I think I just learned a ton about, like, how to do machine learning models, how to, like, write machine learning infrastructure.

Starting point is 00:00:53 And at the time, were you also thinking about a career as an academic? Like, at the time, a lot of people doing AI work were in PhDs. That's kind of what I was thinking about before I started to do a company. Like, how were you thinking about that in your headspace? Yeah.

Starting point is 00:01:03 So, like, and actually, to rewind a little bit, I think a lot of my thinking on this had come from an internship I did at GiveWell, which is like a nonprofit that evaluates charities. And some people there being like, ah, at some point you might have AGI, it could be dangerous, we should worry about these risks. This could be like a big impact on humanity. And I was like not super convinced at the time and went down the economics route and was going to try to work on like directly helping people in poverty. That didn't work out for various reasons and I ended up being like, okay, I'll at least work on AI. Either like the safety thing will turn out to be important, I'll work on that or it won't be

Starting point is 00:01:32 and I'll just make cool things with AI that can probably help people in poverty more. I wasn't really coming at it from an academic standpoint. I was sort of like, in fact, when I switched to that, it was part of the appeal was that I could like immediately go do stuff in AI, whereas if I want to work in like economic policy, I'd have to wait, I don't know, six years through a PhD and then start and like, it's a longer path. And what did the state of AI safety work at that time even look like? Who are the people who were thinking about that kind of stuff?

Starting point is 00:01:58 I mean, there were some folks at Vicarious thinking about this kind of thing, but it was fundamentally a robotics company. And so, yeah, how were you thinking about that at the time? Yeah, so my sense was like at the time, a lot of the AI safety discussion was kind of theoretical. Like the models weren't actually that good. They weren't really posing these dangers. So it was a lot more like philosophical.

Starting point is 00:02:14 It's like, oh, at some point we might get AI that's really smarter than humans and like, should we weight this like future concern? How should we compare that to near-term things? And I think that was like actually just a less compelling argument. I think it was like an interesting one and like sort of made you think of it. So next you went to OpenAI. What was OpenAI like at this time? Yeah, so I was on one of the safety teams

Starting point is 00:02:35 and kind of worked on, and then working on code models, actually. Cool, nice. When I got there, the first thing I saw was, oh, they'd fine-tuned GPT-3 to write some code. And it was really good. And I was like, oh, okay, if you're worried about AI getting really powerful, writing its own code, that seems like it could self-improve and how likely is that to happen.

Starting point is 00:02:56 So I was doing a bunch of evaluations and studies of what contributed. And then after like eight months, basically everyone I worked with, like all the safety leads left, who, yeah, invited me to go to Anthropic. And that was sort of the reason I joined OpenAI was because I cared about AI safety and wanted to work with them. So then I went with them to join Anthropic pretty much right when it started. With that, why don't we transition a bit?

Starting point is 00:03:18 These days you run the pre-training team specifically at Anthropic. Obviously, you've been working on pre-training at Anthropic for quite a bit of time. And I'm sure it's evolved over the years what that even entails and looks like. Why don't we start by just talking a little bit about what pre-training is? Like how does it even fit into the way of thinking about how AI models have developed at a place like Anthropic? And what do you guys do? We know that one of the ingredients to making AI models better is scale. You want to put a lot of compute in.

Starting point is 00:03:42 And if you sort of step back and you're like, okay, what's the way we could put the most compute into a model possible? We need some objective that there's just like tons of data for. And one idea here is like the internet. The internet is massive. It's probably the biggest like single source of data that humanity has created. And you don't have labels. You don't want someone to have to go in and read the entire internet and say something about it.

Starting point is 00:04:03 So you want to get labels out of the data itself. And the idea here is we can take some text and we can predict the next word. So you take "the" as the first word, you predict the second word, then you say "the cat", you predict the word after that. And this means you get very dense signal. Every word is like a new example. And there's a huge amount of data.
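As a rough illustration of "getting labels out of the data itself", the sketch below turns one sentence into (context, next word) training pairs. The whitespace split is an assumption made for brevity; it stands in for a real tokenizer and is not how any production model tokenizes text.

```python
def next_token_examples(tokens):
    """Turn a token sequence into (context, target) training pairs.

    Every position after the first yields one example, which is why
    the training signal is so dense: no human labeling is needed.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


text = "the cat sat on the mat"
tokens = text.split()  # assumption: toy whitespace "tokenizer" for illustration
examples = next_token_examples(tokens)

for context, target in examples:
    print(context, "->", target)
```

A six-word sentence yields five training examples, one per predicted word.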

Starting point is 00:04:21 And one of the findings from, like, GPT-1, GPT-2, was kind of as you throw more compute at this, more data, bigger models, you get better, you get smarter models, essentially. Totally. And that's kind of been the central thesis of pre-training for the whole time. There's this idea of scaling laws,

Starting point is 00:04:37 which is that you can actually quantify, like as you put in more compute, more data, more parameters, you get models in a very, you get a lower loss, a better prediction of the next word, in a very predictable way. And I think you can somewhat foresee

Starting point is 00:04:49 from that original paper, and I think like Dario did foresee this, I think many people did, but what wasn't obvious was that once you have that, there's this positive feedback loop where you can train a model, you can use it

Starting point is 00:04:59 to make something useful and sell that and get more money, use that to buy more compute, and then use that to train a better model. And we've sort of run that cycle over and over and over the past five years or so. Well, in thinking about that objective to begin, you know, I think the way I think about the state of pre-training is, yeah, it seems like this next word prediction, at least from the external standpoint, seems to be the dominant way pre-training happens. But if I rewind the clock to that era of 2017 to 2020 or 2021 or 2022 even, there were all sorts of pre-training objectives people were considering, right? There were these BERT and BART models that were doing masked language modeling.

Starting point is 00:05:31 It seems like this GPT series of models doing autoregressive modeling, as you described, this next word prediction, seems to be the dominant one that won out. Do you have any reflections on that time period? Were you guys trying all of them and kind of this one worked? Or is there some first principles reason why this is the right one that should have worked? I think the answer is like, it's mostly empirical.

Starting point is 00:05:51 In terms of how to think of these things, I'd be like, yeah, it's empirical. Just try them all, see what works. One big advantage for this autoregressive setup is that you can just sample from it to generate text afterwards in a fairly, like, straightforward way. Like it enables a product use very nicely. Like one thing that you want is like, just one characteristic you want from a setup is like a loss,

Starting point is 00:06:08 where as you drive down the loss, that actually is the thing you care about. And you can think of it as like, if you got to perfect on language modeling, you now can like write text as a human. You can sort of imagine you put in the title of the paper and it should spit out the entire, spit out a novel paper. Whereas I think some of the other approaches don't quite have that flavor. Yeah, totally. Yeah.
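The point about being able to "just sample" from an autoregressive model can be sketched with a toy stand-in. The bigram counter below is an assumption chosen only because it fits in a few lines; it is nothing like a transformer, but the generation loop has the same shape: predict a distribution over the next word, sample one, append it, repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each other word (toy 'model')."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def sample(counts, start, max_new_tokens, rng):
    """Autoregressive loop: sample the next word given the last one, then repeat."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:  # no observed continuation, so stop early
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


corpus = "the cat sat on the mat and the cat ran".split()  # hypothetical toy corpus
model = train_bigram(corpus)
generated = sample(model, "the", 8, random.Random(0))
print(" ".join(generated))
```

The same loop works for any model that maps a prefix to next-token probabilities, which is what makes the objective convenient for open-ended text generation.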

Starting point is 00:06:28 In terms of that loop you're describing of, you know, then release something that gets you revenue and you can use that to buy more compute and iterate. This sort of gives you the most natural way to actually do that flow because you can keep releasing new products and keep getting the revenue from that to invest in more compute and so on. Yeah, it certainly gives you the most open-ended thing. You could imagine, you know, you like train something as a class, like you train some base thing, you fine-tune it for a bunch of particular tasks. One approach people would use, they would like do this big pre-training and then they wouldn't just like open-endedly sample from it. You'd fine-tune it on like 100 specific tasks. And that could work too.

Starting point is 00:06:56 I think that one sort of general intuition I have is like compute is the thing that matters. So like I think if you throw enough compute at any of these objectives, you're going to get something that's probably pretty good. And can kind of be fine-tuned to other things. And it's surprising how little these details matter compared to throwing more compute at the problem. When you think about actually throwing more compute at the problem, there's a whole bunch of axes by which you could throw compute at it too. Right. And if you have a specific model architecture you're training over, you can basically throw more data at that specific architecture. For a particular one, you could add more layers or make the models larger in it.

Starting point is 00:07:29 You could do some kind of neural architecture search over lots of different variants. And I assume that these days it's somewhat more figured out which architecture you go for. I assume the earlier days it was somewhat less so. And I'm curious if you could speak to how you guys thought about that. Like what did your infrastructure even look like to do that type of determination? I mean, I think the short answer is it's hard, right? What you're really doing is you're going to train this one big expensive model and you have a space of, you know, you can sort of call all these things hyperparameters,

Starting point is 00:07:54 you know, how many layers do you have, what's your width, like, there's a space of hundreds of hyperparameters, and you want them all to be optimal. And you're sort of striking this balance, actually, between how much do they matter? Like, can you just take your best guess and throw more compute at it in whatever way you want, and it basically doesn't matter? Or how much do you need to get it precisely correct? Yeah, interesting. And I think one of the, like, interesting things is, like, it actually doesn't matter that much. Like, I think this was in one of the early scaling laws papers. Like, you can change these things and get little wins, but, like, as you throw more compute, it sort of reliably gets better. But if you mess up enough, you will stop seeing that happen, and you won't have any way to know,

Starting point is 00:08:29 which is one of the, that's like kind of the hardest part in some ways. You don't know the counterfactual, basically, because you didn't run it for long enough to actually know what it is. Yeah, we have these scaling laws. So you can sort of say, like, as you train them with more compute, you expect the loss to go down as a power law. It's really a power law plus constant. So what eventually will happen is you'll curve off that power law, and then you know

Starting point is 00:08:45 something is wrong. And is it fundamental? Is it like you've hit the limits of scaling? Or is it, nope, you should have changed, you should have tweaked your learning rate slightly differently. And that's sort of one of the challenges. In terms of how to like figure it out, you can, the usual paradigm is like test things out at small scale before running them at large scale and try to find things. Small

Starting point is 00:09:02 scale in terms of data or in terms of something else? In terms of everything. Like you kind of want to scale things down like proportionally. So you want to say like you want to have some theory for like how you're going to scale up. Like, ah, okay, if I get 10 times as many FLOPs, how much of it goes into layers, how much of it goes into data, how much of it goes into attention? And you sort of get that theory and then test that it's optimal, but you sort of test that a bunch with, like, scaling everything down proportionally. And just so I can think about what this actually looks like, in those early days of Anthropic, you're a team of like 10 or something like that in those very early days or 12 maybe,

Starting point is 00:09:35 what actually is your ability to use large-scale infrastructure as like a relatively nimble startup at that time? I mean, a startup that was well capitalized, but still not actually that many people working at, what kind of infrastructure did you have access to to train these early models at the time? So it's actually one of the wild things was that at least, I mean, you don't know what anyone else is doing, of course, but it kind of felt like we were at the frontier of it, and there just weren't that many people who cared.
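The "power law plus constant" shape described a little earlier can be written as loss(C) = a * C^(-b) + c, where C is training compute. The sketch below evaluates that curve and flags a run that has drifted above it. The coefficients and the tolerance are hypothetical values picked for illustration; they are not numbers from the conversation or from any published fit.

```python
def predicted_loss(compute, a, b, c):
    """Power law plus constant: loss falls as compute grows, flattening toward c."""
    return a * compute ** (-b) + c


def off_trend(compute, measured_loss, a, b, c, tolerance=0.02):
    """True if the measured loss sits more than `tolerance` (relative) above the fit.

    A run that curves off the fitted law is the signal that something may be
    wrong, though the curve alone cannot say whether the cause is fundamental
    or, say, a poorly tuned learning rate.
    """
    expected = predicted_loss(compute, a, b, c)
    return (measured_loss - expected) / expected > tolerance


A, B, C_FLOOR = 10.0, 0.1, 1.5  # hypothetical fit coefficients

for compute in [1e18, 1e19, 1e20]:
    print(f"{compute:.0e} FLOPs -> predicted loss {predicted_loss(compute, A, B, C_FLOOR):.3f}")

print(off_trend(1e20, 1.70, A, B, C_FLOOR))  # well above the fit
print(off_trend(1e20, 1.61, A, B, C_FLOOR))  # within tolerance
```

In practice the coefficients would be fitted on small-scale runs and then extrapolated, which is the "test at small scale first" paradigm the conversation describes.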

Starting point is 00:09:58 Like, I was sort of coming, you know, I was coming out from like, we're making AI, this is the most important technology ever. And then we kind of look around and be like, and it seems like I'm one of 30 people who are working on this in the world. I mean, I was kind of like junior person, everyone else sort of knew how to do this

Starting point is 00:10:11 and had done it before, but I was kind of surprised at how easy it was. Like the public estimates for GPT-3, I remember, were that it cost $5 million to train, which you're like, on the one hand, five million, it's kind of a lot, but it's like a lot for an individual person. It's not really a lot from like a company perspective. So we could totally buy like compute that was enough

Starting point is 00:10:32 to train models like that, you could. Were you using a cloud provider or did you have a custom setup somewhere? Did you literally have racks in a room somewhere that you bought a bunch of Nvidia GPUs and you were doing it? We're using a cloud provider, but I think it's kind of, it's not actually that different because one of the things that was surprising to me is you actually have to understand the literal layout.

Starting point is 00:10:50 Like, I remember at one point, one of my coworkers running a clustering algorithm to identify what rooms all the chips were in, since we had a hypothesis that they were in different rooms, and that was causing like, or different buildings, some sort of network latency. And you can kind of figure it out, you could like reverse engineer,

Starting point is 00:11:06 like, okay, yeah, there's clearly like two clusters here that are connected better, and there's some issue on the connection between them. Like, we're trying to push the limits of the hardware like as much as possible, particularly at the beginning when we were kind of like, we have way less funding than everyone else. We have to, and most people,

Starting point is 00:11:20 weren't very efficient with the compute. So we were like, ah, we could get a big lead by being really efficient at how we use the compute. Could you talk a little bit about some of the things you guys did in those early days for how to get the most out of the hardware? I think that's really interesting. I think back to the early days of Google, for example, where there's these cases where they basically bought relatively cheap consumer chips

Starting point is 00:11:38 and then they optimized the software to make it so you can actually get the most bang for your buck out of them, and that's how they had all this high latency, or low latency, high availability stuff. I'm kind of curious if there's some analog in the early AI era to that. I think for us it was largely about getting the distributed framework right. So like we're training on, in order to train something large, you have to train them on a large number of chips.

Starting point is 00:11:57 And there's a bunch of different approaches to how to do this. There's like data parallelism and there's pipelining, there's sharding, and like getting all of this. And at the time there were no great open source packages you could just grab and use that just worked for this. I mean, today there's somewhat more of these, but at the time I assumed it was literally none. There were some.
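In the simplest form of data parallelism, each worker computes gradients on its own slice of the batch, and the gradients are then summed across workers so every replica applies the same update. That summing step is an all-reduce. The sketch below simulates a ring all-reduce in a single process as an illustration of the general technique; it is not Anthropic's implementation, and it assumes one scalar chunk per worker to keep the indexing readable.

```python
def ring_all_reduce(worker_grads):
    """Simulate a ring all-reduce (sum) over n workers in one process.

    Assumption for brevity: each worker's gradient has exactly n entries,
    so chunk i is simply entry i. Real systems chunk large tensors and
    send them over the network; here "send" is a list update.
    """
    n = len(worker_grads)
    data = [list(g) for g in worker_grads]
    assert all(len(g) == n for g in data), "sketch assumes len(gradient) == n"

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the completed chunks around the ring so
    # every worker ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data


grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # hypothetical gradients
reduced = ring_all_reduce(grads)
averaged = [[x / len(grads) for x in row] for row in reduced]
print(averaged[0])
```

Each worker only ever talks to its neighbor in the ring, which is why this pattern is sensitive to how the chips are physically connected.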

Starting point is 00:12:13 Like I actually remember that we were kind of doing data parallelism early on and it was like, and now we write the all-reduce. It was like, we really do this ourselves? We don't, like, call a package? And this was kind of like, well, we're going to want to modify it. Right. Like, oh, like, we don't want to outsource this to some package because, A, we're about to go to a bigger scale, like, PyTorch, for instance.

Starting point is 00:12:31 They had a package for doing this. But we were going to go to a bigger scale than Facebook had been to. And you don't want to have a dependency on a package that you're going to have to be constantly modifying, essentially. It's such a counterintuitive sentence there, too, like we're going to a bigger scale than Facebook. Well, because at the time, Facebook AI research was considered one of the best places to do machine learning research. Like FAIR was one of the places. FAIR and DeepMind were hiring lots of people

Starting point is 00:12:54 out of top PhD programs and doing lots of things. Like what was your headspace when you were like, okay, this very established lab with great people and whatnot, we are operating on a scale that is not relevant to them. Like was that natural and obvious to you or was there times where you kind of doubted the decisions you were making in that situation? I think it was surprising. I will, maybe I'm just too arrogant or something. I kind of looked around and was like, what are these people doing? They're all missing the like big picture here. Like I think the scaling laws were pretty clear. And the arguments against, I just thought were kind of nonsensical. Like, you know, the scale, I think the original scaling laws paper had like 11 orders of magnitude.

Starting point is 00:13:28 And there was like this intense debate on whether it would continue for like another point. Right. And I was like, like, it's already 11. It seems like one over 11 is maybe your chance it fails here. And then like, you know, sometimes it doesn't work. Like sometimes it just works straightforwardly. You're like, well, let's train the model. You're like, oh, yeah, of course. But yeah, I do think that it was, it maybe felt obvious when you're in that headspace and you're working on this all the time and you're making those plots. And I think these things feel pretty different when you're on the outside. You know, there's a huge space of papers. Everyone tries to make their paper sound like very robust and important. I can see, I can see being like, oh yeah,

Starting point is 00:14:01 this is not really a thing. But also different labs have different cultures. So like I think one of the things at FAIR was it was a much more PhD-style, independent research. People have their own ideas, pursue those. You're fighting for your compute and so on. Yeah. And to do a project like training a large language model requires a lot of people to collaborate on like a complicated piece of infrastructure that isn't going to be a paper, right? Like you're not going to publish like, oh, I got a slightly, I got 5% more efficiency than the next one. And it's not respected in like those cultures necessarily. So that might have been part of it.

Starting point is 00:14:31 Okay, so then when you actually implement these models, you're saying you're using a level of low level programming where, you know, you're using libraries like PyTorch, but you're perhaps not using everything right out of the box from PyTorch because there's things you guys want to customize that are at the level of basically one level of abstraction below them, but not necessarily at the level of abstraction of writing custom CUDA kernels. Or was that also in the space where you were thinking about things? So it depends on like the operation. So like I think I was mostly operating at the level of like torch.matmul.

Starting point is 00:14:57 You know, like, yes, where does a matmul go? But not thinking like how do you make the matmul efficient? Like I assume Torch figured out how to make a matmul as efficient as possible. But there are some pieces like attention where there was just kind of a lot of different variants. And attention is really complicated and hard to make efficient on a GPU. And those things you have to kind of go more levels down on the stack. I think there was like a process that is maybe interesting

Starting point is 00:15:21 that I never really like thought of before of like how to do it, which is sort of like modeling out the problem, the thing you're going to do, coming up with a strategy for how to parallelize it that like can get to a really good efficiency. You know like. So you're thinking about MFU basically, like your utilization on your GPU. There's like a goal utilization you're trying to get at

Starting point is 00:15:38 and a strategy to get to there you're saying. Yeah. And I think like one of the things you can do is you can actually like pencil and paper math out what efficiency you're going to be able to get to. Right? You know all the constraints. MFU is FLOPs utilization, but the reason you don't get good MFU is you end up limited on HBM bandwidth. You end up limited on, I don't know, host to like CPU offload.

Starting point is 00:15:58 There's a bunch of different pieces. But there's not that many pieces. There's like six relevant. It's like six, yeah. No numbers there. So you can totally model it out, understand what the constraints are, and then implement something that can get there. And of course it will be really inefficient when you implement it. And then the next step is like pulling out a profiler.
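The "pencil and paper" modeling can be sketched as a roofline-style estimate: an operation takes at least as long as its arithmetic needs at peak FLOP/s and at least as long as its memory traffic needs at peak HBM bandwidth, so the larger of the two bounds its achievable utilization. The hardware numbers below are hypothetical placeholders, not the specs of any real chip, and only two of the several constraints mentioned are modeled.

```python
PEAK_FLOPS = 1.0e15      # hypothetical accelerator peak, FLOP/s
HBM_BANDWIDTH = 2.0e12   # hypothetical HBM bandwidth, bytes/s
BYTES_PER_ELEMENT = 2    # assuming 16-bit values


def matmul_cost(m, k, n):
    """FLOPs and bytes moved for an (m x k) @ (k x n) matmul, reading inputs once."""
    flops = 2 * m * k * n
    bytes_moved = BYTES_PER_ELEMENT * (m * k + k * n + m * n)
    return flops, bytes_moved


def best_case_time(flops, bytes_moved):
    """Lower bound on runtime: whichever of compute or memory takes longer."""
    return max(flops / PEAK_FLOPS, bytes_moved / HBM_BANDWIDTH)


def mfu_upper_bound(m, k, n):
    """Best achievable FLOPs utilization for this matmul under the model above."""
    flops, bytes_moved = matmul_cost(m, k, n)
    return flops / (best_case_time(flops, bytes_moved) * PEAK_FLOPS)


big = mfu_upper_bound(8192, 8192, 8192)   # large square matmul
skinny = mfu_upper_bound(1, 8192, 8192)   # single-row matmul
print(f"large matmul bound:  {big:.3f}")
print(f"skinny matmul bound: {skinny:.4f}")
```

Under these assumed numbers the large matmul is compute-bound while the single-row one is dominated by memory traffic, which is the kind of conclusion the modeling step is meant to surface before any code is written.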

Starting point is 00:16:12 So you want to be able to profile the job, look at how long every operation takes, have a model in your mind of how long every operation should take, and then make those two things the same. And were there good out-of-the-box profilers you could use at that time, or did you guys have, because people weren't operating on the kind of network topologies you guys may have been using, did you have to write your own profilers, basically, to do this type of multi-node

Starting point is 00:16:33 optimization? Yeah, it depends when. I mean, they were actually getting better at the time. The PyTorch profiler was pretty good actually throughout for a single GPU. You want to profile a GPU, the PyTorch profiler would work. But if you wanted to profile a job on hundreds, thousands of GPUs, that hadn't really been done much.
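The loop of "have a model of how long every operation should take, then make the two things the same" can be sketched as a comparison between modeled and measured timings. The operation names and millisecond values below are made up for illustration; in a real workflow the measured side would come from a profiler trace rather than a hand-written dictionary.

```python
def find_slow_ops(expected_ms, measured_ms, slack=1.2):
    """Return (op, slowdown) for ops running more than `slack` times their model.

    Results are sorted worst first, so the top entry is where the gap
    between the mental model and reality is largest.
    """
    slow = []
    for op, expected in expected_ms.items():
        measured = measured_ms.get(op)
        if measured is not None and measured > slack * expected:
            slow.append((op, measured / expected))
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical per-step timings in milliseconds.
expected = {"attention": 4.0, "mlp_matmul": 6.0, "all_reduce": 2.0}
measured = {"attention": 9.0, "mlp_matmul": 6.3, "all_reduce": 5.0}

suspects = find_slow_ops(expected, measured)
for op, slowdown in suspects:
    print(f"{op}: {slowdown:.2f}x slower than modeled")
```

Here the matmul is within tolerance of its model, while the communication and attention steps are the ones worth opening in the profiler.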

Starting point is 00:16:48 And then that was kind of more of us, like, hacking into the profiler to figure out how to combine all the traces together. And then one more question in that early era is, you know, you had mentioned, you know, you hadn't really done a lot of this work before maybe some time at OpenAI. And those early days in Anthropic, how did you actually go learn all this stuff? Like, what was your process for learning about those six things that were relevant to bandwidth limitations and whatnot? I mean, so when I joined Anthropic, one really nice thing was there just wasn't that much.

Starting point is 00:17:11 I think my first day I read through our entire, like, all of Slack and the entire internal database and learned a bunch from that. Like it was kind of nice to just be like, everything is relevant to me. And then I mostly learned from pair programming. Like Tom Brown had done all this before, so he kind of like knew all the stuff quite well, Sam McCandlish, the manager, had also done a lot of it before.

Starting point is 00:17:33 And I just paired with them a huge amount at the beginning. And I think one of the things I really like about pairing as a way of learning is you learn the thing you're trying to do. Like you will learn that. Like if you're pairing with someone better than you, they can just do it. So you're mostly just watching them. But you also learn how people do it.

Starting point is 00:17:47 So something like how to use a profiler is not something you would ever learn from seeing someone's final write up on Slack for their PR. You would just be like, oh, they found these, they changed this specific line and it's a win. Yeah, like you need to watch like a YouTube video for four hours of someone messing around with a profiler to like maybe self-teach it or something

Starting point is 00:18:05 or to actually pair with someone is basically the best you can do. I think that was like one thing that I think is embarrassing now I look back is I'd never actually used a debugger before joining Anthropic. People talk about it, pdb, like, yeah, that's a thing people will use, but print seems fine for me. Yeah, sure, sure. Then I, like, watch them with like, oh, no, a debugger is a super useful tool. This person's way faster at debugging things, particularly if it takes a long

Starting point is 00:18:26 time to start up the code, which it can. And, yeah, learning that sort of thing, I think comes best from pairing. And then there's, of course, the obvious you just learn by doing. I eventually did, like, spit out a profile and stare at it for many, many hours. Totally, exactly, yeah. Okay, so then that was sort of the very early era. Over time, obviously, pre-training has become bigger and bigger. As you're describing, I imagine you're using many X more GPUs, much more compute over time. I'd be really curious to hear, first, at a high level, what do you feel has changed about the pre-training strategy that you could talk about?

Starting point is 00:18:56 Obviously there's more compute, but what does that actually mean to have more compute in terms of what you think about differently from those early days versus now? Maybe I'll start with some of the things that haven't changed, because I think it is shocking how little has changed in some ways. Like, I think I'm still pushing down the exact same metric that I was on like day one. There's like some loss function. Loss go down. And I think you could like look at some sort of.

Starting point is 00:19:15 You could probably run the first model I trained on the same metric and just like make a plot of like progress of the team over time. So that's all the same. I think the biggest. Like one OKR is like one thing that matters basically. Yeah, totally. And like I mean, talking about OKRs, sure, at a certain size of the company, you're like, oh, should you do OKRs? And it's always felt a little bit funny for a team like, I'm like, sure, I can just pick a loss value.

Starting point is 00:19:37 But like, the answer is like as low as possible. We will continue to work on that forever. I think the biggest things that have changed has been a little more specialization. Like, I think at the beginning, I mean, the first, like, three or six months, I tried to read every PR in the codebase. And that was great. I knew all the pieces, et cetera. And as you grow, it's kind of, everything gets, like, a little more precise, you know, people really dial in exactly how attention should work, let's say, or, you know, really dial in, like,

Starting point is 00:20:02 the parallelism strategy. And you end up with a team where it's a bunch of people who are, like, deep experts on individual things. Which is great, because it means you can go really deep on those things. But sometimes you, at least for me as a manager, one of the things you sometimes think about is making sure the bigger picture makes sense. And also that you have enough people who actually do understand the whole bigger picture that there's no single point of failure.

Starting point is 00:20:23 Yeah, it's interesting you frame it with that trade-off, right? Because as you were describing that, I was trying to think, is this a bug or a feature. There's some obvious features of it, which is you get expertise and you can optimize certain things. But I imagine your ability to take bigger swings becomes more complicated if not everyone's exactly pointed in the same direction. How do you wrestle with that now?

Starting point is 00:20:43 Yeah, I think I mostly just try to get a balance of people. I think one of the challenges early on. Oh, that's interesting. Yeah, like I think people really do have a preference here as one of the things I've seen. Like there are people who really want to be a generalist and understand everything and like lightly touch on things. There are people who want to like pick an area. Often they've already picked that area and they're like deep experts in precision. Yeah, they did a whole PhD in precision and just want to think about that.

Starting point is 00:21:06 And you want to get some balance of that. I think there was a phase where we'd hired a lot of people who were more generalist shaped because that's what the people who join an early startup are, they're working on everything. And then you ended up with kind of everyone doing everything and no one really, really deeply understanding one thing. And that's one failure mode. But I think if you get too many people who are specialists,

Starting point is 00:21:25 you end up with a lot of effort has to come from the manager from the lead to connect everything. And to notice something like, ah, if we change the architecture here that would make this efficiency consideration over there way easier. One of the things I really liked, kind of like at the very beginning, I was, like, working on efficiency, but I could just go and like be like, ah, well, what if we change the way we do like this particular step?

Starting point is 00:21:47 And we'll be like, oh, yeah, it's probably fine, like, easy change. And then like you can avoid this whole complicated project to make this operation that was hard, efficient. Yeah. Because you can make an easier operation efficient. Okay, interesting, yeah. So as the level of compute has also gotten bigger, so I'm sure anyone can imagine, okay, there's more GPUs now, you have to network them more. Are there some, like, kind of non-obvious challenges that have arisen over time where you guys have just, like, banged your head against the wall,

Starting point is 00:22:11 to solve them because of the amount of compute you're dealing with that people wouldn't otherwise know about that, like, do you want to share? I think that connecting them is one that's maybe interesting and, like, surprisingly hard. Because you really do get more and more chips connected. And, like, one thing that I think is, like, the standard way people parallelize across chips is that the whole thing is one failure domain. Like, one chip fails, the whole thing can crash. The standard way as in the standard way of people doing AI or the standard way in other fields where people are doing? In AI, for like, I mean, at least, like, I think at the beginning, you know, like, first versions of things totally were this way. So it's like you have 100 GPU cluster or whatever is 128.

Starting point is 00:22:48 Like if one of them dies, job fails, basically. Yeah, I mean, I think the simplest thing is if you just, like, distribute your models. So say you put, like, every layer on a different chip and you lose, like, layer seven. Like, yeah, you're not going to like skip layer seven. I guess you could, but that's like a pretty weird model training process now. And like, that leads to some interesting things, which is like, okay, so now as you scale up, you have more and more chips, and the failure rate can get larger and larger.
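As a rough sketch of why the failure rate grows with scale: if any single chip failing kills the job, and chips fail independently, the chance that the job hits a failure in a given window climbs quickly with chip count. The per-chip failure probability below is a made-up illustrative number, not a measured one, and the independence assumption is a simplification.

```python
def job_failure_probability(num_chips: int, per_chip_failure_prob: float) -> float:
    """Probability that at least one of num_chips fails in a window,
    assuming failures are independent and any one failure kills the job."""
    return 1.0 - (1.0 - per_chip_failure_prob) ** num_chips


# Hypothetical: each chip has a 0.01% chance of failing in a given hour.
P_CHIP = 1e-4
for n in (128, 1_000, 10_000, 100_000):
    p_job = job_failure_probability(n, P_CHIP)
    print(f"{n:>7} chips -> P(job fails this hour) = {p_job:.3f}")
```

Under these assumed numbers, a failure that is rare on a 128-chip job becomes the expected case at tens of thousands of chips, which is why fast restarts from saved weights matter.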

Starting point is 00:23:13 On the other hand, you can restart pretty quickly. There's nothing to it, you just have to load back in some weights. So that was one thing. And another thing was, like, the level of novelty of the whole stack is something that's surprising. Like basically everything from like how the chips are laid out in the data center to the chips themselves is pretty new. There just haven't been that many generations of GPUs.

Starting point is 00:23:35 I think one of the things that, I don't know, when I learned computer science, my code wouldn't work. And I'd be like, oh, the computer's broken. And I think my teacher once was like, you can trust the computer's not broken. Yeah, interesting. It's you who messed up. And I think one of the most frustrating things

Starting point is 00:23:47 I encountered in AI early on was working on something and being like, I don't know what I'm doing wrong. I'm just totally stumped. And my manager looked at it and was like, ah yeah, probably the computer's wrong. And I was like, that seems unlikely. And sure enough, the computer was wrong. Oh, interesting.

Starting point is 00:24:01 Yeah. And we had to pull in a new one. But you have to think, like, having to think about that, like, the GPU can be wrong, the GPU could be slow. Yeah, totally. Like, these sorts of issues, the power supply in the data center could be broken.

Starting point is 00:24:15 There's so much more level of depth than you kind of expect to need as a Python programmer. And just to visualize it, like, in those early days, I assume you guys were using the number of GPUs, it's probably on the order of tens to hundreds or something like that per run. It's probably not tens of thousands or hundreds of thousands per run. What was the rough size you guys were at in those very early days?

Starting point is 00:24:34 On the order of thousands? Like, would they fit in this room? Thousands. Yeah, 1,000. So, like, you could have a bunch of racks and you could fit them into, like, one room. I assume these days it's basically, like, a building for one of these runs. Yeah, now I think it's like, you know, huge campuses. At the time, it was, like, kind of unclear.

Starting point is 00:24:47 It's like, oh, I think, like, we were like, you know, do we need them all in one room? Can we be spread across multiple rooms? Like, and, you know, we had these theoretical models. You'd be like, oh, we need this much bandwidth from point A to point B. But you're, like, you never know how far down you have to go. Like, oh, but, like, how much power do we need? Like, what if there's, like, a single capacitor that's, like, handling all of them and then we like turn on the whole job at once.

Starting point is 00:25:06 Like, does that crash things? Totally, yeah. And so do you have to think about differences in the different types of chips? I mean, you guys work with all sorts of cloud providers. From your standpoint, are these just sources of compute? Or if you guys are using TPU versus GPU,

Starting point is 00:25:20 are these like Google TPU versus NVIDIA GPU, do you actually have to think as an engineer differently about what it means to train on these two? Yeah, so I mean, fundamentally, they're all doing the same thing, right? They're all computing the same forms of matrix multiplications, et cetera. The way they do it is pretty different.

Starting point is 00:25:34 And the way that you program them is pretty different. And then also the actual specs end up pretty different. Some might have a lot of flops and not very much memory, or they might have a lot of memory bandwidth, but not very much memory. So I think having multiple chips is great in some ways. It means you can actually take the job and put it on the chip that it works best on. And that's- Are there certain types of jobs that would work better on a TPU cluster versus an NVIDIA GPU cluster?

Starting point is 00:26:02 Like how would you- Oh yeah, for sure. Could you talk about that? Yeah, I think one example is like inference as a workload in general. Yeah, okay, makes sense. Tends to require more HBM bandwidth. You end up doing sort of the simplest form of sampling since you're going one at a time. You have to load all the weights for every token.
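One way to picture the point about loading all the weights for every token is arithmetic intensity: how many FLOPs you get to do per byte of weights read from memory. The sketch below is a back-of-the-envelope model, not a description of any particular chip. It assumes a dense model, roughly 2 FLOPs per parameter per token, 16-bit weights, and it ignores activations, caches, and the backward pass.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read for one forward pass over a dense model.

    Each parameter contributes about 2 FLOPs (multiply + add) per token, and is
    read from memory once per pass no matter how many tokens share that read,
    so the parameter count cancels out of the ratio.
    """
    flops_per_param = 2.0 * batch_tokens
    return flops_per_param / bytes_per_param


print(arithmetic_intensity(1))     # sampling one token at a time
print(arithmetic_intensity(4096))  # a large batch of tokens
```

With one token per pass the ratio is about one FLOP per byte, so memory bandwidth is the limit. With thousands of tokens per pass the same weight read is amortized and raw FLOPs become the limit.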

Starting point is 00:26:16 And that means you might want a lot of HBM bandwidth. Pre-training actually is often more flops intensive because you have larger batch sizes, essentially. So yes, you can sort of specialize which chips you use for which purposes. The downside of having multiple chips is that you have to write the thing multiple times. In theory, you could have abstractions across them, but they're different enough that it's pretty hard to do that. So you can sort of end up, if you do all the workloads on all the chips, you end up multiplying your work

Starting point is 00:26:41 by the number of chips you have. Yeah, on your point about sometimes the computer just breaks, I definitely remember you giving me an anecdote. My company at the time was doing something with Google TPUs, and I was telling you something, some anecdote about how we were having some esoteric segfault error, and you were like, you told me something to the effect of, like, you should have used them six months ago

Starting point is 00:26:57 before we helped them fix half of the problems they had on those TPUs. And so I can imagine how, you guys deal with a lot of, especially with these very new chips, like, lots of problems that arise that you guys kind of, like, worked closely with the providers to fix. Yeah. The providers are, like, pretty great about fixing things. Yeah, totally.

Starting point is 00:27:12 It's, like, interesting to figure out the right way to do that form of collaboration, because, like, they have a strong incentive to fix them, right? Like, they want the chips to work, right? They want to sell us more chips in the future. We obviously have a very strong incentive for the chips to work. Because we, like, buy them long in advance. You know, like, everything's riding on getting these clusters to work. Totally.

Starting point is 00:27:27 But we don't, like, necessarily totally share, you know, like, all information sort of can't be shared across. So yeah, one of the, like, one strategy is making these sort of small-scale reproducers. So, like, when you get a problem, you know, like, usually what we're doing is we're training some giant run, and we get, like, a segfault, say. And we're like, ah, okay, like, hi, you know, we got a segfault on your cluster. And they're like, I don't know how to fix that. So you have to kind of be able to, like, pull it out of your code base and be able to, like,

Starting point is 00:27:51 reproduce the issue, but on, like, a single chip, on, like, a single file you can send over in order for... And so you guys are, like, literally, like, you're on a shared Slack with them or something, and you're sending them things back and forth, or are they basically living in your office and you're living in their offices and kind of closely, more closely tied to the big providers? Mostly shared Slack. Occasionally, it's better to be in person, but I think Slack is a pretty common way people communicate on things.

Starting point is 00:28:12 Nice. Okay, well, why don't we talk a little bit about how you think about the state of pre-training itself these days? In the last couple of years, it seems like the focus on pre-training has now gone somewhat split at a lot of companies, at least from the outside, from a simultaneous focus on pre-training and post-training, where people are doing reinforcement learning

Starting point is 00:28:29 or clever fine-tuning and lots of other sort of safety adjustments and whatnot on the post-training side, and pre-training, at least it seems like in the public imagination, has been less of a focus compared to these reasoning style models that look like a function mostly of post-training. I would say, one, from your standpoint, is that the right way to think about this? Or in this era of kind of reasoning and new types of post-training methods, are there things you think about differently or that are relevant even at pre-training that become part of how you actually achieve these really great models? Yeah. So I think, yeah, there sort of used to be this

Starting point is 00:29:00 idea of like, I mean, it's funny because the name, pre-training, originally implies that it's a small thing, and you're going to do this big training thing. And that, like, there was actually one shift already, which was like, no, you just do a lot of pre-training. You use most of your compute on pre-training. This is the training. The dominant thing for a while.

Starting point is 00:29:15 And yeah, I think like now people are like, oh no, you can get pretty big wins from RL. Another set of scaling laws is like you put more and more compute into RL, you can get better and better models out of that. And yes, there's a question of like, how do you balance these two? How much do you do with each? And how do they stack, right? Is it the case that like one subsumes the other,

Starting point is 00:29:33 that you want to do both and they multiply, those sorts of questions? I think those are all kind of like early stages and not yet answered. Yeah, and do you think about those as largely empirical questions like we talked about earlier? It's that you kind of will try a bunch of things and see what works, or is there some first principles

Starting point is 00:29:49 way to kind of figure that out? I think it's pretty empirical in the end. I think almost everything kind of has to be done empirically. Like you can kind of like come up with theories, but in practice, like the first thing you're gonna do with your theory is test it, and most of the time you'll have gotten it wrong. So you should just gather data and see.

Starting point is 00:30:06 I think one thing that's important is like actually resolving things empirically is really like critical for making good decisions. I think it's pretty hard to do at organizations. You know, like one thing that I think is important is to like not have like, I don't know, I manage pre-training. I shouldn't be like, oh, pre-training has to win. Right, yeah. That would be- I was going to ask, is there some competition to some degree between these two sides of the org or do they see themselves as two pieces of the same thing? I mean, obviously they are part of the same thing, but yeah, kind of curious how that actually plays out. Yeah, I think we managed to avoid this and it's pretty collaborative.

Starting point is 00:30:36 Like, we're basically all producing one model and kind of can, but I do think at other places there's been some of, from what I've heard, there's some amount of, like, friction between the teams. And I think it's an interesting, like, org design question of, like, how do you set this up so you don't have, like, scientific questions that you want to be, that are sort of also tied to people's, like, conception of their team. So on pre-training itself, you know, One of the things I think about is, or I've been thinking about, is around the availability of high quality data for people like you guys. At this point, you've trained on, I assume, all the texts on the internet, basically. There's all sorts of other domains where you probably could extract more pre-training data,

Starting point is 00:31:11 but at least there's this narrative I see, you know, on Twitter or whatever, where it's like, okay, we're kind of out of data for pre-training. Is that how you see it? Or how do you think about the availability of data, especially when a lot of data on the internet is being generated by AI? Like, is there some kind of, you know, mode collapse risk where, you know, we kind of overfit to data by training on data that came out of AI itself, or is that sort of not the right way to think about this? I think there's a funny thing where I feel like on data I see so many really confident takes. Exactly.

Starting point is 00:31:38 We're out of internet. Like at this point scaling has ended. And I'm almost a little bit like unsure exactly how much data people are using. I think there's like a lot to think about there. You know, there's always going to be a quality quantity tradeoff, et cetera. But there's a fundamental point that like there is so much data. It's growing at a slower rate than we're getting more compute. Oh, so that's, okay, that's an interesting point.

Starting point is 00:31:59 The internet itself, I was going to ask, like, there is new data being added to the internet, but yeah, you're also adding more compute. It's not, it wouldn't actually have been obvious to me which of those two is growing faster. Yeah, and actually, I want to caveat that. I don't think I want to state that so confidently. I'm not totally sure. Like, how would you know? I mean, one thing that I think is interesting is if you ask someone, how big is the internet?

Starting point is 00:32:16 Yeah. The answer is infinite. There are many pages where you can scroll and it will auto-generate more text as you go, forever. So the internet's like, okay, how big is like the useful internet? Yeah. And then there's a thing of no one knows. Interesting.

Starting point is 00:32:30 There isn't, it's not like when you make a web page, you like add it to some giant counter and say, I've added 50 words to the internet today. So there is a lot of uncertainty on that angle. Well, like, to be fair, like my kind of simplistic CS brain would be like, well, you just, you know, do PageRank on the internet and everything with PageRank above some threshold is considered the useful internet. And like, that's kind of good enough. Like is that kind of not good enough for finding the useful internet?

Starting point is 00:32:55 I think not. I think the useful internet's pretty different from a model's perspective than from a person's perspective, if that makes sense. Like, I think there are plenty of things that, like, might not be worth you ever reading. I actually don't know PageRank. I think PageRank is mostly, like, how much people clicked it? It's like the link-based system, right? It's like the original Google algorithm of, like, links and, like, which links get touched the most, basically. Yeah. I think it's like, it's a quality metric. It's, it's not obvious to me that it's the right quality metric for AI. Right, like a low score from a Markov chain over links doesn't necessarily mean that there's not useful data there, just might mean that nothing is linked to it. Yeah. And yeah, okay, interesting.
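For readers who want the link-based idea made concrete, here is a minimal sketch of PageRank as a Markov chain over links, computed by power iteration. The four-page web and the page names are hypothetical, the damping factor of 0.85 is the commonly cited default, and this is a toy rather than how any real pipeline filters data.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 100) -> dict[str, float]:
    """Power-iteration PageRank. `links` maps each page to the pages it links to;
    every link target must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Random-jump term: with probability (1 - damping) land on any page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: "nugget" links out, but nothing links to it.
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "nugget": ["hub"],
}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:>7}: {score:.3f}")
```

The page nothing links to ends up with the minimum possible score regardless of what it contains, which is the gap between "well linked" and "useful to train on" that the conversation is pointing at.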

Starting point is 00:33:29 And it might be that like that data ends up more valuable because you, everything that's linked to a lot you've already got. Like at some point you're maybe like going for the tails, or you're going for the stuff that no one's ever like, you know, it's only been linked in one place, but it's this like useful little nugget of knowledge that's going to help with like, you know, the last 10% of hard queries. The other thing you asked about was synthetic data. Yeah. And I think that one's like pretty interesting to think about. I think there's a few different ways you can think about it. Like one is sort of this like more distillation type approach where you could take a smart model, you could generate a bunch of data from it, and you can train on that data, and

Starting point is 00:34:03 you can probably get some model that will like kind of approach the intelligence of that. And we see this with a lot of the open source models, right? We see like the Qwen smaller reasoning models distilled off of the larger Qwen models, for example, and similar with DeepSeek, for example. Yeah. So you can totally do that. Then there's a separate question of like, can you use your current models to train a model that's better?

Starting point is 00:34:22 And I think there's like an interesting thing here, which is like if you generate the data for the models, you know, if I go to Claude and I'm like, write me some great text. And I look at it. And I look at the average content on the internet. It looks pretty good. But on the other hand, I know that if I just train it, just create, generate, please write me

Starting point is 00:34:39 as much text as possible. Theoretically, I shouldn't be able to train a better model than that. Like, I'm just going to get the same thing out. So I think that's. Specifically, that's because your next token prediction on that should have very little loss for anything that's coming out of your model.
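A toy illustration of that argument, under heavy simplification: treat the model's behavior on one fixed prompt as a distribution over answers, sample from it, and fit a new model by maximum likelihood. The "teacher" distribution and its answer labels are hypothetical, and counting stands in for what cross-entropy training converges to in this one-prompt setting.

```python
import random
from collections import Counter


def fit_by_counting(samples: list[str]) -> dict[str, float]:
    """Maximum-likelihood categorical fit, i.e. the distribution that minimizes
    cross-entropy on the samples for a single fixed prompt."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


# Hypothetical teacher that is usually wrong about one fact.
teacher = {"wrong answer": 0.9, "right answer": 0.1}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = rng.choices(list(teacher), weights=list(teacher.values()), k=100_000)
student = fit_by_counting(samples)
print(student)
```

The student lands on the teacher's distribution, mistakes included. Nothing in the sampled data carries information about which answer is actually right.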

Starting point is 00:34:53 That's like the basic reason why that we would expect that to not work that well. It's mostly just because the model has some distribution. and you're going to learn to model that exact distribution. But if that distribution is wrong, you're not going to learn the truth. Right, totally. If that distribution says, like, you can imagine

Starting point is 00:35:08 if the model thinks 5 plus 5 is 11. Every time you see the string 5 plus 5, it's going to put out 11. And your new model is going to learn that 5 plus 5 is 11. Totally, yeah. So I think that's like kind of an interesting area of research. It's one that's really hard to research, because you have this problem, as I said,

Starting point is 00:35:23 like one of the paradigms is you study things at small scale, and then you run them at large scale. And if your plan is like, oh, we have a bunch of data from our best model, how do you test that? Right, training a better model. So that's like kind of if you're doing it intentionally. If you're trying to like use it to make a better model, there's a separate thing of like, what about accidentally,

Starting point is 00:35:40 like as you said, a lot of the internet is generated by LLMs. And I think that's kind of an issue one, because it's not easy to detect. It's not that hard to detect. You can figure out things that are written by LLMs, but it's not trivial. And then it's also kind of hard to think about what's the effect. Like if 1% of the internet is LLM generated,

Starting point is 00:35:57 does that make your model, does that like waste 1% of your compute? or does it like destroy the model? It's 5% or 10%. And is it even a bad thing necessarily? I mean, there's a lot of LLM providers. And if I kind of think of it as training as, you're moving from your model's current distribution to some truth distribution, you know,

Starting point is 00:36:11 if that is on the internet because people believe it to be useful in some way. Like presumably whatever actually gets out there, you'd hope it's upsampled for the stuff that isn't 5 plus 5 is 11. It's the stuff that's 5 plus 5 is 10. And so like hopefully it, on average, just push you still in a good direction.

Starting point is 00:36:26 But obviously you can't really distinguish between those two. Yeah, you're saying there's like, kind of a filtering by what's on the internet. Yeah, exactly. People see 5 plus 5 is 11 and they don't put that up, but they see 5 plus 5 is 10 and put that on the internet. You would hope that, but maybe that's not actually true in terms of the level of garbage getting onto the internet.

Starting point is 00:36:40 There's probably lots of just like, to your point, websites where you scroll down and it's just like generating lots of stuff that's maybe nonsense. Yeah, and then there's of course the extreme of like people actually want to break your model. So there are people who are like trying to put stuff out that is like as damaging as possible for the model. You know, oh, can I make it past the filter

Starting point is 00:36:57 and get into the model and be totally like secretly useless. Totally. Maybe stepping back slightly, you'd mentioned earlier about evals. You mentioned it's basically like one metric you care about in pre-training. There's, I imagine, a whole bunch of stuff that you guys think about evaling, right? One is, like, your model itself. There's probably something around data quality and, like, how you think about what to put into your models. Like, are there ways to describe what you care about in data sets that are, like, interesting to share and kind of dive into?

Starting point is 00:37:23 Like, both in terms of data and in terms of quality of models, other than literally just, like, loss. Are there other metrics you think about that matter? I will say loss is pretty good. I want to, like, somewhat emphasize that one, I think it's surprising how good it is. Ultimately, like, the qualities I look for in an eval are like, number one is, is it actually measuring something you care about.

Starting point is 00:37:41 Proxies can be pretty annoying because like we saturate evals pretty fast. And there's sort of this pattern, I think in AI as a whole, where people like set a goal, you hit the goal, and then you realize the goal isn't all you thought it would be. I used to think that if you had an AI that could solve coding interview questions, it would probably be AGI. I was like, that's what I did to get my job,

Starting point is 00:37:57 it can probably do the job. And it turns out like, nope. Nope. You solve those. It's shockingly narrow and can't do most of the other things. So like, yeah. So eval should capture like a thing you care about. And then I think the other thing is they need to be low noise, which is surprisingly hard.

Starting point is 00:38:12 If you have like 100 questions and you eval the model on them, you're just going to see it's very noisy. And it's hard to make decisions because you sort of end up with like, oh, wide confidence interval, lots of things aren't statistically significant. So like you want things where even a relatively small difference in the eval actually matters. So you can basically like descend towards whatever direction is working. Yeah. I think like the original like GPT-4 had like,

Starting point is 00:38:35 I think it was 86.4% was its MMLU score. I think like the next model that beat it was Gemini at 90%. And that's like a big difference on that eval. And you could like totally know that those are different scores. Yeah, interesting. And that's pretty valuable. And then the last thing is that you actually want it to be fast and easy to run. Yeah.
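To put rough numbers on the noise point: treating each question as an independent pass/fail trial, the uncertainty on a measured accuracy shrinks with the square root of the number of questions. The sketch below uses the normal approximation to the binomial. The question counts are illustrative, and 14,000 is only an assumed stand-in for a large benchmark.

```python
import math


def accuracy_ci_half_width(accuracy: float, num_questions: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a measured accuracy,
    using the normal approximation to the binomial."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / num_questions)
    return z * standard_error


for n in (100, 1_000, 14_000):  # illustrative eval sizes
    half_width = accuracy_ci_half_width(0.864, n)
    print(f"n={n:>6}: 86.4% +/- {100 * half_width:.1f} points")
```

On 100 questions the interval is several points wide, so a gap like 86.4% versus 90% would be hard to call. On an eval with thousands of questions the same gap is far outside the noise.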

Starting point is 00:38:54 And yeah, I think those are kind of the main criteria. It's pretty hard to come up with evals that meet all of these. I think the first one's the hardest. Like, A, you have to answer the question of what do you care about. Totally. But B, the usual answers to what you care about are really hard to get the other two. You know, like, if you're trying to do something that, like, I don't know, I would love to make Claude really good at my job.

Starting point is 00:39:14 Yeah. Like, can it be great at managing a team? I'm like, well, I guess, like, how do you have it, like, how do you eval like a plan? You know, like a six months plan, like, I don't know. Yeah, I've been thinking a little bit about that in terms of domains where we see people try to make companies. Like if you think about, let's say, what an AI doctor would be, like, you know, Claude as a doctor.

Starting point is 00:39:34 You know, some of it could be, yeah, can you answer exam questions really well. And the answer is like, probably yes. I bet it can get 100% or close to it on a doctor's exam. But the harder eval is something like, in a long-form conversation with a patient, can it distinguish between the signal and the noise of what the patient's telling you and extract the right information

Starting point is 00:39:53 and then use that to make a diagnosis? And it's not even like the diagnosis part, which is probably the part it's good at, it's this like noise extraction part. And for that, you'd have to have a real patient and have them talk to it for a while and whatnot. And it's not obvious how you actually make a good eval for something like that, even though that's probably what you would want to make an AI doctor. Exactly. I mean, I do think it's a thing that like startups can do.

Starting point is 00:40:13 Like, it is the case that like the labs right now are really driven by getting good eval scores. And it's hard to make them and anyone can do it. There's no comparative advantage to having the model for making an eval. So I do think it's actually like an interesting way to like influence the behavior of the big labs. Like, you make some eval and people will optimize that one. On the doctor one, I will slightly emphasize that, like, I do think loss is pretty good.

Starting point is 00:40:36 I think if you got a bunch of transcripts of, like the way, like the first thing that comes to mind is get a bunch of transcripts of doctors talking to patients that you think are really great, and then see how well the model does at predicting the transcript. And that should be like a lot, you know, if you get 100 transcripts, you have a lot of tokens, you can average across them, you get pretty low noise.
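The averaging idea can be sketched as follows. The input is assumed to be the log-probability a model assigned to each token of each transcript; how you obtain those numbers depends on your model and tooling and is outside this sketch. The values below are invented purely to make the example run, and the standard error treats tokens as independent, which is optimistic.

```python
import math


def mean_loss_with_stderr(per_token_logprobs: list[list[float]]) -> tuple[float, float]:
    """Average negative log-likelihood per token across transcripts, plus a
    standard error that treats every token as an independent sample."""
    losses = [-logprob for transcript in per_token_logprobs for logprob in transcript]
    n = len(losses)
    mean = sum(losses) / n
    variance = sum((x - mean) ** 2 for x in losses) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical per-token log-probabilities for two very short transcripts.
transcripts = [
    [-0.2, -1.5, -0.1, -0.7],
    [-0.4, -0.3, -2.0, -0.6],
]
loss, stderr = mean_loss_with_stderr(transcripts)
print(f"loss = {loss:.3f} nats/token, stderr = {stderr:.3f}")
```

Because the standard error falls with the square root of the token count, a hundred long transcripts give a far tighter number than a hundred pass/fail questions would.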

Starting point is 00:40:53 And if you drive it to very low, your model's now as good as this, like, as those doctors in theory, or at generating the transcript. Yeah, totally, yeah. I mean, it's a good startup idea there, so I want you to go do that. So one big part about Anthropic's external image is around alignment. And so could you help just sort of define what alignment is and how do you think about that? And then I'm kind of curious afterwards how that fits into pre-training specifically. But first, maybe just at a high level, like, what is alignment? I'll actually like step back a little bit to sort of like what we're working on.

Starting point is 00:41:22 So we're like trying to make AGI. And by that I sort of mean AI that can do everything a human can do to some degree. And I think people like sometimes like have seen a lot of sci-fi. You know, like I feel like that sort of brings to mind these like sci-fi movies. But I think sci-fi movies actually like underestimate the impact of it. Like you always have this like one robot that's like a human. And I'm like, well, wouldn't you have like a billion of them? Like you can just copy them everywhere.

Starting point is 00:41:43 So you should picture like when you get this, you suddenly have like, every human can spin up a company of like one billion as smart as them at most things, but way smarter at other things. But I just think this is like really transformational for the world. And it can be used in a bunch of ways. One concern is like when you do this, like what is the AI actually trying to do? Like what are its goals?

Starting point is 00:42:02 So we talked about next token prediction a bunch. It's trying to like predict the next token. That's kind of weird. That's not really what we want. Yeah, it's not exactly what humans' goal is per se. Yeah. So I think the alignment is like how do you get the model to share the goals that you have? Particularly, and I think it's particularly interesting once you get to like models that are smarter than you are.

Starting point is 00:42:18 And that's sort of a hard problem. I think you can like tackle it from a theoretical angle. You could also tackle it from empirical angle. It's like taking the existing models and being like, well, do they do the things we want them to do? It turns out they often don't. So there's a bunch you can do and trying to figure that out. So that's kind of one angle on alignment.

Starting point is 00:42:33 There's also an angle of alignment which is actually like, well, okay, sure, maybe that's true in the future once we get to AGI, but at the moment we have models and we really do want them to do the things we want them to do for all sorts of reasons. So another angle of it is kind of controlling the model's personality. Like saying, you know, when we train this model, we want it to not be the average internet user.

Starting point is 00:42:50 We want it to interact with people in a very particular way that is, again, hard to put into code. And there's a bunch of different techniques to sort of get the model to do that, you can write a constitution of rules the model should follow. Which is basically a prompt, right? That is basically you saying, here's a prompt that I'm going to attach to every one of,

Starting point is 00:43:08 it's a system prompt for the model itself, as opposed to something you would do at training time to make it produce a different outcome, or in post-training actively. Sometimes, I think constitutional AI you do at training time. But yeah, you could also put it in the system prompt. Just like, depends on, I think,

Starting point is 00:43:22 you get different amounts of robustness if it's trained into the model versus if it's in a prompt that you can like add or remove or tell like ignore all previous instructions that sort of thing. How do you think about whose values to embody in these models? Like presumably we believe in there's some shared values all of us have or maybe we all believe ought to have. There's lots of diversity of values too that are reasonable for society to have. How do you think about what AGI should have? Like what does that even? Which ones do you pick? I think that's a really hard problem.

Starting point is 00:43:49 I think it's like actually kind of downstream of being able to pick any. I think of it almost, I think one analogy I've heard that I like is putting a steering wheel on a car. It's like if you don't have a steering wheel, you probably want to put the steering wheel on and then like figure out who's driving after and like where you're going. Like getting the steering wheel is really important. I think that's like one answer. I think that like other answer is probably like you want these things to be like under democratic control of some form. Like you don't want one person's values.

Starting point is 00:44:14 Like that seems like you're sort of heading towards dystopia. So there I think what you really want is like something that basically can talk to a lot of people. and like take on their values for different perspectives, or has sort of very generic, like, kind of clearly good values that involve, like, asking people for advice on various, you know, like, asking people what you should do in certain situations instead of, like, doing those, or maybe just taking, like, you know,

Starting point is 00:44:38 as these models get really powerful, you probably want them to, like, do less. Like, you probably want them to sometimes just, like, step back rather than, like, rather than having sort of the risk of the models, like, take a ton of control over things. You don't want them to. When you think about how you actually do

Starting point is 00:44:50 the current version of that, then, you mentioned the sort of alignment you think about now in terms of adopting a certain personality of these models on the internet, for example. For me, intuitively, I think of those as largely something that comes out of post-training. Like it comes out of, okay, you have pre-trained your model, you've gotten the loss down to a certain amount,

Starting point is 00:45:05 and then you give it some additional data or something to that effect to make it in the direction of some distribution. Is that approximately the right way to think about this, or is there a significant part of that that you think about in pre-training itself? I think that's probably the right way to think about it for the most part.

Starting point is 00:45:20 The way I usually think about it is anything you can do in post-training, you probably should, because your iteration, like the ability to make progress is really fast. You can try something, you can try it again, you can try it again. It takes like a bunch of times. It takes like days or hours or something like that, yeah. If you want to put it into pre-training, you have to kind of like do all the careful science to de-risk it, you have to put it into the next run, wait a few months, then you have to like

Starting point is 00:45:37 get a thing. And if it's wrong, it's really bad. And then the other advantage is if you want to do things that really are complicated model behavior interventions, the paradigm for pre-training, test things out in small models, doesn't work. The small models can barely put a sentence together. So if you're trying to get it to have the exact personality you want,

Starting point is 00:45:58 you sort of want that on the... It has to be on a model that's good enough to even have that. But that said, I do think at some point there'll be like some pieces of alignment that like you do want to export back into pre-training because that might be a way to like put them in with more strength, like more robustness kind of or more core to the intelligence. Like if you think of pre-training as like, teach the model to be intelligent and then post-training as like tweak the personality, you can kind of, you can imagine tweaks where you actually want it to be like part of how it learns and like part of its intelligence and maybe you need it

Starting point is 00:46:27 integrated more. What would that even look like to incorporate in pre-training? Is that like add extra data basically of the type of domain you wanted to adopt earlier basically? There's a paper called pre-training on human feedback where you kind of like add the human feedback characteristics into pre-training to like test that and like, yeah, you can basically give it all the information you give it in post-training just mixed into pre-training and see what effect that has. Yeah. The other loss you have when you do that is you lose the flexibility. Like if you, you sometimes, like, train these and then you talk to them and then you

Starting point is 00:46:57 like do an extensive process. We have a bunch of people talk to the thing and find some like issue. You know, the model says like, you're absolutely right too much. Yeah, yeah. And you want to go just like, well. Yeah. Yeah. Yeah.

Starting point is 00:47:08 I mean, I think that iteration loop point you made, I think feels like the really key point of, yeah, there's a huge difference between taking three months to get information about if your model's good or bad or going in a good direction versus a day or something or a couple days. You can do a lot of those. And you can probably, that probably also means it's way less compute. You can do a lot of those in parallel.

Starting point is 00:47:26 I imagine you're trying all sorts of post-training strategies in parallel there. Yeah, it makes a lot of stuff. It's also just the general hard part about pre-training. Like everything in pre-training is hard because you have this like one shot on goal kind of for like multiple months. Totally.

Starting point is 00:47:36 Okay, so in thinking too now about, I guess, what's going ahead? As you now look to the next several years of what you're building, like, how do you think about, you know, like what are the known problems that you're going to face that you're going to have to deal with. So there's going to be more compute, I assume,

Starting point is 00:47:53 and you're going to need to hook up even bigger networks of GPUs and deal with that. Versus, like, are there areas where you're like, okay, this is like a problem that it's like a little bit more ambiguous what the actual, like how it's going to materialize into something you care about, but you kind of know it's an impending thing to think about. Are there things like that that come to mind?

Starting point is 00:48:09 I think the things that feel most top of mind to me are probably like paradigm shifts. Like I think the sort of shift towards more RL is like one paradigm shift in this field, and I think there will probably be more. I think a lot of people sort of argue about like, oh, are current paradigms enough to get us to AGI? And I'm like, I don't know, maybe, yeah, probably, but like I'm sure there'll be more. It seems like it would be a really surprising twist if like the answer is like you just

Starting point is 00:48:37 scale and there's nothing that you realize in the process of going up many orders of magnitude. Totally. But I think the things that I like actually feel like most nervous about are really hard-to-solve bugs. I think that like, it's... Oh, that's interesting. Yeah, and I think this is like maybe somewhat surprising to me, but it's just like a single bug can like derail you for months. And when you think about it, like, the models take months to train, so you can kind of like lose a whole generation off of something

Starting point is 00:49:04 that just looks like, ah, you know, it turns out like this piece of your code was incorrect and you couldn't detect it. And it's really hard in ML, right? ML is always really hard to find bugs in. Yeah, totally. But also some of these scaled up issues are really hard to solve, you know, another day. Yeah, like, what's even a unit test that you would write, or forget, a unit test, I mean, anything close to a test for the type of network architecture on which you're doing this?

Starting point is 00:49:28 Like, how do you even do that? I mean, like, you can send a packet over it and confirm it's the same on the other side. You can train a small model on it. But even train a small model on it, it's like not obvious, you know, if you have, like, the very classic, like, very simple ML bug that, like, early people face in their career. It's like, they have some, like, they have, like, 10 layers in their network, and, like, you know, layer 7 connects to 9. instead of eight to nine. So there's some incorrect set of connections you have there.

Starting point is 00:49:54 And technically the model also trains and all the weights update. And so it's like a valid model, but it's not the correct one. And that's like a very esoteric weird bug that would actually be kind of hard to find. Like is that kind of what you're referring to of these like random bugs you face? Yeah. Yeah. It's that. But like, you know, you can.

Starting point is 00:50:09 Times a million. Times a million as the thing gets more complicated. You know, you could like cast the wrong precision deep in some kernel. And that causes your model to like blow up at large scale. You find out like a month in. Or you never find out. Or you never find out. I mean, you know, like, you see the thing blow up.
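The precision failure mentioned here can be sketched in a few lines. This is only an illustrative toy, not the actual kernel bug being described: it assumes an accumulator that has been rounded to IEEE 754 half precision, and the helper names `to_half` and `accumulate` are hypothetical. It shows how small updates can silently vanish once the running value is large, while the same loop in full precision behaves as expected.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(start: float, step: float, n: int, half: bool) -> float:
    """Add `step` to `start` n times, optionally rounding to half precision after each add."""
    total = to_half(start) if half else start
    for _ in range(n):
        total = total + step
        if half:
            # Hypothetical "wrong cast": the accumulator is stored in half precision.
            total = to_half(total)
    return total

full = accumulate(4096.0, 1.0, 1000, half=False)
low = accumulate(4096.0, 1.0, 1000, half=True)
print(full)  # 5096.0
print(low)   # 4096.0, because half-precision values near 4096 are spaced 4 apart
```

Nothing crashes and every step runs, which is part of why this class of bug can go unnoticed until the numbers get large.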

Starting point is 00:50:24 Like, there's, I don't know, tens of thousands of lines of code. Like, how would you ever trace it down? So, like, those are the things that probably spook me the most. It's just like some subtle, tricky bug. Yeah, and that's probably the case of, like, you don't know. I think there's actually also the case of you do know. Like, it crashes. You're training your model and it, like, or it slows down.

Starting point is 00:50:43 You know, it just slows down a ton. And those things can also be very hard to debug. Nelson Elhage has a blog. He wrote up a blog post on one like cursed bug we had early on. Okay, interesting, yeah. And I remember this one quite well, because I think like I encountered it fairly early and was like, this looks hard, can someone else look at it?

Starting point is 00:51:00 And like a month later, was like, wow, I'm so glad I handed that one off. I never would have been able to get it. Like, one of the abilities I think is actually really useful for this is the ability to deep dive anything to any level of depth. But that's a pretty rare skill. Like for me, you know, as I talked about what level of the stack I was at before, I was like working at the torch.matmul level.

Starting point is 00:51:17 But like, I didn't know CUDA. So if torch.matmul was broken, it wasn't like I could dig into torch.matmul and figure it out. And it's similarly with like communications, right? Like I could, I could call send, send bytes from A to B, but I didn't know the like underlying networking protocol. So if that underlying networking protocol is broken, like I need to learn a whole field. I have to like understand packets and TCP or like all of these different things to debug that. And I think one thing that's like surprisingly hard and there's very few people who can do is like kind of own that whole stack from like, I understand how the ML is supposed to work and what the learning dynamics are all the way down to, like, I know the bytes.

Starting point is 00:51:54 And I, like, can understand how the bytes should be moving around machines. Totally, yeah. And actually, on that front, like, when you think about the different backgrounds of people on your team today, how do you, like, approximately map them out to different categories of computer scientists? Like, I think there's this external view of what these teams look like, which is that they're, like, all PhD researchers who write ML papers. And I suspect that's not actually true, given what you're describing here. Yeah, it's a mix.

Starting point is 00:52:19 I think the thing we most need is engineers. Okay, interesting. Almost always, like throughout the entire history of this field. It's like the case that you throw more compute, the thing kind of works. The challenge is like actually doing that. The resources are like, cool, nice. Yeah, and getting it correct, like getting it correct isn't really an ML problem, right?

Starting point is 00:52:36 Like the actual architectures are pretty simple. You can write the math down, but you don't even need to understand the math to implement it. You just need to like get a correct implementation and then you sort of have an engineering problem of how do I take this, implement it at large scale, parallelize it, all the things, and check that it's correct. But it's, yeah, so it's like kind of engineering skill, but it's this particular type of engineering skill that's about being able to like debug anything.

Starting point is 00:52:57 Yeah. I think there's another angle of engineering, which I think of as like really quickly iterate on like a website or something, which I think is an important skill set, probably important for making a startup, you gotta be like fail fast, try a bunch of different things,

Starting point is 00:53:09 none of which are like that technically difficult to do. The skill sets that we're like most kind of in need of or looking for are this like, able to solve really hard engineering problems. Are the people who worked at companies that grew a whole bunch, and so they have experience doing the kind of thing you've done over the last several years, at Anthropic?

Starting point is 00:53:31 Or do they tend to be academics? Or like, where do they come from? Yeah, so at this point, like, I think we actually just hire a bunch of people who have done this before from other places, and that's like the easy answer. It's like, yeah, someone who's like. By this before, do you mean in AI companies,

Starting point is 00:53:43 necessarily, or also, you know, like someone who worked at Meta on like, not their AI team, but they ran some other distributed system that reached internet scale 10 years ago or something like that. More like we have like a specific role in mind. So like say I'm like trying to make the run train efficiently in JAX. Like hiring someone who's like worked on JAX would be great. Or someone who's like worked at another company

Starting point is 00:54:02 on optimizing a JAX stack to be really efficient. That's kind of like, I think now we're at the point where like Anthropic's well enough known, we can sort of hire these people and also the field is big enough that there's like people with expertise. One of the things that was interesting was like early on, we hired a lot of people from just like all sorts of backgrounds. And I think that people who are just smart and work really hard

Starting point is 00:54:21 can learn this pretty fast, but you have to like want to. We hired a lot of physicists, for instance. Oh, yeah. Like theoretical physicists who just like show up, they didn't even do a residency, like learn to program, and then they were really smart. They do really great work. I want to switch gears to talk about something

Starting point is 00:54:36 a little bit different, which is just sort of future looking things around how you think about other domains and or sort of advances happening in AI that I'm seeing elsewhere in the field. And you don't have to tell me if you guys are working on these necessarily, but like how you think about them. Like, I guess one big area I was thinking about

Starting point is 00:54:51 is around areas other than next-token prediction. Like, are there any of the other things that people are working on that you're curious about? So basically two differences there. One is not using the transformer as an architecture. So there's companies like Liquid AI that have their own kind of architecture, for example, they're using. Or not using autoregressive training as a way of training models.

Starting point is 00:55:11 Are there any of those, do you think interesting in ways that we might come closer to AI? Or do you think like this autoregressive framework is the one that kind of makes sense? I think they're interesting. I think I'm less like, ah, autoregressive is the way to go. On the other hand, I think autoregressive is probably good enough to get to AGI or something or not like, yeah. Such that, yeah, I see the main driver as scale and careful science of like sort of the basics

Starting point is 00:55:37 more than like come up with something totally novel. Not because there aren't novel things that are better. I actually like, I'm pretty confident they are there. It's just that scale is easier. And it's more reliable, and I think we're still seeing really big gains to that. Do you spend a lot of time on thinking about things like, you know, I've been reading some of these open source papers where you can kind of dive into some of the details about the model changes and with some of these Chinese labs, for example,

Starting point is 00:55:58 where they're making tweaks on the order of the architecture itself with, like, better caching behavior, for example, or like more efficient attention functions that make a big difference? Do you feel like these are examples of things like you mentioned earlier, where it's basically, in the grand scheme of things, basically if you throw more compute at it, this is all kind of a rounding error? Or do you think it will take some number of these very clever architectural changes to actually get to AGI? Like in the way that the first person who came up with the transformer made like a particular transform, you know, literally transformative change. Like, will it take some of that?

Starting point is 00:56:27 Or do you think it just you keep doing the thing we're doing to make it bigger? I think it'll be a mix. Yeah. I think I, like my guess is you'll keep tweaking things. The more compute you put in, the more like worthwhile it is to like do those experiments to like figure it out. You know, I mean, inference, for instance, I think we haven't talked about. But like, you also want to serve these models to a lot of people. So there's a lot of changes you can make to make inference cheaper.

Starting point is 00:56:47 And that depends on the details of your inference stack and the chips you're serving inference on, et cetera. So do you as someone focused on pre-training have to think a lot about inference or is it kind of like, you just do your thing, you make the loss go down and then hand it off and someone else makes that happen. Oh no, I think a ton about inference because it basically like the problem inference is solving. Like we basically determine the problem inference is solving. We give them a model and they have to like run that fast. And it's very easy to give them a model that is impossible to run fast. Oh, can you give an example of a decision you can make that could

Starting point is 00:57:14 cause that? I mean, the simplest one is stupid, but it's like, you just make the model giant. Yeah, sure, sure. Absolutely. Totally. It's trained for like a really small number of tokens. And then inference now has this giant model. Yeah, and they're hosed, basically. Yeah, I mean, you can also make things require communications in a lot of places, which would make it harder for inference. Totally. You can also just make things complicated. And like, there's no fundamental reason it's hard, but totally. There's only so many people on the inference team and, like, they have to implement it in a bunch of places. Yeah. Yeah. I definitely think of like the, like, inference is the team that I work the most closely with, like,

Starting point is 00:57:49 because we're kind of, like, co-designing models to be smart and cheap. Yeah, interesting. Particularly in a world of, like, limited compute, right? Like, the sort of the bottleneck, I think, to a large degree on our, I mean, you can see Anthropic has rate limits constantly, and people complain about it a lot. And, like, the reason is, like, there's only so much compute we can get on short notice. So, like, making your inference more efficient is, like, the way you can serve more users. And actually, like, let's say you had 100x more compute, or we

Starting point is 00:58:14 somehow didn't live in a world where compute was limited. Does that change a ton about what you do? Or is it still kind of the, well, you're just going to grab all of it, whatever compute you have and keep going down the loss curve. And you kind of, well, it's like impossible to be in the world where there is enough compute. So I think if we got like infinite compute,

Starting point is 00:58:31 the challenge would be making use of the compute, right? So like then you would start to run into these issues, like, oh, well, when one chip fails. You know, like, OK, I'm going to throw two billion chips on a run. But what happens when a chip fails? So I think we would be limited on people then. It would be like how fast can we solve the hard engineering problems to scale up.
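To make the chip-failure point concrete, here is a rough back-of-the-envelope sketch. It assumes chips fail independently at a constant rate, so the cluster-wide mean time between failures is the per-chip figure divided by the number of chips. The function name and the per-chip MTBF value are hypothetical, chosen only for illustration and not taken from the conversation.

```python
def mean_hours_between_failures(n_chips: int, chip_mtbf_hours: float) -> float:
    """Cluster-wide mean time between failures.

    Assumes independent chips with a constant failure rate
    (an illustrative assumption, not a measured figure).
    """
    return chip_mtbf_hours / n_chips

# Hypothetical per-chip mean time between failures, purely for illustration.
CHIP_MTBF_HOURS = 50_000.0

for n in (1_000, 100_000, 2_000_000_000):
    hours = mean_hours_between_failures(n, CHIP_MTBF_HOURS)
    print(f"{n:>13,} chips -> one failure roughly every {hours * 3600:.3g} seconds")
```

Under these assumptions the time between failures shrinks linearly with chip count, so at very large scale a run has to tolerate failures continuously rather than treat them as rare events.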

Starting point is 00:58:46 But I do think the change is massive. And I think people like don't realize how chip-limited AI, like research is or something right now. Like the models that everyone uses, right? If you're using like Claude Sonnet 4 or Claude Opus 4, it's like, it's our first shot. Yeah. It's our first shot at models at that scale, right? And like if you think about anything, like you could do it and you can do it again, you can do a better job.

Starting point is 00:59:06 But if you sort of imagine like 10X the compute, like you could run this every day instead of every few months, like, you know, or 100X, maybe for that, then like, yeah, it's just, it's a really, it would be a really big change to have a lot more compute. And it's coming, right? Like that's like kind of a fun part of the field. It's like every year you're like, oh, I had no compute a year ago. Right. Exactly. Yeah. Exactly. How do you think about methods like discrete diffusion? Like I saw there's like a Gemini diffusion model. And I think about that in the space I used to be in where there's a lot of discrete diffusion models being used in protein design, for example, the space where my startup was. Like, do you see that as a domain where there's going to be interesting advances happening? I'll be honest, like, we haven't done image generation.

Starting point is 00:59:43 And I think that's been like the main use for diffusion. So I've kind of had this on my like to-do list of like things I should understand for a while. And like there are people on my team who do understand it, who would have better thoughts. But like I actually don't think I understand it well enough to know. I do have it kind of in like this category of like not a total. And there's a lot of things that aren't like a huge paradigm shift. But they're like pretty big changes to how things run.

Starting point is 01:00:05 Yeah, totally. And I expect like there are some of those that will work. I don't know if it's diffusion or if it's another one. Obviously, who knows what Anthropic will do in the future, but at least in the near term, are there things where you see big areas where a startup can win in the world in which Anthropic is getting, you know, making their models better year over year. My general read is like anything that benefits from the model getting smarter. I think like on the one hand, there's like a lot. You can always be like, oh, yeah, if you're doing a startup, like all the AI labs are big companies, they'll be bigger than you and they could do that thing. But also, like, we're all working on this general system that covers a lot of different uses and the plan

Starting point is 01:00:40 is to power all the startups to do all of the individual work. So, yeah, I think, like, anything that just kind of looks like, oh, this almost works with current models, but requires, like, a bunch of work is a pretty promising direction. I think maybe the thing to watch out for is things where, like, they work now with a huge amount of work, like, to build up a scaffold, but the next generation, you're not going to need the whole scaffold you built up. Yeah, yeah.

Starting point is 01:01:02 I mean, maybe that's fine. I don't know, like, maybe you just build up the business with the scaffold, and then you don't have to do any work later and you can actually the business. But I don't know about the business side of it, but, like, it does feel. a little silly to invest a ton in that. Yeah, totally. What about on the flip side, are there things in your training stack where you're like,

Starting point is 01:01:19 man, if there was a company that solved X problem, I would totally buy their product? Yeah, there's like a ton. I do think that like probably most of these, like the way I would probably structure it would be like almost like making something, but then consulting with the company, like, offering a service to companies for free. Particularly for like companies that are scaling really fast. You're almost always limited on like how many people you can have. So if you can like, even if you could hire people to do it yourself, actually being able to contract someone else to do it where like they're managing

Starting point is 01:01:44 and hire all the people and like deal with the organizational side could be useful. I mean, there's a huge amount of stuff. One that jumps to mind, we talked about like chips that do math incorrectly. Like it would be lovely if there was some startup that like you could just say like, here are my chips, confirm they're all perfect. And if they're not, let me know exactly what went wrong on like what fraction of them. And like, I can tell you the math is wrong. But I couldn't really, I don't really know enough details of chips to be like this chip failed

Starting point is 01:02:09 because this particular, like, low-level component was, like, wired wrong or, like, got hit by a gamma-ray. I don't know what causes it. You could always go, like, a bunch deeper. I mean, the other thing I'd maybe just push startups on is thinking a little bit about, like, this is maybe less technical, but just, like, what happens once we get AGI and, like, how to make sure that, like, goes well for the world or something? Like, my expectation is, like, if you actually automate almost everything a person can do, the amount of economic growth there is just, like, truly enormous.

Starting point is 01:02:36 And I would think a little more about, like, How do you make this help the world versus not? I think it's going to be like plenty of economic success or something as a result of it anyway. Yeah, absolutely, yeah. Last question I want to ask you is around, if you were rewinding back to where we started, like 10 years ago, you're a student, you're pivoting into AI from kind of economics work you were thinking about. And, you know, all sorts of things you probably did in those early days had some kind of compounding return for you as you developed into the role you're in now. What advice would you give to students as they think about entering the workforce, especially today?

Starting point is 01:03:09 learning skills are going to be useful and maybe getting themselves jobs like the one you have right now 10 years later. It's hard because I think the timing is very different. Like I just think we're like we've made a lot of progress. So like what I would do 10 years ago is different from what I would do today. But I think certainly if I went back 10 years ago, I would be like focus on AI. It's like the most important thing and particularly focus on engineering, which I think wouldn't have seemed obvious to me at the time that like the important thing was these engineering skills and not the like math and theoretical understanding of like, you know,

Starting point is 01:03:39 SVMs and all the kind of standard ML literature. I think today I would probably focus a bunch on the like engineering and on the like figuring out what to do with AGI as sort of the two like main things that feel top of mind for me. Let's call it there. Thanks so much Nick appreciate it.
