Big Technology Podcast - Generative AI 101: Tokens, Pre-training, Fine-tuning, Reasoning — With Dylan Patel
Episode Date: April 23, 2025. Dylan Patel is the founder and CEO of SemiAnalysis. He joins Big Technology Podcast to explain how generative AI works, covering the inner workings of tokens, pre-training, fine-tuning, open source, ...and reasoning. We also cover DeepSeek’s efficiency breakthrough, the race to build colossal AI data centers, and what GPT-5’s hybrid training approach could unlock. Hit play for a masterclass you’ll want to send to every friend puzzled (or excited) about the future of AI. --- Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. Want a discount for Big Technology on Substack? Here’s 40% off for the first year: https://tinyurl.com/bigtechnology Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
If you've ever wondered how generative AI works and where the technology is heading,
this episode is for you.
We're going to explain the basics of the technology and then catch up with modern-day advances like reasoning
to help you understand exactly how it does what it does and where it might advance in the future.
That's coming up with SemiAnalysis founder and chief analyst Dylan Patel right after this.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation about the tech world and beyond.
We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in
semiconductor and generative AI research and someone I've been looking forward to speaking with
for a long time. Now, I want this to be an episode that A. helps people learn how generative
AI works, and B, is an episode that people will send to their friends to explain to them how
generative AI works. I've had a couple of those that I've been sending to my friends and colleagues
and counterparts about what is going on within generative AI.
That includes one, this three-and-a-half-hour-long video from Andrej Karpathy explaining
everything about training large language models.
And the second one is a great episode that Dylan and Nathan Lambert from the Allen Institute
for AI did with Lex Fridman.
Both of those are three hours plus, so I want to do ours in an hour.
And I'm very excited to begin.
So Dylan, it's great to see you and welcome to the show.
Thank you for having me.
Great to have you here. Let's just start with tokens. Can you explain how AI researchers basically take words and then give them numerical representations and parts of words and give them numerical representations? So what are tokens?
Tokens are in fact like chunks of words, right? In the human way, you can think of like syllables, right?
Syllables are often, you know, viewed as chunks of words. They have some meaning. It's the base level of speaking, right? Syllables.
Now, for models, tokens are the base level of output. They're all about compressing, you know, sort of, this is the most efficient representation of language.
From my understanding, AI models are very good at predicting patterns. So if you give it one, three, seven, nine, it might know the next number is going to be 11. And so what it's doing with tokens is taking words, breaking them down to their component parts,
assigning them a numerical value, and then, basically in its own language,
learning to predict what number comes next, because computers are better at numbers,
and then converting that number back to text. And that's what we see come out. Is that accurate?
Yeah. And each individual token is actually, it's not just like one number, right? It's multiple
vectors. You could think of it like, well, the tokenizer needs to learn that king and queen are actually
extremely similar on most dimensions, in terms of the English language, right? Except
there is like one vector in which they're super different, right? Because a king is a male and a queen
is a female. Right. And then from there, like, you know, in language, oftentimes kings are
considered conquerors and, you know, all of these things, and these are just like
historical things. Right. So a lot of the text around them, while they're both like royal, regal, right,
like, you know, monarchy, et cetera, there are many vectors in which they differ.
So, like, it's not just, like, converting a word into one number, right?
It's like converting it into multiple vectors, and each of these vectors, the model learns
what it means, right?
You don't initialize the model with, like, hey, you know, king means male, monarch, blah,
and it's associated with, like, war and conquering, because that's what all the writing about
kings is in history and all that, right?
Like, people don't talk about the daily lives of kings that much;
they mostly talk about their wars and conquests and stuff.
And so like there will be each of these numbers in this embedding space, right, will be assigned over time as the model reads the Internet's text and trains on it, it'll start to realize, oh, King and Queen are exactly similar on these vectors, but very different on these vectors.
And these vectors aren't, you don't explicitly tell the model, hey, this is what this vector is for, but it could be like, you know, it could be as much as like, one vector could be like, is it a building or not, right?
And it doesn't actually know that you don't know that ahead of time.
It just happens to in the latent space.
And then all these vectors sort of relate to each other.
But yeah, these numbers are an efficient representation of words because you can do math on them, right?
You can multiply them, you can divide them.
You can run them through an entire model.
And your brain does something similar, right?
When it hears something, it converts that into a frequency in your ears.
And then that gets converted to frequencies that go through your brain, right?
This is the same thing as a tokenizer, right?
Although it's like obviously a very different medium of compute, right?
Ones and zeros for computers, with, you know, binary and multiplication, et cetera, being more efficient.
Whereas humans' brains are more like analog in nature and, you know, think more in waves and patterns in different ways.
While they are very different, it is a tokenizer, right?
Like language is not actually how our brain thinks.
It's just a representation for it to, you know, reason over.
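To make the vector idea above concrete, here is a minimal, illustrative Python sketch. The dimensions and numbers are invented for this example; a real model learns thousands of unlabeled dimensions from data rather than anything hand-assigned.

```python
import numpy as np

# Toy embeddings: each token maps to a small vector of made-up numbers.
# The dimensions here are invented labels ("royalty-ness", "gender",
# "building-ness") purely for illustration.
embeddings = {
    "king":  np.array([0.9,  0.3, 0.0]),
    "queen": np.array([0.9, -0.3, 0.0]),
    "wall":  np.array([0.1,  0.0, 0.9]),
}

def cosine_similarity(a, b):
    """Higher means the two vectors point in more similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" and "queen" end up close (they differ mainly on one coordinate);
# "wall" is far from both.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["wall"]))
```

In a trained model those coordinates are learned from text rather than assigned, which is the point being made about the latent space.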
Yeah, so that's crazy.
So the tokens are the efficient representation of words, but more than that, the models are also learning the way that all these words are connected. And that brings us to pre-training. From my understanding, pre-training is when you take basically the entire internet's worth of text, and you use that to teach the model these representations between each token. So therefore, like we talked about, if you gave a model
"the sky is," and the next word is typically blue in the pre-training data, which is basically all of
the English language, all of language on the internet, it should know that the next token is blue.
So what you do is you want to make sure that when the model is outputting information,
it's closely tied to what that next value should be. Is that a proper description of what
happens in pre-training? Yeah, I think that's pretty, that's the objective function,
which is just to reduce loss, i.e. how often is the token predicted incorrectly versus correctly, right?
Right. So if it's like this, if you said the sky is red, that's not the most probable outcome.
So that would be wrong. But that text is on the internet, right? Like, because the Martian sky is red and there's all these books about Mars and sci-fi.
Right. So how does the model then learn how to, you know, figure this out? And in what context is it accurate to say blue versus red?
Right. So, I mean, first of all, the model doesn't just output one token, right? It outputs a distribution.
And it turns out the way most people take it is they take the top, the top-k, i.e., the highest probability.
So yes, blue is obviously the right answer if you give it to anyone on this planet.
But there are situations and context where the sky is red is the appropriate sentence,
but that's not just in isolation, right? It's like if the prior passage is all about Mars and all this,
and then all of a sudden it's like, and that's like a quote from a Martian settler and it's like
the sky is. And then the correct token is actually red, right? The correct word. And so it has
to know this through the attention mechanism, right? If it was just the sky is blue, always
you're going to output blue because blue is, let's say, 80%, 90%, 99% likely to be the right
option. But as you start to add context about Mars or any other planet, right? Other planets
have different colored atmospheres, I presume, this distribution starts
to shift, right? If I add, "We're on Mars. The sky is," you know, then all of a sudden blue goes down
from 99%, because in the prior context window, right, the text that you sent to the model,
the attention of it, all of a sudden it realizes "the sky is" is preceded by the stuff about
Mars. Now, blue rockets down to, let's call it, 20% probability and red
rockets up to 80% probability, right? The model outputs that distribution, and then most people just
end up taking the top probability and outputting it to the user.
And how does the model learn that? It's the attention
mechanism, right? And this is sort of the beauty; the attention mechanism is
the beauty of modern large language models. It takes the relational value, you know,
in this vector space between every single token, right? So, "the sky is blue,"
right? Like, when I think about it, yes, blue is the next token after "the sky is," but in
a lot of older-style models, you would just predict the exact next word.
So after "sky," obviously, it could be many things.
It could be blue, but it could also be like "scraper," right?
You know, skyscrapers, yeah, that makes sense.
But what attention does is take all of these various values, the query, the key, and the value,
which represent what you're looking for, where you're looking, and what that value is,
and you're calculating mathematically what the relationship is between all of these tokens.
And so going back to the king-queen representation, right?
The way these two words interact is now calculated, right?
And the way that every word in the entire passage you sent relates to every other word is calculated and tied together,
which is why models have challenges with how long a context you can send them,
how many documents you can send them, right?
Because if you're sending them, you know, just the question, like what color is the sky?
okay, it only has to calculate the attention between, you know, those words, right?
But if you're sending it like 30 books with like insurance claims and all these other things
and you're like, okay, figure out what's going on here, is this a claim or not, right?
And in the insurance context, all of a sudden it's like, okay, I've got to calculate
the attention of not just like the last five words to each other, but I have to calculate
every, you know, 50,000 words to each other, right?
Which then ends up being a ton of math.
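As a rough illustration of the query, key, and value math described above (a single attention head in plain NumPy, not any production implementation), scaled dot-product attention can be sketched like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over a short sequence.

    Q, K, V: arrays of shape (seq_len, d). Every token attends to every
    other token, which is why the cost grows quadratically with sequence
    length (the "ton of math" for 50,000-word documents).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # how much each token looks at each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V                            # weighted mix of value vectors

# Tiny example: 4 tokens, 8-dimensional vectors (random numbers, for illustration only).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

The score matrix has one entry for every pair of tokens, which is the quadratic blow-up that makes very long contexts expensive.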
Back in the day, actually, the best language models were a different architecture entirely, right?
But then at some point, you know, Transformers, i.e., large language
models, which are based on transformers primarily, rocketed past in capabilities because
they were able to scale and because the hardware got there.
And then we were able to scale them so much that we were able to just put text
in them, and not just a lot of text or a lot of books, but the entire internet, which, you know,
one could view the internet oftentimes as a microcosm of all human culture and learnings
and knowledge to many extents, because most books are on the internet, most
papers are on the internet. Obviously there's a lot of things missing on the internet, but this is the sort of modern, you know, magic of it. It was sort of like three different things coming all together at once, right?
An efficient way for models to relate every word to each other, the compute necessary to scale the data large enough, and then someone actually pulling the trigger to do that, right, at the scale that got to the point where it was useful, right? Which was sort of like GPT-3.5 level or 4 level, right, where it became
extremely useful for normal humans to use, you know, chat models.
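The earlier point about the next-token distribution shifting with context, blue at roughly 99% in isolation but red winning once Mars is in the context, can be sketched as sampling from a probability distribution. The probabilities below are illustrative, not from any real model:

```python
import numpy as np

def sample_next_token(probs, top_k=1, temperature=1.0, seed=0):
    """Pick a next token from a probability distribution.

    top_k=1 is the greedy "just take the top probability" behavior
    described above; a larger top_k keeps more candidates in play.
    """
    rng = np.random.default_rng(seed)
    tokens, p = zip(*sorted(probs.items(), key=lambda kv: -kv[1])[:top_k])
    p = np.array(p) ** (1.0 / temperature)
    p = p / p.sum()
    return rng.choice(tokens, p=p)

no_context   = {"blue": 0.99, "red": 0.005, "gray": 0.005}  # "The sky is ..."
mars_context = {"blue": 0.20, "red": 0.75, "gray": 0.05}    # "... We're on Mars. The sky is ..."

print(sample_next_token(no_context))    # blue
print(sample_next_token(mars_context))  # red
```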
Okay.
And so why is it called pre-training?
So pre-training is sort of called that because it is what happens, you know,
before the actual training of the model, right?
The objective function in pre-training is to just predict the next token, but predicting
the next token is not what humans want to use AIs for, right?
I want it to ask a question and answer it.
But in most cases, asking a question does not necessarily mean that the next most likely
token is the answer, right?
Oftentimes it is another question, right?
For example, if I ingested the entire SAT, you know, and I asked a question, the next
tokens would be the answer choices, like, A is this, B is this, C is this, D is
this.
Like, no, I just want the answer, right?
And so pre-training is, the reason it's called pre-training is because you're ingesting
humongous volumes of text no matter the use case.
and you're learning the general patterns across all of language, right?
I don't actually know that King and Queen relate to each other in this way.
And I don't know that King and Queen are opposites in these ways, right?
And so this is why it's called pre-training is because you must get a broad general understanding
of the entire sort of world of text before you're able to then do post-training or fine-tuning,
which is let me train it on more specific data that is specifically useful for what I want it to do.
whether it's, hey, in chat-style applications, you know, when I ask a question, give me the answer,
or in other applications, if someone says, teach me how to build a bomb,
well, obviously, no, I'm not going to help you build a bomb, because we don't want the model to teach people how to build bombs.
So, you know, it's sort of got to learn to do this.
And it's not like, when you're doing this pre-training, you're filtering out all of this data.
Because in fact, there's a lot of good, useful data around how to build bombs, because there's a lot of useful information on, like, hey,
C4 chemistry, and, you know, people want to use it for chemistry, right? So you don't
want to just filter out everything so that the model doesn't know anything about it. But at the
same time, you don't want it to output, you know, how to build a bomb. So there's like a fine
balance here. And that's why pre-training is defined as "pre": because you're still
letting it do things and teaching it things and inputting things into the model that are
theoretically quite bad, right? For example, books about killing or war tactics
or what have you, right? Things that plausibly you could see like, oh, well, maybe that's not
okay. Or wild descriptions of really grotesque things all over the internet. But
you want the model to learn these things, right? Because first you build the general understanding
before you say, okay, now that you've got a general framework of the world, let's align you
so that you, with this general understanding of the world, can figure out what is useful for people,
what is not useful for people. What should I respond on? What should I not respond on?
So what happens then in the training process? Is
the training process that the model is attempting to make the next prediction and then just
trying to minimize loss as it goes? Right, right. I mean, basically,
loss is how often you're wrong versus right, in the most simple terms, right?
You'll run through passages through the model, and you'll see how often the model
got it right. When it got it right? Great. Reinforce that. When it got it wrong,
let's figure out which neurons, quote-unquote "neurons,"
in the model you can tweak to then fix the answer, so that when you go through it again,
it actually outputs the correct answer.
And then you move the model slightly in that direction.
Now, obviously, the challenge with this is, you know, I can come up with a
simplistic way where all the neurons will just output "the sky is blue"
every single time it says "the sky is."
But then when it goes to, you know, hey, the color blue is commonly used on walls because it's soothing, right?
And it's like, oh, what's the next word?
Is soothing, right?
So, like, that is a completely different representation.
And to understand that blue is soothing and that the sky is blue and those things aren't actually related, but they are related to blue is like very important.
And so, you know, oftentimes you'll run through the training data set multiple times, right?
because the first time you see it, oh, great, maybe you memorized that the sky is blue,
and you memorized the wall is blue, and when people describe art and oftentimes use the
color blue, it can be representations of art or the wall, right? And so over time, as you
go through all this text in pre-training, yes, you're minimizing loss initially by just
memorizing, but over time, because you're constantly overwriting the model, it starts to learn
the generalization, right, i.e., blue is a soothing color, also
represents the sky, also used in art for either of those two motifs, right? And so that's sort of the
goal of pre-training is you don't want to memorize, right? Because that's, you know, in school you
memorize all the time. And that's not useful because you forget everything you memorize. But if you
get tested on it then, and then you get tested on it six months later, and then again, six months
later after that or however you do it, ends up being, oh, you don't actually like memorize that
anymore, you just know it innately, and you've generalized on it. And that's the real goal
that you want out of the model, but that's not necessarily something you can just measure, right?
And therefore, loss is something you can measure, i.e., for this group of text, right? Because
you train the model in steps. Every step, you're inputting a bunch of text.
You're trying to see where it predicts the right token; where it didn't predict the right token,
let's adjust the neurons. Okay, onto the next batch of text. And you'll do these batches
over and over and over again across trillions of words of text, right? And as you step through
and then you're like, oh, well, I'm done. But I bet if I go back to the first group of text,
which is all about the sky being blue, it's going to get the answer wrong. Because maybe later
on in the training it saw some passages about sci-fi and how the Martian sky is red.
So it'll overwrite. But then over time, as you go through the data multiple times,
as you see it on the internet multiple times, you see it in different books multiple times,
whether it be scientific, sci-fi, whatever it is, it starts to learn
that representation of, like, oh, when it's on Mars, it's red, because the sky on Mars is red,
because the atmospheric makeup is this way, whereas the atmospheric makeup on Earth is a different
way.
And so that's sort of like the whole point of pre-training is to minimize loss, but the nice
side effect is that the model initially memorizes, but then it stops memorizing and it
generalizes.
And that's the useful pattern that we want.
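The loop Dylan describes, predict the next token, measure how wrong you were, nudge the weights, is next-token cross-entropy at its core. Here is a toy sketch of the objective with invented probabilities, not an actual training pipeline:

```python
import numpy as np

def cross_entropy_loss(predicted_probs, correct_token):
    """Loss is low when the model put high probability on the token
    that actually came next, and high when it didn't."""
    return -np.log(predicted_probs[correct_token])

# Early in training: the model spreads probability almost uniformly.
early = {"blue": 0.25, "red": 0.25, "green": 0.25, "soothing": 0.25}
# Later: after many passes over the data, "blue" dominates after "The sky is".
later = {"blue": 0.92, "red": 0.05, "green": 0.02, "soothing": 0.01}

print(cross_entropy_loss(early, "blue"))  # ~1.39, high loss
print(cross_entropy_loss(later, "blue"))  # ~0.08, low loss
# Each training step nudges the weights ("neurons") a little so that,
# averaged over trillions of tokens, this loss goes down.
```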
Okay, that's fascinating.
We've touched on post-training for a bit, but just to recap: post-training is, so you have a model that's good at predicting the next word.
And in post-training, you sort of give it a personality by inputting sample conversations to make the model want to emulate the values that you want it to take on.
Yeah, so post-training can be a number of different things.
The most simple way of doing it is, yeah, pay humans to label a bunch of data, take a lot of data,
a bunch of example conversations, et cetera, and input that data and train on that at the end,
right? And so that example data is useful, but this is not scalable, right? Like using humans
to train models is just so expensive, right? So then there's the magic of sort of reinforcement
learning and other synthetic data technologies, right, where the model is helping teach the model, right?
So you have many models in a sort of in a post-training where, yes, you have some example
human data, but human data does not scale that fast, right? Because the internet is trillions
and trillions of words out there, whereas, you know, even if you had, you know, Alex and I write
words all day long for our whole lives, we would have millions or, you know, hundreds of millions
of words written, right? It's nothing. It's like orders of magnitude off in terms of the number of
words required. So then you have the model, you know, take some of this example data, and you have
various models that are surrounding the main model that you're training, right? And these can be
policy models, right, teaching it, hey, is this what you want or that what you want? Reward models,
right? Like, is that good response? Is that a good response? Is that a bad response? You have value
models like, hey, grade this output, right? And you have all these different models working in
conjunction to say, you know, different companies have different objective functions, right? In the case
of Anthropic, they want their model to be helpful, harmless, and safe, right? So, be
helpful, but also don't harm people or anyone or anything, and then, you know, be safe, right?
In other cases, like Grok, right, Elon's model from xAI, it actually just wants to be helpful, and maybe
it has a little bit of a right lean to it, right? And for other folks, right, like, you know,
I mean, most AI models are made in the Bay Area, so they tend to just be left-leaning, right?
But also the internet in general is a little bit left-leaning because it skews younger than older.
And so all these things sort of affect models. But it's not just
around politics, right? Post-training is also just about teaching the model. If I say, like,
"the movie where the princess has a slipper and it doesn't fit," it's like, well, if I said that
to a base model that was just pre-trained, the answer wouldn't be, "Oh, the movie you're
looking for is Cinderella." You know, it would only realize that once it goes
through post-training, right? Because a lot of times people just throw garbage into
the model, and then the model still figures out what you want, right? And this is part of what post-training
is. Like, you can just do stream of consciousness into models, and oftentimes it'll figure out
what you want. Like, you know, if it's a movie that you're looking for, or if it's help
answering a question, or if you throw a bunch of, like, unstructured data into it and then ask
it to make it into a table, it does this, right? And that's because of all these different
aspects of post-training, right? Example data, but also, you know, generating a bunch of data
and grading it and seeing if it's good or not, and whether it matches the various policies
you want. Is it helpful? You know, a lot of times grading can be based on multiple factors, right? There can
be a model that says, hey, is this helpful? Hey, is this safe? And what is safe? Right? So then that
model for safety needs to be tuned on human data, right? So there's, it is a quite complex thing,
but the end goal is to be able to get the model to output in a certain way. Models aren't always
about just humans using them either, right? There can be models that are just focused on like,
hey, like, you know, if it doesn't output code, you know, yes, it was trained on the whole
internet because the person's going to talk to the model using text, but if it doesn't output
code, you know, penalize it, right? Now, all of a sudden, the model will never output text
ever again and only output code. And so these sorts of models exist too. So post-training
is not just a univariable thing, right? It's, what variables do you want to target? And so that's why
models have different personalities from different companies, it's why they target different use cases,
and why, you know, it's not just one model that rules them all, but actually many.
That's fascinating. So that's why we've seen so many different models with different personalities: because
it all happens in the post-training moment. And when you talk about
giving the models examples to follow,
that's what reinforcement learning with human feedback is:
the humans give some examples,
and then the model learns to emulate what the human trainer
is interested in having them embody.
Is that right?
Yeah, exactly.
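One common ingredient in this reinforcement-learning-from-human-feedback recipe is a reward model trained on human preference pairs. The sketch below is a toy version of that idea with made-up scores, not any lab's actual setup:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style pairwise loss: the reward model is pushed to
    score the human-preferred response above the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Hypothetical scores the reward model currently assigns to two responses
# for the prompt "How do I reset my password?".
score_helpful_answer = 2.1   # direct, helpful answer (human-preferred)
score_evasive_answer = 0.3   # vague non-answer (human-rejected)

print(preference_loss(score_helpful_answer, score_evasive_answer))  # small loss: ranking is right
print(preference_loss(score_evasive_answer, score_helpful_answer))  # large loss: ranking is wrong

# A trained reward model can then grade the main model's outputs at scale,
# standing in for human labelers, who, as noted above, are too expensive to use everywhere.
```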
Okay, great.
All right, so in the first half we've covered what training is, what tokens are,
what loss is, and what post-training is. Post-training, by the way, is
also called fine-tuning.
We've also covered reinforcement learning with human feedback.
We're going to take a quick break, and then we're going to talk about reasoning.
We'll be back right after this.
Hey, everyone, let me tell you about The Hustle Daily Show, a podcast filled with business, tech
news, and original stories to keep you in the loop on what's trending.
More than 2 million professionals read The Hustle's daily email for its irreverent and
informative takes on business and tech news.
Now, they have a daily podcast called The Hustle Daily Show, where their team of writers
break down the biggest business headlines in 15 minutes or less and explain why you should care
about them. So search for The Hustle Daily Show in your favorite podcast app, like the one you're
using right now. And we're back here on Big Technology Podcast with Dylan Patel. He's the founder
and chief analyst at SemiAnalysis. He actually has a great analysis of NVIDIA's recent
GTC conference, which we covered on a recent episode. You can find SemiAnalysis
at semi-analysis.com.
It is both content and consulting.
So I definitely check in with Dylan for all of those needs.
And now we're going to talk a little bit about reasoning.
Because a couple months ago, and Dylan, this is really where, you know, I sort of entered
the picture of watching your conversation with Lex, with Nathan Lambert, about what
the difference is between reasoning and your traditional LLMs, large language models.
If I gathered it right from your conversation, what reasoning is, is basically, instead of the model predicting the next word based off of its training, it uses the tokens to spend more time figuring out what the right answer is, and then coming out with a new prediction.
I think Karpathy does a very interesting job in the YouTube video talking about how models think with tokens.
the more tokens there are, the more compute they use because they're running these
predictions through the transformer model, which we discussed, and therefore they can come to
better answers. Is that the right way to think about reasoning?
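Alex's framing here, that each generated token costs roughly the same compute, so more thinking tokens means more total compute per answer, can be sketched with a common back-of-the-envelope estimate. The parameter count is hypothetical and the 2-FLOPs-per-parameter-per-token rule is only an approximation:

```python
def generation_flops(n_params, n_output_tokens):
    """Rough rule of thumb: ~2 FLOPs per parameter per generated token.
    This ignores attention-over-context costs, so it is only an estimate."""
    return 2 * n_params * n_output_tokens

params = 70e9  # a hypothetical 70B-parameter model

direct_answer  = generation_flops(params, n_output_tokens=20)     # e.g., "The answer is 578."
with_reasoning = generation_flops(params, n_output_tokens=2_000)  # long chain of thought first

print(f"direct:    {direct_answer:.2e} FLOPs")
print(f"reasoning: {with_reasoning:.2e} FLOPs (~{with_reasoning / direct_answer:.0f}x more compute)")
```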
So I think that humans are also fantastic at pattern matching, right? We're really good at
recognizing things. But a lot of tasks, it's not like an immediate response, right? We are
thinking. Whether that's thinking through words out loud, thinking through words in an inner
monologue in your head, or just processing somehow, and then we know the answer, right?
And this is the same for models, right? Models are horrendous at math, right? Historically they have
been, right? You could ask it, you know, is 9.11 bigger than 9.9, and it would say yes, it's bigger,
even though everyone knows that 9.11 is way smaller than 9.9, right? And that's just
a thing that happened in models because they didn't think or reason, right? And it's the same
for you, Alex, right? Or myself, right? If someone asked me, you know,
17 times 34, I'd be like, I don't know, not right off the top of my head, but, you know,
give me a little bit of time, I can do some long-form multiplication and I can get
the answer, right? And that's because I'm thinking about it. And this is the same thing
with reasoning for models, which is, you know, when you look at a transformer, every word,
every token output, has the same amount of compute behind it, right? I.e., when I'm saying
"the sky is blue," the "blue" and the "the" have the same amount
of compute to generate, right? And this is not exactly what you want, right? You want to
actually spend more time on the hard things and not on the easy things. And so reasoning models
are effectively teaching, you know, large pre-trained models to do this, right? Hey, think through
the problem. Hey, output a lot of tokens. Think about it. Generate all this text. And then when you're
done, you know, start answering the question, but now you have all of this stuff you generated
in your context, right? And that stuff you generated is helpful, right? It could be
all sorts of things, you know, just like any human's thought patterns are, right? And so
this is the sort of new paradigm that we've entered maybe six months ago, where
models now will think for some time before they answer. And this enables much better
performance on all sorts of tasks, whether it be coding or math or understanding science or
understanding complex social dilemmas, right? All sorts of different topics they're much,
much better at. And this is done through post-training, similar to the reinforcement learning
with human feedback that we mentioned earlier. But also, there's other forms of post-training,
and that's what makes these reasoning models. Before we head out, I want to hit on a couple things.
first of all, the growing efficiency of these models.
So I think one of the things that people focused on with DeepSeek
was that it was just able to be much more efficient
in the way that it generates answers.
And there was this, obviously, this big reaction to Nvidia stock
where it fell 18% on the Monday after the DeepSeek weekend
because people thought we wouldn't need as much compute.
So can you talk a little bit about how models are becoming more efficient
and how they're doing it?
So there's a variety of things here. The beauty of AI is not just that we continue to build new capabilities, right?
Because those new capabilities are going to be able to benefit the world in many ways.
And there's a lot of focus on those.
But there's also a lot of focus on, well, getting to that next level of capabilities via the scaling laws, i.e., the more compute and data I spend, the better the model gets.
But then the other vector is, well, can I get to the same level with less compute and data, right?
And those two things go hand in hand, because if I can get to the same level with less compute and data,
then I can spend that more compute and data and get to a new level, right?
And so AI researchers are constantly looking for ways to make models more efficient,
whether it be through algorithmic tweaks, data tweaks, tweaks in, you know, how you do reinforcement learning, so on and so forth.
Right. And so when we look at models across history, they've constantly gotten cheaper and cheaper and cheaper, right, at a stupendous rate, right?
And so one easy example is GPT-3, right?
Because there's GPT-3, 3.5 Turbo, Llama 2 7B, Llama 3.1, Llama 3.2, right?
As these models have evolved, we've gone from, hey, it costs $60 for a million tokens, to it costs like 5 cents now for the same quality of model.
And the model has shrunk dramatically in size as well.
And that's because of better algorithms, better data, et cetera.
And now what happened with DeepSeek was similar. You know, OpenAI had GPT-4, then they had 4 Turbo, which was half the cost, then they had 4o, which was again half the cost.
And then Meta released Llama 405B, open source, and so the open source community was able to run that.
And that was again roughly half the cost, or 5x lower cost than 4o, which was lower than 4 Turbo and 4.
But DeepSeek came out with another tier, right?
So when we looked at GPT-3, the cost fell 1,200x from GPT-3's initial cost to what you can get with Llama 3.2 3B today, right?
And likewise, when we look from GPT-4 to DeepSeek V3, it's fallen roughly 600x in cost, right?
So we're not quite at that 1,200x, but it has fallen 600x in cost, from $60 to, you know, about a dollar, right?
Or to less than a dollar, sorry, 60x.
And so you've got this massive cost decrease, but it's not necessarily out of bounds, right?
We've already seen it.
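To see how cost multiples like these map to per-million-token prices, here is the plain arithmetic. The starting price and the ratios are the round figures quoted in the conversation, used purely for illustration:

```python
def price_after_drop(start_price_per_million, drop_factor):
    """Price per million tokens after an N-x cost reduction."""
    return start_price_per_million / drop_factor

# GPT-3-era pricing quoted above: roughly $60 per million tokens.
print(price_after_drop(60, 1_200))  # ~$0.05 -> the "5 cents" figure for GPT-3-class quality
# GPT-4-era pricing, also quoted as roughly $60 per million tokens.
print(price_after_drop(60, 600))    # ~$0.10 -> a 600x drop lands around a dime
print(price_after_drop(60, 60))     # ~$1.00 -> a 60x drop lands around a dollar
```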
I think what was really surprising was that it was a Chinese company for the first time, right?
Because Google and OpenAI and Anthropic and Meta have all traded blows, right?
You know, whether it be OpenAI always being on the leading edge, or Anthropic always being on the leading edge,
or, you know, Google and Meta being close followers, sometimes with a new feature and sometimes just being much cheaper.
We have not seen this from any Chinese company, right?
And now we have a Chinese company releasing a model that's cheap.
It's not unexpected, right?
Like, this is actually within the trend line: what happened with GPT-3
is happening to GPT-4-level quality with DeepSeek.
It's more so surprising that it's a Chinese company,
and that's, I think, why everyone freaked out.
And then there was a lot of things that, like, you know, from there became a thing, right?
Like, if Meta had done this, I don't think people would have freaked out, right?
And Meta is going to release their new Llama soon enough, right? And that one is going to be, you know,
a similar level of cost decrease, probably in a similar area as DeepSeek V3, right? It's just that people
aren't going to freak out because it's an American company; it was sort of expected.
All right, Dylan, let me ask you the last question, which is, you mentioned, I think you mentioned, the bitter lesson,
which is basically that, I mean, I'm going to be kind of facetious in summing it up, but
the answer to all questions in machine learning is just to make bigger
models, and scale solves almost all problems. So it's interesting that we have this moment
where models are becoming way more efficient, but we also have massive, massive data center
buildouts. I think it would be great to hear you kind of recap the size of these data center
buildouts and then answer this question. If we are getting more efficient, why are these data
centers getting so much bigger? And what might that added scale get, in the world of generative
AI, for the companies building them?
Yeah, so when we look across the ecosystem at data center buildouts, we track all the
buildouts and server purchases and supply chains here.
And the pace of construction is incredible, right?
You can just, you can pick a state and you can see new data centers going up all across
the U.S. and around the world, right?
And so you see things like the capacity of, for example, the largest-scale training
supercomputers going from, hey, it's a few
hundred million dollars, and it wasn't even a few hundred million dollars years ago, but,
you know, for GPT-4 it was a few hundred million dollars and it's one building full of
GPUs, to GPT-4.5 and the reasoning models like o1 and o3 being done in three
buildings on the same site for billions of dollars, to, hey, these next-generation
things that people are making are tens of billions of dollars, like OpenAI's
data center in Texas called Stargate, right, with Crusoe and Oracle and et cetera, right?
And likewise applies to Elon Musk, who's building these data centers in an old factory where he's
got like a bunch of like gas generation, you know, outside and he's doing all these crazy things
to get the data center up as fast as possible, right? And you can go to just basically every
company and they have these humongous buildouts. And this is because of
the scaling laws, right? You know, 10x more compute
for linear improvement gains, right?
Or it's log-log, sorry.
But you end up with this, like, very confusing thing,
which is, like, hey, models keep getting better
as we spend more, but also the model that we had a year ago
is now done for way, way cheaper, right?
Oftentimes, 10x cheaper or more, right?
Just a year later.
So then the question is, like,
why are we spending all this money to scale?
And there's a few things here, right?
A, you can't actually make that cheaper model without making the better, bigger model,
so you can generate data to help you make the cheaper model, right?
Like, that's part of it.
But also another part of it is that, you know, if we were to freeze AI capabilities
where we were basically in, what was it, March '23, right, two years ago when GPT-4 released,
and only made them cheaper, right?
Like, DeepSeek is much cheaper, it's much more efficient,
but it's roughly the same capabilities as GPT-4.
That would not pay for all of these buildouts, right?
AI is useful today, but it is not capable of doing a lot of things, right?
But if we make the model way more efficient and then continue to scale,
and we have this like stair step, right, where we like increase capabilities massively,
make them way more efficient, increase capabilities massively, make them way more efficient.
We do the stair step, then you end up with creating all these new capabilities that could
in fact, pay for, you know, these massive AI buildouts. So with these
$10 billion data centers, no one is trying to make
chat models, right? They're not trying to make models that people chat with, just to be clear,
right? They're trying to solve things like software engineering and make it automated,
which is like a trillion dollar plus industry, right? So these are very different like sort
of use cases and targets. And so it's the bitter lesson because, yes,
you can spend a lot of time and effort making clever, specialized methods,
you know, based on intuition, and you should, right?
But these things should also just have a lot more compute thrown behind them
because if you make it more efficient, as you follow the scaling laws up,
it'll also just get better and you can then unlock new capabilities, right?
And so today, you know, a lot of AI models, the best ones from Anthropic
are now useful for, like, coding as an assistant with you, right?
You're going back and forth, you know, as time goes forward,
as you make them more efficient and continue to scale them,
the possibility is that, hey, it can code for like 10 minutes at a time.
I can just review the work and it will make me 5x more efficient, right?
You know, and so on and so forth.
And this is sort of where reasoning models and the scaling argument come in: yes, we can make it more efficient, but, you know, that alone is not going to solve the problems that we have today, right?
The Earth is still going to run out of resources.
We're going to run out of nickel because we can't make enough batteries,
and with current technology we can't replace all of the gas
and coal with renewables. All of these things are going to happen unless you continue
to improve AI and invent, or just generally research, new things. And AI helps us research new
things. Okay, this is really the last one. Where is GPT-5?
So OpenAI released GPT-4.5 recently from what they called the Orion training run. There were
hopes that Orion could be used for GPT-5, but its improvement was
not enough to really be a GPT-5.
Furthermore, it was trained on the classical method,
which is a ton of pre-training
and then some reinforcement learning with human feedback
and some other reinforcement learning,
like PPO and DPO and stuff like that.
But then along the way, right,
this model was trained last year,
along the way, another team at OpenAI
made the big breakthrough of reasoning, right?
Strawberry training, and they released o1,
and then they released o3.
And these models are rapidly getting better
with reinforcement learning with verifiable reward.
And so now, GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale,
right, like GPT-4.5, but also huge post-training scale like o1 and o3, and continuing to scale
that up, right?
And this would be the first time we see a model that was a step up in both at the same time.
And so that's what OpenAI says is coming.
They say it's coming, you know, this year, hopefully in the next three to six months,
maybe sooner.
I've heard sooner, but, you know, we'll see.
But this path of massively scaling both pre-training and post-training with reinforcement learning
with verifiable rewards should yield much better models that are capable of much more
things, and we'll see what those things are.
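Reinforcement learning with verifiable rewards, mentioned here as the post-training recipe behind o1- and o3-style models, depends on rewards that can be checked mechanically. Below is a minimal toy verifier for a math answer, an assumption-laden sketch rather than OpenAI's actual system:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Reward is 1.0 only if the final answer can be verified as correct.

    Unlike human-preference rewards, this signal can be computed at huge
    scale for domains like math and code (check the number, run the tests),
    which is what lets this kind of post-training keep scaling.
    """
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(verifiable_reward("578", "578"))  # 1.0 -> reinforce the reasoning that led here
print(verifiable_reward("612", "578"))  # 0.0 -> do not reinforce
```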
Very cool.
All right, Dylan, do you want to give a quick shout out to those who are interested in potentially
working with SemiAnalysis, who you work with, and where they can learn more?
Sure.
So we, you know, at semi-analysis.com, we have the public stuff, which is
all these reports that are
pseudo-free. But then most of our work is done directly for clients. There's these
datasets that we sell around every data center in the world, servers, all the compute, where it's
manufactured, how many, where, what's the cost, and who's doing it. And then we also do a lot
of consulting. We've got people who have worked all the way from ASML, which makes lithography
tools, all the way up to, you know, Microsoft and Nvidia, you know, making models and
doing infrastructure. And so we've got this whole gamut of folks. There's
roughly 30 of us across the world in the US, Taiwan, Singapore, Japan, France, Germany, and Canada.
So, you know, there's a lot of engagement points.
But if you want to reach out, just go to the website, you know, go to one of those specialized
pages of models or sales and reach out.
And that'd be the best way to sort of interact and engage with us.
But for most people, just read the blog, right?
Like, I think, unless you have specialized needs, unless you're a company in
the space or you're an investor in the space, you know, if you just want to be informed,
you can read the blog, and it's free, right? I think that's the best option for most people.
Yeah, well, I will attest the blog is magnificent, and Dylan, it's really a thrill to get a chance to meet you and talk through these topics with you.
So thanks so much for coming on the show. Thank you so much, Alex.
All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.