Big Technology Podcast - Generative AI 101: Tokens, Pre-training, Fine-tuning, Reasoning — With Dylan Patel
Episode Date: April 23, 2025. Dylan Patel is the founder and CEO of SemiAnalysis. He joins Big Technology Podcast to explain how generative AI works, covering the inner workings of tokens, pre-training, fine-tuning, open source, ...and reasoning. We also cover DeepSeek’s efficiency breakthrough, the race to build colossal AI data centers, and what GPT-5’s hybrid training approach could unlock. Hit play for a masterclass you’ll want to send to every friend puzzled (or excited) about the future of AI. --- Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. Want a discount for Big Technology on Substack? Here’s 40% off for the first year: https://tinyurl.com/bigtechnology Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
If you've ever wondered how generative AI works and where the technology is heading,
this episode is for you.
We're going to explain the basics of the technology and then catch up with modern-day advances like reasoning
to help you understand exactly how it does what it does and where it might advance in the future.
That's coming up with SemiAnalysis founder and chief analyst Dylan Patel right after this.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation about the tech world and beyond.
We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in
semiconductor and generative AI research and someone I've been looking forward to speaking with
for a long time. Now, I want this to be an episode that A. helps people learn how generative
AI works, and B, is an episode that people will send to their friends to explain to them how
generative AI works. I've had a couple of those that I've been sending to my friends and colleagues
and counterparts about what is going on within generative AI.
That includes one, this three-and-a-half-hour-long video from Andrej Karpathy explaining
everything about training large language models.
And the second one is a great episode that Dylan and Nathan Lambert from the Allen Institute
for AI did with Lex Fridman.
Both of those are three hours plus, so I want to do ours in an hour.
And I'm very excited to begin.
So Dylan, it's great to see you and welcome to the show.
Thank you for having me.
Great to have you here. Let's just start with tokens. Can you explain how AI researchers basically take words and then give them numerical representations and parts of words and give them numerical representations? So what are tokens?
Tokens are in fact like chunks of words, right? In the human way, you can think of like syllables, right?
Syllables are often, you know, viewed as chunks of words. They have some meaning. It's the base level of speaking, right? Syllables.
Now, for models, tokens are the base level of output. They're all about compressing, you know, sort of, this is the most efficient representation of language.
From my understanding, AI models are very good at predicting patterns. So if you give it one, three, seven, nine, it might know the next number is going to be 11. And so what it's doing with tokens is taking words, breaking them down to their component parts,
assigning them a numerical value, and then, basically in its own language,
learning to predict what number comes next, because computers are better at numbers,
and then converting that number back to text. And that's what we see come out. Is that accurate?
Yeah. And each individual token is actually, it's not just like one number, right? It's multiple
vectors. You could think of it like, well, the tokenizer needs to learn that king and queen are actually
extremely similar on most dimensions, in terms of the English language, right? Except
there is like one vector in which they're super different, right? Because a king is a male and a queen
is a female. Right. And then from there, like, you know, in language, oftentimes kings are
considered conquerors and, you know, all of these things, and these are just like
historical things. Right. So a lot of the text around them, while they're both like royal, regal, right,
like, you know, monarchy, et cetera, there are many vectors in which they differ.
So, like, it's not just, like, converting a word into one number, right?
It's like converting it into multiple vectors, and each of these vectors, the model learns
what it means, right?
You don't initialize the model with, like, hey, you know, king means male, monarch, blah,
and it's associated with, like, war and conquering, because that's what all the writing about
kings is in history and all that, right?
Like, people don't talk about the daily lives of kings that much;
they mostly talk about their wars and conquests and stuff.
And so like there will be each of these numbers in this embedding space, right, will be assigned over time as the model reads the Internet's text and trains on it, it'll start to realize, oh, King and Queen are exactly similar on these vectors, but very different on these vectors.
And these vectors aren't, you don't explicitly tell the model, hey, this is what this vector is for, but it could be like, you know, it could be as much as like, one vector could be like, is it a building or not, right?
And it doesn't actually know that you don't know that ahead of time.
It just happens to in the latent space.
And then all these vectors sort of relate to each other.
But yeah, these numbers are an efficient representation of words because you can do math on them, right?
You can multiply them, you can divide them.
You can run them through an entire model.
And your brain does something similar, right?
When it hears something, it converts that into a frequency in your ears.
And then that gets converted to frequencies that go through your brain, right?
This is the same thing as a tokenizer, right?
Although it's like obviously a very different medium of compute, right?
Ones and zeros for computers, with, you know, binary and multiplication, et cetera, being more efficient.
Whereas humans' brains are more like analog in nature and, you know, think more in waves and patterns in different ways.
While they are very different, it is a tokenizer, right?
Like language is not actually how our brain thinks.
It's just a representation for it to, you know, reason over.
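To make the vector idea above concrete, here is a minimal, illustrative Python sketch. The dimensions and numbers are invented for this example; a real model learns thousands of unlabeled dimensions from data rather than anything hand-assigned.

```python
import numpy as np

# Toy embeddings: each token maps to a small vector of made-up numbers.
# The dimensions here are invented labels ("royalty-ness", "gender",
# "building-ness") purely for illustration.
embeddings = {
    "king":  np.array([0.9,  0.3, 0.0]),
    "queen": np.array([0.9, -0.3, 0.0]),
    "wall":  np.array([0.1,  0.0, 0.9]),
}

def cosine_similarity(a, b):
    """Higher means the two vectors point in more similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" and "queen" end up close (they differ mainly on one coordinate);
# "wall" is far from both.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["wall"]))
```

In a trained model those coordinates are learned from text rather than assigned, which is the point being made about the latent space.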
Yeah, so that's crazy.
So the tokens are the efficient representation of words, but more than that, the models are also learning the way that all these words are connected. And that brings us to pre-training. From my understanding, pre-training is when you take basically the entire internet's worth of text, and you use that to teach the model these representations between each token. So therefore, like we talked about, if you gave a model
"the sky is," and the next word is typically blue in the pre-training data, which is basically all of
the English language, all of language on the internet, it should know that the next token is blue.
So what you do is you want to make sure that when the model is outputting information,
it's closely tied to what that next value should be. Is that a proper description of what
happens in pre-training? Yeah, I think that's pretty, that's the objective function,
which is just to reduce loss, i.e. how often is the token predicted incorrectly versus correctly, right?
Right. So if it's like this, if you said the sky is red, that's not the most probable outcome.
So that would be wrong. But that text is on the internet, right? Like, because the Martian sky is red and there's all these books about Mars and sci-fi.
Right. So how does the model then learn how to, you know, figure this out? And in what context is it accurate to say blue versus red?
Right. So, I mean, first of all, the model doesn't just output one token, right? It outputs a distribution.
And it turns out the way most people take it is they take the top, the top-k, i.e., the highest probability.
So yes, blue is obviously the right answer if you give it to anyone on this planet.
But there are situations and context where the sky is red is the appropriate sentence,
but that's not just in isolation, right? It's like if the prior passage is all about Mars and all this,
and then all of a sudden it's like, and that's like a quote from a Martian settler and it's like
the sky is. And then the correct token is actually red, right? The correct word. And so it has
to know this through the attention mechanism, right? If it was just the sky is blue, always
you're going to output blue because blue is, let's say, 80%, 90%, 99% likely to be the right
option. But as you start to add context about Mars or any other planet, right? Other planets
have different colored atmospheres, I presume, this distribution starts
to shift, right? If I add, "We're on Mars. The sky is," you know, then all of a sudden blue goes down
from 99%, because in the prior context window, right, the text that you sent to the model,
the attention of it, all of a sudden it realizes "the sky is" is preceded by the stuff about
Mars. Now, blue rockets down to, let's call it, 20% probability and red
rockets up to 80% probability, right? The model outputs that distribution, and then most people just
end up taking the top probability and outputting it to the user.
And how does the model learn that? It's the attention
mechanism, right? And this is sort of the beauty; the attention mechanism is
the beauty of modern large language models. It takes the relational value, you know,
in this vector space between every single token, right? So, "the sky is blue,"
right? Like, when I think about it, yes, blue is the next token after "the sky is," but in
a lot of older-style models, you would just predict the exact next word.
So after "sky," obviously, it could be many things.
It could be blue, but it could also be like "scraper," right?
You know, skyscrapers, yeah, that makes sense.
But what attention does is take all of these various values, the query, the key, and the value,
which represent what you're looking for, where you're looking, and what that value is,
and you're calculating mathematically what the relationship is between all of these tokens.
And so going back to the king-queen representation, right?
The way these two words interact is now calculated, right?
And the way that every word in the entire passage you sent relates to every other word is calculated and tied together,
which is why models have challenges with how long a context you can send them,
how many documents you can send them, right?
Because if you're sending them, you know, just the question, like what color is the sky?
okay, it only has to calculate the attention between, you know, those words, right?
But if you're sending it like 30 books with like insurance claims and all these other things
and you're like, okay, figure out what's going on here, is this a claim or not, right?
And in the insurance context, all of a sudden it's like, okay, I've got to calculate
the attention of not just like the last five words to each other, but I have to calculate
every, you know, 50,000 words to each other, right?
Which then ends up being a ton of math.
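As a rough illustration of the query, key, and value math described above (a single attention head in plain NumPy, not any production implementation), scaled dot-product attention can be sketched like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over a short sequence.

    Q, K, V: arrays of shape (seq_len, d). Every token attends to every
    other token, which is why the cost grows quadratically with sequence
    length (the "ton of math" for 50,000-word documents).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # how much each token looks at each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V                            # weighted mix of value vectors

# Tiny example: 4 tokens, 8-dimensional vectors (random numbers, for illustration only).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

The score matrix has one entry for every pair of tokens, which is the quadratic blow-up that makes very long contexts expensive.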
Back in the day, actually, the best language models were a different architecture entirely, right?
But then at some point, you know, Transformers, i.e., large language
models, which are based on transformers primarily, rocketed past in capabilities because
they were able to scale and because the hardware got there.
And then we were able to scale them so much that we were able to just put text
in them, and not just a lot of text or a lot of books, but the entire internet, which, you know,
one could view the internet oftentimes as a microcosm of all human culture and learnings
and knowledge to many extents, because most books are on the internet, most
papers are on the internet. Obviously there's a lot of things missing on the internet, but this is the sort of modern, you know, magic of it. It was sort of like three different things coming all together at once, right?
An efficient way for models to relate every word to each other, the compute necessary to scale the data large enough, and then someone actually pulling the trigger to do that, right, at the scale that got to the point where it was useful, right? Which was sort of like GPT-3.5 level or 4 level, right, where it became
extremely useful for normal humans to use, you know, chat models.
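The earlier point about the next-token distribution shifting with context, blue at roughly 99% in isolation but red winning once Mars is in the context, can be sketched as sampling from a probability distribution. The probabilities below are illustrative, not from any real model:

```python
import numpy as np

def sample_next_token(probs, top_k=1, temperature=1.0, seed=0):
    """Pick a next token from a probability distribution.

    top_k=1 is the greedy "just take the top probability" behavior
    described above; a larger top_k keeps more candidates in play.
    """
    rng = np.random.default_rng(seed)
    tokens, p = zip(*sorted(probs.items(), key=lambda kv: -kv[1])[:top_k])
    p = np.array(p) ** (1.0 / temperature)
    p = p / p.sum()
    return rng.choice(tokens, p=p)

no_context   = {"blue": 0.99, "red": 0.005, "gray": 0.005}  # "The sky is ..."
mars_context = {"blue": 0.20, "red": 0.75, "gray": 0.05}    # "... We're on Mars. The sky is ..."

print(sample_next_token(no_context))    # blue
print(sample_next_token(mars_context))  # red
```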
Okay.
And so why is it called pre-training?
So pre-training is sort of called that because it is what happens, you know,
before the actual training of the model, right?
The objective function in pre-training is to just predict the next token, but predicting
the next token is not what humans want to use AIs for, right?
I want it to ask a question and answer it.
But in most cases, asking a question does not necessarily mean that the next most likely
token is the answer, right?
Oftentimes it is another question, right?
For example, if I ingested the entire SAT, you know, and I asked a question, the next
tokens would be the answer choices, like, A is this, B is this, C is this, D is
this.
Like, no, I just want the answer, right?
And so pre-training is, the reason it's called pre-training is because you're ingesting
humongous volumes of text no matter the use case.
and you're learning the general patterns across all of language, right?
I don't actually know that King and Queen relate to each other in this way.
And I don't know that King and Queen are opposites in these ways, right?
And so this is why it's called pre-training is because you must get a broad general understanding
of the entire sort of world of text before you're able to then do post-training or fine-tuning,
which is let me train it on more specific data that is specifically useful for what I want it to do.
whether it's, hey, in chat-style applications, you know, when I ask a question, give me the answer,
or in other applications, if someone says, teach me how to build a bomb,
well, obviously, no, I'm not going to help you build a bomb, because we don't want the model to teach people how to build bombs.
So, you know, it's sort of got to learn to do this.
And it's not like, when you're doing this pre-training, you're filtering out all of this data.
Because in fact, there's a lot of good, useful data around how to build bombs, because there's a lot of useful information on, like, hey,
C4 chemistry, and, you know, people want to use it for chemistry, right? So you don't
want to just filter out everything so that the model doesn't know anything about it. But at the
same time, you don't want it to output, you know, how to build a bomb. So there's like a fine
balance here. And that's why pre-training is defined as "pre": because you're still
letting it do things and teaching it things and inputting things into the model that are
theoretically quite bad, right? For example, books about killing or war tactics
or what have you, right? Things that plausibly you could see like, oh, well, maybe that's not
okay. Or wild descriptions of really grotesque things all over the internet. But
you want the model to learn these things, right? Because first you build the general understanding
before you say, okay, now that you've got a general framework of the world, let's align you
so that you, with this general understanding of the world, can figure out what is useful for people,
what is not useful for people. What should I respond on? What should I not respond on?
So what happens then in the training process? Is
the training process that the model is attempting to make the next prediction and then just
trying to minimize loss as it goes? Right, right. I mean, basically,
loss is how often you're wrong versus right, in the most simple terms, right?
You'll run through passages through the model, and you'll see how often the model
got it right. When it got it right? Great. Reinforce that. When it got it wrong,
let's figure out which neurons, quote-unquote "neurons,"
in the model you can tweak to then fix the answer, so that when you go through it again,
it actually outputs the correct answer.
And then you move the model slightly in that direction.
Now, obviously, the challenge with this is, you know, I can come up with a
simplistic way where all the neurons will just output "the sky is blue"
every single time it says "the sky is."
But then when it goes to, you know, hey, the color blue is commonly used on walls because it's soothing, right?
And it's like, oh, what's the next word?
Is soothing, right?
So, like, that is a completely different representation.
And to understand that blue is soothing and that the sky is blue and those things aren't actually related, but they are related to blue is like very important.
And so, you know, oftentimes you'll run through the training data set multiple times, right?
because the first time you see it, oh, great, maybe you memorized that the sky is blue,
and you memorized the wall is blue, and when people describe art and oftentimes use the
color blue, it can be representations of art or the wall, right? And so over time, as you
go through all this text in pre-training, yes, you're minimizing loss initially by just
memorizing, but over time, because you're constantly overwriting the model, it starts to learn
the generalization, right, i.e., blue is a soothing color, also
represents the sky, also used in art for either of those two motifs, right? And so that's sort of the
goal of pre-training is you don't want to memorize, right? Because that's, you know, in school you
memorize all the time. And that's not useful because you forget everything you memorize. But if you
get tested on it then, and then you get tested on it six months later, and then again, six months
later after that or however you do it, ends up being, oh, you don't actually like memorize that
anymore, you just know it innately, and you've generalized on it. And that's the real goal
that you want out of the model, but that's not necessarily something you can just measure, right?
And therefore, loss is something you can measure, i.e., for this group of text, right? Because
you train the model in steps. Every step, you're inputting a bunch of text.
You're trying to see where it predicts the right token; where it didn't predict the right token,
let's adjust the neurons. Okay, onto the next batch of text. And you'll do these batches
over and over and over again across trillions of words of text, right? And as you step through
and then you're like, oh, well, I'm done. But I bet if I go back to the first group of text,
which is all about the sky being blue, it's going to get the answer wrong. Because maybe later
on in the training it saw some passages about sci-fi and how the Martian sky is red.
So it'll overwrite. But then over time, as you go through the data multiple times,
as you see it on the internet multiple times, you see it in different books multiple times,
whether it be scientific, sci-fi, whatever it is, it starts to learn
that representation of, like, oh, when it's on Mars, it's red, because the sky on Mars is red,
because the atmospheric makeup is this way, whereas the atmospheric makeup on Earth is a different
way.
And so that's sort of like the whole point of pre-training is to minimize loss, but the nice
side effect is that the model initially memorizes, but then it stops memorizing and it
generalizes.
And that's the useful pattern that we want.
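The loop Dylan describes, predict the next token, measure how wrong you were, nudge the weights, is next-token cross-entropy at its core. Here is a toy sketch of the objective with invented probabilities, not an actual training pipeline:

```python
import numpy as np

def cross_entropy_loss(predicted_probs, correct_token):
    """Loss is low when the model put high probability on the token
    that actually came next, and high when it didn't."""
    return -np.log(predicted_probs[correct_token])

# Early in training: the model spreads probability almost uniformly.
early = {"blue": 0.25, "red": 0.25, "green": 0.25, "soothing": 0.25}
# Later: after many passes over the data, "blue" dominates after "The sky is".
later = {"blue": 0.92, "red": 0.05, "green": 0.02, "soothing": 0.01}

print(cross_entropy_loss(early, "blue"))  # ~1.39, high loss
print(cross_entropy_loss(later, "blue"))  # ~0.08, low loss
# Each training step nudges the weights ("neurons") a little so that,
# averaged over trillions of tokens, this loss goes down.
```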
Okay, that's fascinating.
We've touched on post-training for a bit, but just to recap: post-training is, so you have a model that's good at predicting the next word.
And in post-training, you sort of give it a personality by inputting sample conversations to make the model want to emulate the values that you want it to take on.
Yeah, so post-training can be a number of different things.
The most simple way of doing it is, yeah, pay humans to label a bunch of data, take a lot of data,
a bunch of example conversations, et cetera, and input that data and train on that at the end,
right? And so that example data is useful, but this is not scalable, right? Like using humans
to train models is just so expensive, right? So then there's the magic of sort of reinforcement
learning and other synthetic data technologies, right, where the model is helping teach the model, right?
So you have many models in a sort of in a post-training where, yes, you have some example
human data, but human data does not scale that fast, right? Because the internet is trillions
and trillions of words out there, whereas, you know, even if you had, you know, Alex and I write
words all day long for our whole lives, we would have millions or, you know, hundreds of millions
of words written, right? It's nothing. It's like orders of magnitude off in terms of the number of
words required. So then you have the model, you know, take some of this example data, and you have
various models that are surrounding the main model that you're training, right? And these can be
policy models, right, teaching it, hey, is this what you want or that what you want? Reward models,
right? Like, is that good response? Is that a good response? Is that a bad response? You have value
models like, hey, grade this output, right? And you have all these different models working in
conjunction to say, you know, different companies have different objective functions, right? In the case
of Anthropic, they want their model to be helpful, harmless, and safe, right? So, be
helpful, but also don't harm people or anyone or anything, and then, you know, be safe, right?
In other cases, like Grok, right, Elon's model from xAI, it actually just wants to be helpful, and maybe
it has a little bit of a right lean to it, right? And for other folks, right, like, you know,
I mean, most AI models are made in the Bay Area, so they tend to just be left-leaning, right?
But also the internet in general is a little bit left-leaning because it skews younger than older.
And so all these things sort of affect models. But it's not just
around politics, right? Post-training is also just about teaching the model. If I say, like,
"the movie where the princess has a slipper and it doesn't fit," it's like, well, if I said that
to a base model that was just pre-trained, the answer wouldn't be, "Oh, the movie you're
looking for is Cinderella." You know, it would only realize that once it goes
through post-training, right? Because a lot of times people just throw garbage into
the model, and then the model still figures out what you want, right? And this is part of what post-training
is. Like, you can just do stream of consciousness into models, and oftentimes it'll figure out
what you want. Like, you know, if it's a movie that you're looking for, or if it's help
answering a question, or if you throw a bunch of, like, unstructured data into it and then ask
it to make it into a table, it does this, right? And that's because of all these different
aspects of post-training, right? Example data, but also, you know, generating a bunch of data
and grading it and seeing if it's good or not, and whether it matches the various policies
you want. Is it helpful? You know, a lot of times grading can be based on multiple factors, right? There can
be a model that says, hey, is this helpful? Hey, is this safe? And what is safe? Right? So then that
model for safety needs to be tuned on human data, right? So there's, it is a quite complex thing,
but the end goal is to be able to get the model to output in a certain way. Models aren't always
about just humans using them either, right? There can be models that are just focused on like,
hey, like, you know, if it doesn't output code, you know, yes, it was trained on the whole
internet because the person's going to talk to the model using text, but if it doesn't output
code, you know, penalize it, right? Now, all of a sudden, the model will never output text
ever again and only output code. And so these sorts of models exist too. So post-training
is not just a univariable thing, right? It's, what variables do you want to target? And so that's why
models have different personalities from different companies, it's why they target different use cases,
and why, you know, it's not just one model that rules them all, but actually many.
That's fascinating. So that's why we've seen so many different models with different personalities: because
it all happens in the post-training moment. And when you talk about
giving the models examples to follow,
that's what reinforcement learning with human feedback is:
the humans give some examples,
and then the model learns to emulate what the human trainer
is interested in having them embody.
Is that right?
Yeah, exactly.
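One common ingredient in this reinforcement-learning-from-human-feedback recipe is a reward model trained on human preference pairs. The sketch below is a toy version of that idea with made-up scores, not any lab's actual setup:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style pairwise loss: the reward model is pushed to
    score the human-preferred response above the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Hypothetical scores the reward model currently assigns to two responses
# for the prompt "How do I reset my password?".
score_helpful_answer = 2.1   # direct, helpful answer (human-preferred)
score_evasive_answer = 0.3   # vague non-answer (human-rejected)

print(preference_loss(score_helpful_answer, score_evasive_answer))  # small loss: ranking is right
print(preference_loss(score_evasive_answer, score_helpful_answer))  # large loss: ranking is wrong

# A trained reward model can then grade the main model's outputs at scale,
# standing in for human labelers, who, as noted above, are too expensive to use everywhere.
```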
Okay, great.
All right, so in the first half we've covered what training is, what tokens are,
what loss is, and what post-training is. Post-training, by the way, is
also called fine-tuning.
We've also covered reinforcement learning with human feedback.
We're going to take a quick break, and then we're going to talk about reasoning.
We'll be back right after this.
Hey, everyone, let me tell you about The Hustle Daily Show, a podcast filled with business, tech
news, and original stories to keep you in the loop on what's trending.
More than 2 million professionals read The Hustle's daily email for its irreverent and
informative takes on business and tech news.
Now, they have a daily podcast called The Hustle Daily Show, where their team of writers
break down the biggest business headlines in 15 minutes or less and explain why you should care
about them. So search for The Hustle Daily Show in your favorite podcast app, like the one you're
using right now. And we're back here on Big Technology Podcast with Dylan Patel. He's the founder
and chief analyst at SemiAnalysis. He actually has a great analysis of NVIDIA's recent
GTC conference, which we covered on a recent episode. You can find SemiAnalysis
at semi-analysis.com.
It is both content and consulting.
So I definitely check in with Dylan for all of those needs.
And now we're going to talk a little bit about reasoning.
Because a couple months ago, and Dylan, this is really where, you know, I sort of entered
the picture of watching your conversation with Lex, with Nathan Lambert, about what
the difference is between reasoning and your traditional LLMs, large language models.
If I gathered it right from your conversation, what reasoning is, is basically, instead of the model predicting the next word based off of its training, it uses the tokens to spend more time figuring out what the right answer is, and then coming out with a new prediction.
I think Karpathy does a very interesting job in the YouTube video talking about how models think with tokens.
the more tokens there are, the more compute they use because they're running these
predictions through the transformer model, which we discussed, and therefore they can come to
better answers. Is that the right way to think about reasoning?
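Alex's framing here, that each generated token costs roughly the same compute, so more thinking tokens means more total compute per answer, can be sketched with a common back-of-the-envelope estimate. The parameter count is hypothetical and the 2-FLOPs-per-parameter-per-token rule is only an approximation:

```python
def generation_flops(n_params, n_output_tokens):
    """Rough rule of thumb: ~2 FLOPs per parameter per generated token.
    This ignores attention-over-context costs, so it is only an estimate."""
    return 2 * n_params * n_output_tokens

params = 70e9  # a hypothetical 70B-parameter model

direct_answer  = generation_flops(params, n_output_tokens=20)     # e.g., "The answer is 578."
with_reasoning = generation_flops(params, n_output_tokens=2_000)  # long chain of thought first

print(f"direct:    {direct_answer:.2e} FLOPs")
print(f"reasoning: {with_reasoning:.2e} FLOPs (~{with_reasoning / direct_answer:.0f}x more compute)")
```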
So I think that humans are also fantastic at pattern matching, right? We're really good at
recognizing things. But a lot of tasks, it's not like an immediate response, right? We are
thinking. Whether that's thinking through words out loud, thinking through words in an inner
monologue in your head, or just processing somehow, and then we know the answer, right?
And this is the same for models, right? Models are horrendous at math, right? Historically they have
been, right? You could ask it, you know, is 9.11 bigger than 9.9, and it would say yes, it's bigger,
even though everyone knows that 9.11 is way smaller than 9.9, right? And that's just
a thing that happened in models because they didn't think or reason, right? And it's the same
for you, Alex, right? Or myself, right? If someone asked me, you know,
17 times 34, I'd be like, I don't know, not right off the top of my head, but, you know,
give me a little bit of time, I can do some long-form multiplication and I can get
the answer, right? And that's because I'm thinking about it. And this is the same thing
with reasoning for models, which is, you know, when you look at a transformer, every word,
every token output, has the same amount of compute behind it, right? I.e., when I'm saying
"the sky is blue," the "blue" and the "the" have the same amount
of compute to generate, right? And this is not exactly what you want, right? You want to
actually spend more time on the hard things and not on the easy things. And so reasoning models
are effectively teaching, you know, large pre-trained models to do this, right? Hey, think through
the problem. Hey, output a lot of tokens. Think about it. Generate all this text. And then when you're
done, you know, start answering the question, but now you have all of this stuff you generated
in your context, right? And that stuff you generated is helpful, right? It could be
all sorts of things, you know, just like any human's thought patterns are, right? And so
this is the sort of new paradigm that we've entered maybe six months ago, where
models now will think for some time before they answer. And this enables much better
performance on all sorts of tasks, whether it be coding or math or understanding science or
understanding complex social dilemmas, right? All sorts of different topics they're much,
much better at. And this is done through post-training, similar to the reinforcement learning
with human feedback that we mentioned earlier. But also, there's other forms of post-training,
and that's what makes these reasoning models. Before we head out, I want to hit on a couple things.
first of all, the growing efficiency of these models.
So I think one of the things that people focused on with DeepSeek
was that it was just able to be much more efficient
in the way that it generates answers.
And there was this, obviously, this big reaction to Nvidia stock
where it fell 18% on the Monday after the DeepSeek weekend
because people thought we wouldn't need as much compute.
So can you talk a little bit about how models are becoming more efficient
and how they're doing it?
So there's a variety of things here. The beauty of AI is not just that we continue to build new capabilities, right?
Because those new capabilities are going to be able to benefit the world in many ways.
And there's a lot of focus on those.
But there's also a lot of focus on, well, getting to that next level of capabilities via the scaling laws, i.e., the more compute and data I spend, the better the model gets.
But then the other vector is, well, can I get to the same level with less compute and data, right?
And those two things go hand in hand, because if I can get to the same level with less compute and data,
then I can spend that more compute and data and get to a new level, right?
And so AI researchers are constantly looking for ways to make models more efficient,
whether it be through algorithmic tweaks, data tweaks, tweaks in, you know, how you do reinforcement learning, so on and so forth.
Right. And so when we look at models across history, they've constantly gotten cheaper and cheaper and cheaper, right, at a stupendous rate, right?
And so one easy example is GPT-3, right?
Because there's GPT-3, 3.5 Turbo, Llama 2 7B, Llama 3.1, Llama 3.2, right?
As these models have evolved, we've gone from, hey, it costs $60 for a million tokens, to it costs like 5 cents now for the same quality of model.
And the model has shrunk dramatically in size as well.
And that's because of better algorithms, better data, et cetera.
And now what happened with DeepSeek was similar. You know, OpenAI had GPT-4, then they had 4 Turbo, which was half the cost, then they had 4o, which was again half the cost.
And then Meta released Llama 405B, open source, and so the open source community was able to run that.
And that was again roughly half the cost, or 5x lower cost than 4o, which was lower than 4 Turbo and 4.
But DeepSeek came out with another tier, right?
So when we looked at GPT-3, the cost fell 1,200x from GPT-3's initial cost to what you can get with Llama 3.2 3B today, right?
And likewise, when we look from GPT-4 to DeepSeek V3, it's fallen roughly 600x in cost, right?
So we're not quite at that 1,200x, but it has fallen 600x in cost, from $60 to, you know, about a dollar, right?
Or to less than a dollar, sorry, 60x.
And so you've got this massive cost decrease, but it's not necessarily out of bounds, right?
We've already seen it.
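To see how cost multiples like these map to per-million-token prices, here is the plain arithmetic. The starting price and the ratios are the round figures quoted in the conversation, used purely for illustration:

```python
def price_after_drop(start_price_per_million, drop_factor):
    """Price per million tokens after an N-x cost reduction."""
    return start_price_per_million / drop_factor

# GPT-3-era pricing quoted above: roughly $60 per million tokens.
print(price_after_drop(60, 1_200))  # ~$0.05 -> the "5 cents" figure for GPT-3-class quality
# GPT-4-era pricing, also quoted as roughly $60 per million tokens.
print(price_after_drop(60, 600))    # ~$0.10 -> a 600x drop lands around a dime
print(price_after_drop(60, 60))     # ~$1.00 -> a 60x drop lands around a dollar
```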
I think what was really surprising was that it was a Chinese company for the first time, right?
Because Google and OpenAI and Anthropic and Meta have all traded blows, right?
You know, whether it be OpenAI always being on the leading edge, or Anthropic always being on the leading edge,
or, you know, Google and Meta being close followers, sometimes with a new feature and sometimes just being much cheaper.
We have not seen this from any Chinese company, right?
And now we have a Chinese company releasing a model that's cheap.
It's not unexpected, right?
Like, this is actually within the trend line: what happened with GPT-3
is happening to GPT-4-level quality with DeepSeek.
It's more so surprising that it's a Chinese company,
and that's, I think, why everyone freaked out.
And then there was a lot of things that, like, you know, from there became a thing, right?
Like, if Meta had done this, I don't think people would have freaked out, right?
And Meta is going to release their new Llama soon enough, right? And that one is going to be, you know,
a similar level of cost decrease, probably in a similar area as DeepSeek V3, right? It's just that people
aren't going to freak out because it's an American company; it was sort of expected.
All right, Dylan, let me ask you the last question, which is, you mentioned, I think you mentioned, the bitter lesson,
which is basically that, I mean, I'm going to be kind of facetious in summing it up, but
the answer to all questions in machine learning is just to make bigger
models, and scale solves almost all problems. So it's interesting that we have this moment
where models are becoming way more efficient, but we also have massive, massive data center
buildouts. I think it would be great to hear you kind of recap the size of these data center
buildouts and then answer this question. If we are getting more efficient, why are these data
centers getting so much bigger? And what might that added scale get, in the world of generative
AI, for the companies building them?
Yeah, so when we look across the ecosystem at data center buildouts, we track all the
buildouts and server purchases and supply chains here.
And the pace of construction is incredible, right?
You can just, you can pick a state and you can see new data centers going up all across
the U.S. and around the world, right?
And so you see things like the capacity of, for example, the largest-scale training
supercomputers going from, hey, it's a few
hundred million dollars, and it wasn't even a few hundred million dollars years ago, but,
you know, for GPT-4 it was a few hundred million dollars and it's one building full of
GPUs, to GPT-4.5 and the reasoning models like o1 and o3 being done in three
buildings on the same site for billions of dollars, to, hey, these next-generation
things that people are making are tens of billions of dollars, like OpenAI's
data center in Texas called Stargate, right, with Crusoe and Oracle and et cetera, right?
And likewise applies to Elon Musk, who's building these data centers in an old factory where he's
got like a bunch of like gas generation, you know, outside and he's doing all these crazy things
to get the data center up as fast as possible, right? And you can go to just basically every
company and they have these humongous buildouts. And this is because of
the scaling laws, right? You know, 10x more compute
for linear improvement gains, right?
Or it's log-log, sorry.
But you end up with this, like, very confusing thing,
which is, like, hey, models keep getting better
as we spend more, but also the model that we had a year ago
is now done for way, way cheaper, right?
Oftentimes, 10x cheaper or more, right?
Just a year later.
So then the question is, like,
why are we spending all this money to scale?
And there's a few things here, right?
A, you can't actually make that cheaper model without making the better, bigger model,
so you can generate data to help you make the cheaper model, right?
Like, that's part of it.
But also another part of it is that, you know, if we were to freeze AI capabilities
where we were basically in, what was it, March '23, right, two years ago when GPT-4 released,
and only made them cheaper, right?
Like, DeepSeek is much cheaper, it's much more efficient,
but it's roughly the same capabilities as GPT-4.
That would not pay for all of these buildouts, right?
AI is useful today, but it is not capable of doing a lot of things, right?
But if we make the model way more efficient and then continue to scale,
and we have this like stair step, right, where we like increase capabilities massively,
make them way more efficient, increase capabilities massively, make them way more efficient.
We do the stair step, then you end up with creating all these new capabilities that could
in fact, pay for, you know, these massive AI buildouts. So with these
$10 billion data centers, no one is trying to make
chat models, right? They're not trying to make models that people chat with, just to be clear,
right? They're trying to solve things like software engineering and make it automated,
which is like a trillion dollar plus industry, right? So these are very different like sort
of use cases and targets. And so it's the bitter lesson because, yes,
you can spend a lot of time and effort making clever, specialized methods,
you know, based on intuition, and you should, right?
But these things should also just have a lot more compute thrown behind them
because if you make it more efficient, as you follow the scaling laws up,
it'll also just get better and you can then unlock new capabilities, right?
And so today, you know, a lot of AI models, the best ones from Anthropic
are now useful for, like, coding as an assistant with you, right?
You're going back and forth, you know, as time goes forward,
as you make them more efficient and continue to scale them,
the possibility is that, hey, it can code for like 10 minutes at a time.
I can just review the work and it will make me 5x more efficient, right?
You know, and so on and so forth.
And this is sort of where reasoning models and the scaling argument come in: yes, we can make it more efficient, but, you know, that alone is not going to solve the problems that we have today, right?
The Earth is still going to run out of resources.
We're going to run out of nickel because we can't make enough batteries,
and with current technology we can't replace all of the gas
and coal with renewables. All of these things are going to happen unless you continue
to improve AI and invent, or just generally research, new things. And AI helps us research new
things. Okay, this is really the last one. Where is GPT-5?
So OpenAI released GPT-4.5 recently from what they called the Orion training run. There were
hopes that Orion could be used for GPT-5, but its improvement was
not enough to really be a GPT-5.
Furthermore, it was trained on the classical method,
which is a ton of pre-training
and then some reinforcement learning with human feedback
and some other reinforcement learning,
like PPO and DPO and stuff like that.
But then along the way, right,
this model was trained last year,
along the way, another team at OpenAI
made the big breakthrough of reasoning, right?
Strawberry training, and they released o1,
and then they released o3.
And these models are rapidly getting better
with reinforcement learning with verifiable reward.
And so now, GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale,
right, like GPT-4.5, but also huge post-training scale like o1 and o3, and continuing to scale
that up, right?
And this would be the first time we see a model that was a step up in both at the same time.
And so that's what OpenAI says is coming.
They say it's coming, you know, this year, hopefully in the next three to six months,
maybe sooner.
I've heard sooner, but, you know, we'll see.
But this path of massively scaling both pre-training and post-training with reinforcement learning
with verifiable rewards should yield much better models that are capable of much more
things, and we'll see what those things are.
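Reinforcement learning with verifiable rewards, mentioned here as the post-training recipe behind o1- and o3-style models, depends on rewards that can be checked mechanically. Below is a minimal toy verifier for a math answer, an assumption-laden sketch rather than OpenAI's actual system:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Reward is 1.0 only if the final answer can be verified as correct.

    Unlike human-preference rewards, this signal can be computed at huge
    scale for domains like math and code (check the number, run the tests),
    which is what lets this kind of post-training keep scaling.
    """
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(verifiable_reward("578", "578"))  # 1.0 -> reinforce the reasoning that led here
print(verifiable_reward("612", "578"))  # 0.0 -> do not reinforce
```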
Very cool.
All right, Dylan, do you want to give a quick shout out to those who are interested in potentially
working with SemiAnalysis, who you work with, and where they can learn more?
Sure.
So we, you know, at semi-analysis.com, we have the public stuff, which is
all these reports that are
pseudo-free. But then most of our work is done directly for clients. There's these
datasets that we sell around every data center in the world, servers, all the compute, where it's
manufactured, how many, where, what's the cost, and who's doing it. And then we also do a lot
of consulting. We've got people who have worked all the way from ASML, which makes lithography
tools, all the way up to, you know, Microsoft and Nvidia, you know, making models and
doing infrastructure. And so we've got this whole gamut of folks. There's
roughly 30 of us across the world in the US, Taiwan, Singapore, Japan, France, Germany, and Canada.
So, you know, there's a lot of engagement points.
But if you want to reach out, just go to the website, you know, go to one of those specialized
pages of models or sales and reach out.
And that'd be the best way to sort of interact and engage with us.
But for most people, just read the blog, right?
Like, I think, unless you have specialized needs, unless you're a company in
the space or you're an investor in the space, you know, if you just want to be informed,
you can read the blog, and it's free, right? I think that's the best option for most people.
Yeah, well, I will attest the blog is magnificent, and Dylan, it's really a thrill to get a chance to meet you and talk through these topics with you.
So thanks so much for coming on the show. Thank you so much, Alex.
All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.