Lex Fridman Podcast - #459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Episode Date: February 3, 2025

Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects.

Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep459-sc
See below for timestamps, and to give feedback, submit questions, contact Lex, etc.

CONTACT LEX:
Feedback - give feedback to Lex: https://lexfridman.com/survey
AMA - submit questions, videos or call-in: https://lexfridman.com/ama
Hiring - join our team: https://lexfridman.com/hiring
Other - other ways to get in touch: https://lexfridman.com/contact

EPISODE LINKS:
Dylan's X: https://x.com/dylan522p
SemiAnalysis: https://semianalysis.com/
Nathan's X: https://x.com/natolambert
Nathan's Blog: https://www.interconnects.ai/
Nathan's Podcast: https://www.interconnects.ai/podcast
Nathan's Website: https://www.natolambert.com/
Nathan's YouTube: https://youtube.com/@natolambert
Nathan's Book: https://rlhfbook.com/

SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Invideo AI: AI video generator. Go to https://invideo.io/i/lexpod
GitHub: Developer platform and AI code editor. Go to https://gh.io/copilot
Shopify: Sell stuff online. Go to https://shopify.com/lex
NetSuite: Business management software. Go to http://netsuite.com/lex
AG1: All-in-one daily nutrition drinks. Go to https://drinkag1.com/lex

OUTLINE:
(00:00) - Introduction
(13:28) - DeepSeek-R1 and DeepSeek-V3
(35:02) - Low cost of training
(1:01:19) - DeepSeek compute cluster
(1:08:52) - Export controls on GPUs to China
(1:19:10) - AGI timeline
(1:28:35) - China's manufacturing capacity
(1:36:30) - Cold war with China
(1:41:00) - TSMC and Taiwan
(2:04:38) - Best GPUs for AI
(2:19:30) - Why DeepSeek is so cheap
(2:32:49) - Espionage
(2:41:52) - Censorship
(2:54:46) - Andrej Karpathy and magic of RL
(3:05:17) - OpenAI o3-mini vs DeepSeek r1
(3:24:25) - NVIDIA
(3:28:53) - GPU smuggling
(3:35:30) - DeepSeek training on OpenAI data
(3:45:59) - AI megaclusters
(4:21:21) - Who wins the race to AGI?
(4:31:34) - AI agents
(4:40:16) - Programming and AI
(4:47:43) - Open source
(4:56:55) - Stargate
(5:04:24) - Future of AI

PODCAST LINKS:
- Podcast Website: https://lexfridman.com/podcast
- Apple Podcasts: https://apple.co/2lwqZIr
- Spotify: https://spoti.fi/2nEwCF8
- RSS: https://lexfridman.com/feed/podcast/
- Podcast Playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
- Clips Channel: https://www.youtube.com/lexclips
Transcript
The following is a conversation with Dylan Patel and Nathan Lambert.
Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors,
GPUs, CPUs, and AI hardware in general.
Nathan is a research scientist at the Allen Institute for AI and is the author of the
amazing blog on AI called Interconnects.
They are both highly respected, read and listened to by the experts, researchers and engineers
in the field of AI.
And personally, I'm just a fan of the two of them.
So I use the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out.
From DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC,
and to US-China-Taiwan relations and everything else that is happening at the cutting edge of AI.
This conversation is a deep dive into many critical aspects of the AI industry.
While it does get super technical, we try to make sure that it's still accessible to
folks outside of the AI field by defining terms, stating important concepts explicitly,
spelling out acronyms, and in general, always moving across the several layers of abstraction
and levels
of detail.
There is a lot of hype in the media about what AI is and isn't.
The purpose of this podcast, in part, is to cut through the hype, through the bullshit,
and the low resolution analysis, and to discuss in detail how stuff works and what the implications are.
Let me also, if I may, comment on the new OpenAI o3-mini reasoning model, the release of which
we were anticipating during the conversation and it did indeed come out right after.
Its capabilities and cost are on par with our expectations as we stated.
OpenAI o3-mini is indeed a great model, but it should be stated that DeepSeek R1
has similar performance on benchmarks,
is still cheaper,
and it reveals its chain of thought reasoning,
which o3-mini does not.
It only shows a summary of the reasoning.
Plus R1 is open weight and o3-mini is not. By the way, I got a chance to play with o3-mini, and anecdotal-vibe-check-wise, I felt that o3-mini, specifically o3-mini-high, is better than R1. Still, for me personally, I find that Claude Sonnet 3.5 is the best model for programming, except for tricky cases where I will use o1 Pro to brainstorm. Either way,
many more better AI models will come, including reasoning models, both from American and Chinese
companies. They will continue to shift the cost curve. But the, quote, DeepSeek moment is indeed real.
I think it will still be remembered five years from now
as a pivotal event in tech history,
due in part to the geopolitical implications,
but for other reasons too,
as we discuss in detail from many perspectives
in this conversation.
And now a quick few-second mention of each sponsor. Checking them out in the description is the best way to support this podcast. We got Invideo AI for video generation, GitHub for coding, Shopify for selling stuff online, NetSuite for running your business, and AG1 for staying healthy.
Choose wisely my friends.
Also, if you want to get in touch with me for whatever reason, go to lexfridman.com slash contact.
And now onto the full ad reads.
No ads in the middle.
I try to make this interesting, but if you skip them, please still check out our sponsors.
I enjoy their stuff.
Maybe you will too.
This video is brought to you by a new sponsor, but I've known these folks for a long time, and they're a perfect fit for this podcast. They're called Invideo AI.
It's a video generating app that allows you to create full length videos using just text
prompts.
It's intuitive, works amazing.
It's truly incredible what you can do.
I've been playing with it quite a bit, using it for stock footage.
And by the way, they make it super easy for you
to switch between actually available stock footage
and AI-generated footage.
I've been preparing a lot for a conversation with Tim Sweeney,
who is the creator of Unreal Engine.
And there, it's 3D worlds, and you get to think about the role of AI in generating those 3D worlds.
That's what's coming five, 10, 20 years from now.
In video games and simulations,
a fundamental part of our lives would be generated with AI.
And I think Invideo AI does a masterful job
of pushing us in that direction in the 2D plane of video. Now I think this
is not a tool that replaces human creativity. I think it supercharges
human creativity. I think now and for a long long time to come, humans will be in
the loop of creating great art, because we're creating for each other. And only humans truly, deeply know what makes other humans go, ah, like the old Kerouac line. If you want to try out Invideo AI, you can do so for free at invideo.io slash lexpod, saving time and money on production costs.
This episode is brought to you by the thing
that's brought me joy for many, many years
and created a community for hundreds of thousands,
millions, I don't know how many developers
and that place is called GitHub.
It is a company that really has supercharged
the developer community.
I mean, where would the world be without GitHub?
And they're also, as a company,
pushing the limits of what's possible
in terms of AI code generation, AI-assisted coding.
They were pioneers on Copilot. They are still
pioneers in Copilot. It's a super competitive space and they are doing their best to win.
I will forever be a supporter of GitHub Copilot. Now it integrates in a bunch of IDEs, not
just into VS code. I am of course a VS code guy at this time. I did use JetBrains for a long
time. I still dabble a little bit. For people who don't know, JetBrains has a plethora.
Don't like using that word. It seems elitist. But there's got to be a better word. There is
a lot of different sort of sub-IDEs inside JetBrains. I've even used DataGrip, which manages MySQL. I should mention, and this might be embarrassing, but I have not, oh, this might be interesting, but I have not used anything like Copilot on any database management GUIs. I wonder if DataGrip integrates Copilot. I'm going to have to check that out. But everything I use, I'm writing SQL queries from scratch.
Inside the database management GUI, if I want to do complicated queries, I'll go to
any of the LLMs, probably going to be Claude Sonnet 3.5, or if it's part of the code, then I'm going to be inside my IDE. I just like
having a GUI management of a database. I'm gonna have to check that out with it.
If DataGrip integrates Copilot, it's gonna be incredible. If not, I'm gonna
yell from the top of my lungs, hoping it will eventually,
because it'll make my life a bit easier.
To have the visual component of a database
together with the code component of SQL queries,
yeah, it would be amazing.
Anyway, go check out GitHub Copilot at gh.io slash Copilot.
This episode is brought to you by Shopify.
Not Spotify, Shopify.
Easily confused, the CEOs are tagged on X often.
They're both great CEOs.
But this is Shopify.
You can sell anywhere with a great looking online store
using Shopify.
I've been learning a lot about the Silk Road, actually.
Not the digital one.
The one that for a lot of human history
served as a place for merchants to travel and trade goods.
And I'm reading a lot about Genghis Khan
who enforced the rule of law on the Silk Road
and that actually had a big invigorating effect on the economy of
the Eurasian region. Anyway, that was before computers. If they had computers, imagine if
they had computers. Boy, would the Genghis Khan force be terrifying. Or maybe not. Maybe each technological age has their own kind of
military tactician, their own human that matches perfectly for that time in order
to conquer the land and people. Still, what a terrifying time that was.
Much of human history. Lots of beauty, but lots of ways to die. So I'm glad to be
living in the 21st century where I can sit back with that margarita. I don't drink margaritas,
but if I wanted to, I could. And then buy stuff on stores created by Shopify. Anyway, you can sign up for a $1 per month trial period at Shopify.com slash Lex.
Go to Shopify.com slash Lex to take your business to the next level today.
This episode is also brought to you by NetSuite, an all-in-one business management system.
I'm not sure why I said that so slowly, but I did.
I actually did a little intermission for
five, six minutes for this episode, where I added in the middle of it an addendum after having tried OpenAI o3-mini. That was such a weird feeling, to sort of insert myself in the middle of an episode. I felt like a third wheel to myself. It's like, hey, hey everyone, what are you doing? Why'd you guys not invite me to this party? That's what I felt like. Hey, Lex from the past, it's me, Lex from the future. Right, I should be talking about NetSuite, which is an all-in-one cloud business management system. It's
the machine inside the machine. And boy are we increasingly
building stacks of machines. Layers and layers and layers of abstraction until
we're just sitting back on a beach somewhere talking to an AI system
that's taking care of everything else. Anyway you can download the CFO's guide to AI and machine learning at netsuite.com slash Lex.
That's netsuite.com slash Lex.
This episode is also brought to you by AG1,
an all-in-one daily drink to support better health
and peak performance.
I drank it today, I enjoyed it today.
I've been sleeping very, very little.
The amount of work I have to do is insane, and last night at 6 a.m. I went to bed, 7 a.m., 8 a.m., thinking about doing an all-nighter. It's madness. But anyway, at 6 a.m. I drank an AG1 and I was sitting on a couch and I was watching like 10 minutes of American Primeval. I watch like five, 10 minutes of a show at a time.
I was sipping on the AG-1 and I was thinking how lucky, how fucking lucky I am to be alive.
First of all, because I'm watching the American frontier and people being just brutal to each
other, the brutal reality of nature and war during that time and the lawlessness during that time.
But also just how lucky I am to be on the spinning rock enjoying this green healthy drink.
Being able to watch a show, being able to work hard towards the thing I love.
Being able to love.
Being able to breathe.
All of it.
Just amazing.
Anyway.
They'll give you a one month supply of fish oil when you sign up at drinkag1.com slash
Lex.
This is the Lex Fridman Podcast.
To support it, please check out our sponsors in the description.
And now, dear friends, here's Dylan Patel and Nathan Lambert.
A lot of people are curious to understand China's DeepSeek AI models. So let's lay it out.
Nathan, can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, how they're
trained?
Let's look at the big picture and then we'll zoom in on the details. Yes. DeepSeek V3 is a new mixture-of-experts transformer language model from DeepSeek, which is based in
China.
They have some new specifics in the model that we'll get into.
Largely, this is an open-weight model, and it's an instruction model like what you would use in ChatGPT.
They also released what is called the base model, which is before these techniques of
post-training.
Most people use instruction models today, and those are what's served in all sorts of
applications.
This was released on, I believe, December 26th or that week.
And then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model,
which really accelerated a lot of this discussion. This reasoning model has a lot of overlapping
training steps to DeepSeek V3, and it's confusing that you have a base model called V3 that you do something to, to get a chat model, and then you do some different
things to get a reasoning model.
I think a lot of the AI industry is going through this challenge of communications right
now where OpenAI makes fun of their own naming schemes.
They have GPT-4o, they have OpenAI o1, and there's a lot of types of models.
So we're going to break down what each of them are.
There's a lot of technical specifics on training, and we'll go from high level to specific and kind
of go through each of them.
There's so many places we can go here, but maybe let's go to open weights first.
What does it mean for a model to be open weights, and what are the different flavors of open
source in general?
Yeah.
So this discussion has been going on for a long time in AI.
It became more important, or more focal, since ChatGPT
at the end of 2022.
Open weights is the accepted term for when model weights of a language model are available
on the internet for people to download.
Those weights can have different licenses, which is effectively the terms by which you
can use the model.
There are licenses that come from history and open source software.
There are licenses that are designed by companies specifically. All of Llama, DeepSeek, Qwen,
Mistral, these popular names in open weight models have some of their own licenses. It's
complicated because not all the same models have the same terms. The big debate is on what makes a model open weight. Why are we saying this term? It's kind
of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of
debate on the definition and soul of open source AI. Open source software has a rich history on
freedom to modify, freedom to take on your own, freedom from many restrictions on how you would use the software, and what that means for AI is still being defined. So for what I do, I work at the Allen Institute
for AI. We're a nonprofit. We want to make AI open for everybody, and we try to lead on what we think
is truly open source. There's not full agreement in the community, but for us, that means releasing
the training data, releasing the training code, and then also having open weights like this.
And we'll get into the details of the models.
And again and again, as we try to get deeper into how the models were trained, we will
say things like the data processing, data filtering, data quality is the number one
determinant of the model quality.
And then a lot of the training code
is the determinant on how long it takes to train and how fast your experimentation is. So without
fully open source models where you have access to this data, it is hard to know or it's harder to
replicate. So we'll get into cost numbers for DeepSeek V3 on mostly GPU hours and how much you
could pay
to rent those yourselves.
But without the data, the replication cost
is going to be far, far higher.
And same goes for the code.
We should also say that this is probably
one of the more open models out of the frontier models.
So like in this full spectrum,
where probably the fullest open source, like you said,
open code, open data, open weights.
This is not open code.
This is probably not open data.
And this is open weights and the licensing is MIT license.
Or it's, I mean, there's some nuance in the different models, but it's towards the free. In terms of the open source movement, these are kind of the good guys.
Yeah.
DeepSeek is doing fantastic work for disseminating understanding of AI.
Their papers are extremely detailed in what they do.
And for other teams around the world, they're very actionable in terms of improving your
own training techniques. And we'll talk about licenses more.
The DeepSeq R1 model has a very permissive license.
It's called the MIT license.
That effectively means there's no downstream restrictions
on commercial use.
There's no use case restrictions.
You can use the outputs from the models
to create synthetic data.
And this is all fantastic.
I think the closest peer is something like Llama,
where you have the weights and you have a technical report.
And the technical report is very good for Llama.
One of the most read PDFs of the year last year
is the Llama 3 paper.
But in some ways, it's slightly less actionable.
It has less details on the training specifics,
I think less plots and so on.
And the Llama 3 license is more restrictive than MIT.
And then between the DeepSeek custom license and the Llama license, we could get into this whole rabbit hole. I think we'll make sure we wanna go down the license rabbit hole before we do specifics.
Yeah, and I mean, so it should be stated
that one of the implications of DeepSeek is it puts pressure on Llama and everybody else, on OpenAI,
to push towards open source.
And that's the other side of open source that you mentioned is how much is
published in detail about it.
So how open are you with the sort of the insights behind the code?
So like how good is the technical reports?
Are they hand wavy or is there actual details in there?
And that's one of the things that DeepSeek did well is they published a lot of the details.
Yeah, especially in the DeepSeek V3,
which is their pre-training paper.
They were very clear that they are doing interventions
on the technical stack that go at many different levels.
For example, to get highly efficient training,
they're making modifications at or below the CUDA layer
for NVIDIA chips.
I have never worked there myself and there are a few people in the world that do that
very well and some of them are at DeepSeek.
These types of people are at DeepSeek and leading American frontier labs, but they're
not many places.
To help people understand the other implication of open weights, there's a topic we'll return to often here.
So there's a fear that China, the nation, might have interest in stealing American data,
violating privacy of American citizens. What can we say about open weights to help us understand
what the weights are able to do
in terms of stealing people's data?
Yeah. So these weights that you can download from Hugging Face or other platforms are
very big matrices of numbers. You can download them to a computer in your own house that has
no internet and you can run this model and you're totally in control of your data.
That is something that is different than how a lot
of language model usage is actually done today,
which is mostly through APIs,
where you send your prompt to GPUs run by certain companies.
And these companies will have different distributions
and policies on how your data is stored,
if it is used to train future models,
where it is stored, if it is encrypted, and so on.
So with the open weights, you have the fate of your data
in your own hands, and that is something
that is deeply connected to the soul of open source.
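To make the open-weights point concrete, here is a minimal sketch, assuming the Hugging Face transformers library; the model ID and generation settings are illustrative, and a full-size model would need far more hardware than a typical workstation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"   # example repo name; pick one your hardware can hold

# After the one-time download, everything below runs with no network access:
# the weights are just big matrices of numbers sitting on your own disk.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Give me one truly novel insight about humans."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Nothing here sends your prompt to an external API; the data stays on your machine.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```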
So it's not the model that steals your data,
it's whoever's hosting the model,
which could be China, if you're using the DeepSeek app,
or it could be Perplexity.
You know, you're trusting them with your data.
Or OpenAI, you're trusting them with your data.
And some of these are American companies,
some of these are Chinese companies.
But the model itself is not doing the stealing.
It's the host.
All right.
So back to the basics.
What's the difference between DeepSeq V3 and DeepSeq R1?
Can we try to like lay out the confusion potential?
Yes, so for one, I am very understanding
of many people being confused by these two model names.
So I would say the best way to think about this
is that when training a language model,
you have what is called pre-training,
which is when you're predicting the large amounts
of mostly internet text,
you're trying to predict the next token.
And what to know about these new DeepSeek models is that they do this internet large-scale pre-training once to get what is called DeepSeek V3 base. This is the base model. It's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT. And then what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors.
So what is the more normal model in terms of the last few years of AI, an instruct model, a chat model, a quote unquote aligned model, a helpful model?
There are many ways to describe this. This is more standard post-training.
So this is things like instruction tuning, reinforcement learning from human feedback.
We'll get into some of these words.
And this is what they did to create the DeepSeek V3 model.
This was the first model to be released
and it is very high performance.
It's competitive with GPT-4, Llama 405B, so on.
And then when this release was happening, we don't know their exact timeline
or soon after they were finishing the training of a different training process from the same
next token prediction based model that I talked about, which is when this new reasoning training
that people have heard about comes in, in order to create the model that is called DeepSeek R1.
The R, through this conversation, is good grounding for reasoning, and the name is also similar to OpenAI's o1, which is the other reasoning model that people
have heard about. And we'll have to break down the training for R1 in more detail because for one,
we have a paper detailing it, but also it is a far newer set of techniques for the AI community.
So it's a much more rapidly evolving area of research.
Maybe we should also say the big two categories of training
of pre-training and post-training.
These umbrella terms that people use.
So what is pre-training and what is post-training?
And what are the different flavors of things
underneath post-training umbrella?
Yeah, so pre-training, I'm using some of the same words
to really get the message across
is you're doing what is called autoregressive prediction to predict the next token in a
series of documents. This is done over standard practices, trillions of tokens. So this is
a ton of data that is mostly scraped from the web. And some of DeepSeek's earlier papers,
they talk about their training data being
distilled for math. I shouldn't use this word yet, but taken from Common Crawl, and that's a public
access that anyone listening to this could go download data from the Common Crawl website.
This is a crawler that is maintained publicly. Yes, other tech companies eventually shift to
their own crawler, and DeepSeek likely has done this as well as most frontier labs do. But
this sort of data is something that people can get started with and you're just predicting
text in a series of documents. This can be scaled to be very efficient and there's a lot
of numbers that are thrown around in AI training like how many floating point operations or flops are used. And then you can also look at how many
hours of these GPUs that are used. And it's largely one loss function taken to a very large
amount of compute usage. You just set up really efficient systems. And then at the end of that,
you have the base model.
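As a toy illustration of what that one loss function is, here is a sketch of next-token prediction with cross-entropy in PyTorch; the shapes and random tensors are stand-ins, not a real model.

```python
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 50_000, 2, 8
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a chunk of web text
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for a model's output

# Predict token t+1 from everything up to token t: shift inputs and targets by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)   # the single loss that pre-training scales up
print(loss.item())
```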
And post-training is where there is a lot more complexity in terms of how the process is emerging or evolving and the different types of training losses that you will use.
I think this is a lot of techniques
grounded in the natural language processing literature.
The oldest technique, which is still used today,
is something called instruction tuning or also known as supervised fine tuning.
These acronyms will be IFT or SFT.
People really go back and forth between them and I will probably do the same,
which is where you add this formatting to the model where it knows to take a question that is like,
explain the history of the Roman Empire to me, or something, a sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner.
The core of that formatting
is in this instruction tuning phase.
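A hedged sketch of what that formatting looks like in practice; the template and special tokens below are generic placeholders, not DeepSeek's or anyone's exact chat format.

```python
def format_chat(prompt: str, response: str) -> str:
    # Special tokens mark the turns; during training the loss is usually applied
    # only to the assistant span, so the model learns to produce answers.
    return (
        "<|user|>\n" + prompt + "\n"
        "<|assistant|>\n" + response + "<|end_of_turn|>"
    )

print(format_chat(
    "Explain the history of the Roman Empire to me.",
    "The Roman Empire traces back to Augustus in 27 BC...",
))
```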
And then there's two other categories of loss functions
that are being used today.
One I will classify as preference fine tuning.
Preference fine tuning is a generalized term
for what came out of reinforcement learning
from human feedback, which is RLHF.
This reinforcement learning from human feedback
is credited as the technique that helped
ChatGPT break through.
It is a technique to make the responses
that are nicely formatted like these Reddit answers more in tune with what a human would like to read. This is done by collecting pairwise
preferences from actual humans out in the world to start, and now AIs are also labeling this data,
and we'll get into those tradeoffs. And you have this kind of contrastive loss function between a
good answer and a bad answer. And the model learns to pick up these trends
There's different implementation ways. You have things called reward models. You could have direct alignment algorithms. There's a lot of really specific things you can do, but all of this is about fine-tuning to human preferences.
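As a rough illustration of that contrastive idea, here is a sketch in the style of a direct alignment algorithm (a DPO-like loss); the log-probabilities are toy numbers rather than outputs of a real model.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers the chosen answer than the reference
    # model does, minus the same quantity for the rejected answer.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy numbers standing in for summed token log-probabilities of two answers.
loss = dpo_style_loss(
    torch.tensor([-12.0]), torch.tensor([-15.0]),   # policy: chosen, rejected
    torch.tensor([-13.0]), torch.tensor([-14.5]),   # reference: chosen, rejected
)
print(loss.item())
```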
The final stage is much newer and will link to what is done in R1 and these reasoning models. I think OpenAI's name for this, they had this
new API in the fall, which they called the Reinforcement Fine Tuning API. This is the idea
that you use the techniques of reinforcement learning, which is a whole framework of AI.
There's a deep literature here. To summarize, it's often known as trial and error learning
or the subfield of AI where you're trying to
make sequential decisions in a certain potentially noisy environment. There's a lot of ways we could
go down that. But fine tuning language models where they can generate an answer and then you
check to see if the answer matches the true solution. For math or code, you have an
exactly correct answer for math. You can have unit tests for code,
and what we're doing is we are checking the language models work and we're giving it multiple
opportunities on the same questions to see if it is right. And if you keep doing this,
the models can learn to improve in verifiable domains to a great extent. It works really well,
it's a newer technique in the academic literature, it's been used at Frontier Labs in the US that don't share every detail for multiple years.
So this is the idea of using reinforcement learning with language models and it has been
taking off, especially in this DeepSeek moment.
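A toy sketch of the verifiable-rewards idea described here: sample several answers to the same question, check each against the known solution, and use the scores as the reward signal for the RL update. The candidate answers and the reward rule are illustrative only.

```python
import re

def reward(answer_text: str, true_answer: str) -> float:
    # Pull the last number out of the model's answer and compare it exactly.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    return 1.0 if numbers and numbers[-1] == true_answer else 0.0

question = "What is 12 * 7?"
true_answer = "84"
candidates = [                      # stand-ins for several sampled model answers
    "Let me think... 12 * 7 = 84",
    "It is 74.",
    "12 * 7: 84",
]

rewards = [reward(c, true_answer) for c in candidates]
print(rewards)   # [1.0, 0.0, 1.0] -> these scores drive the policy update
```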
And we should say that there's a lot of exciting stuff going on, again, across the stack, but in post-training, probably this year there's going to be a lot of interesting developments. We'll talk about it.
I almost forgot to talk about the difference between DeepSeek V3 and R1 on the user experience side. So forget the technical stuff, forget all that. People that don't know anything about AI, they show up, like, what's the actual experience? What's the use case for each one when they actually type and talk to it? What is each good at, and that kind of thing?
So let's start with DeepSeek V3. Again, it's what more people would have tried, something like it. You ask it a question, it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list, it might have formatting to help you draw to the core details in the answer, and it'll generate tens to hundreds of tokens. A token is normally a word for
common words or a subword part in a longer word. And it'll look like a very high quality
Reddit or Stack Overflow answer. These models are really getting good at doing these across a wide variety
of domains. Even things that if you're an expert, things that are close to the fringe of knowledge,
they will still be fairly good at. Cutting edge AI topics that I do research on, these models are
capable for study aid and they're regularly updated. Where this changes is with DeepSeek R1, what is called these
reasoning models is when you see tokens coming from these models to start, it will be a large
chain of thought process. We'll get back to chain of thought in a second, which looks like a lot of
tokens where the model is explaining the problem. The model will often break down the problem and be
like, okay, they asked me for this, let's break down the problem, I'm going to need to do this.
And you'll see all of this generating from the model, it'll come very fast in most user
experiences.
These APIs are very fast, so you'll see a lot of tokens, a lot of words show up really
fast.
It'll keep flowing on the screen, and this is all the reasoning process.
And then eventually the model will change its tone in R1 and it'll write the answer,
where it summarizes its reasoning process and writes a similar answer to the first types of models.
But in DeepSeq's case, which is part of why this was so popular even outside the AI community,
is that you can see how the language model is breaking down problems.
And then you get this answer on a technical side.
They train the model to do this specifically, where they have a section which
is reasoning, and then it generates a special token,
which is probably hidden from the user most of the time,
which says, OK, I'm starting to answer.
So the model is trained to do this two-stage process
on its own.
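For illustration, here is a minimal sketch of parsing that two-stage output; the <think> tags match R1's published chat format, but treat the exact strings as an assumption.

```python
def split_reasoning(text: str):
    # The reasoning section comes first, then a special marker, then the answer.
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

raw_output = (
    "<think>They asked for a novel insight about humans. Let me break this down... "
    "Is this truly novel? Let me dig deeper.</think>"
    "Humans instinctively convert selfish desires into cooperative systems."
)

reasoning, answer = split_reasoning(raw_output)
print("REASONING:", reasoning[:50], "...")
print("ANSWER:", answer)
```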
If you use a similar model in, say, OpenAI,
OpenAI's user interface is trying
to summarize this process for you nicely by kind of showing
the sections that the model is doing and it'll kind of click through it'll say breaking down
the problem, making X calculation, cleaning the result and then the answer will come for
something like OpenAI.
Maybe it's useful here to go through an example of DeepSeek R1 reasoning.
Yeah, so that if you're looking at the screen here,
what you'll see is a screenshot of the DeepSeek chat app
and at the top is thought for 151 seconds
with the drop down arrow.
Underneath that, if we were in an app that we were running,
the drop down arrow would have the reasoning.
So in this case, the specific question,
which I'm philosophically slash pothead inclined.
So this is asking DeepSeek R1
for one truly novel insight about humans.
And it reveals the reasoning,
and basically the truly novel aspect
is what's pushing the reasoning
to constantly, sort of, have the model asking itself,
is this truly novel?
So it's actually challenging itself to be more novel,
more counterintuitive, less cringe, I suppose.
So some of the reasoning says, this is just snapshots,
alternatively humans have a unique meta emotion
where they feel emotions about their own emotions,
e.g. feeling guilty about being angry.
This recursive emotional layering
creates complex motivational drives
that don't exist in other animals.
The insight is that human emotions are nested.
So it's like, it's reasoning through
how humans feel emotions.
It's reasoning about meta-emotions.
It's gonna have pages and pages of this.
It's almost too much to actually read,
but it's nice to skim as it's coming.
It's a stream of, it's a James Joyce-like
stream of consciousness.
And then it goes, wait,
the user wants something that's not seen anywhere else,
let me dig deeper.
And consider the human ability to hold
contradictory beliefs simultaneously,
cognitive dissonance is known,
but perhaps the function is to allow flexible adaptation,
so on and so forth.
I mean, that really captures the public imagination
that, holy shit, this isn't, I mean,
intelligence slash almost like an inkling of sentience
because you're thinking through,
you're self-reflecting, you're deliberating.
And the final result of that, after 157 seconds,
is, humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules, money, laws, rights, are real. These shared hallucinations act as, quote, games where competition is secretly redirected to benefit the group, turning conflict into society's fuel. Pretty profound. I mean, you know. This is a potential digression, but a lot of
people have found that these reasoning models can sometimes produce much more eloquent text. That's
a at least interesting example, I think, depending on how open-minded you are. You find language
models interesting or not, and there's a spectrum there. Well, I mean, some of the, we'll talk about different benchmarks and so on,
but some is just a vibe.
Like that in itself is a, let's say, quote, fire tweet.
Yeah.
If I'm trying to produce something where people are like, oh shit, okay. So that's chain of thought, we'll probably return to it more.
How were they able to achieve such low cost on the training and the inference?
Maybe you could talk the training first.
Yeah. So there's two main techniques that they implemented that are probably the majority
of their efficiency. And then there's a lot of implementation details that maybe we'll
gloss over or get into later that sort of contribute to it. But those two main things are one is they went to a mixture of experts model, which we'll define in a second. And then
the other thing is that they invented this new technique called MLA, multi-head latent attention.
Both of these are big deals. Mixture of experts is something that's been in the literature for
a handful of years. And OpenAI with GPT-4 was the first one to productize a mixture of experts model.
And what this means is when you look at the common models around that most people have
been able to interact with that are open, think Llama.
Llama is a dense model, i.e. every single parameter or neuron is activated as you're
going through the model for every single token you generate.
Now with a mixture of experts model, you don't do that.
How does the human actually work?
It's like, oh, well, my visual cortex is active when I'm thinking about vision tasks and other
things.
My amygdala is when I'm scared.
These different aspects of your brain are focused on different things.
A mixture of experts model attempts to approximate this to some extent.
It's nowhere close to what a brain architecture is,
but different portions of the model activate, right?
You'll have a set number of experts in the model
and a set number that are activated each time.
And this dramatically reduces both your training
and inference costs.
Because now you're, you know,
if you think about the parameter count
as the sort of total embedding space
for all of this knowledge that you're compressing down during training,
when you're embedding this data in, instead of having to activate every single parameter every single time you're training or running inference,
now you can just activate a subset. And the model will learn which expert to route to for different tasks.
And so this is a humongous innovation in terms of, hey, I can continue to grow the total
embedding space of parameters.
And so DeepSeek's model is 600-something billion parameters, right?
Relative to Llama 405b, it's 405 billion parameters, right?
Relative to Llama 70b, it's 70 billion parameters, right?
So this model technically has more embedding space for information, right, to compress
all of the world's knowledge that's on the internet down, but at the same time,
it is only activating around 37 billion of the parameters.
So only 37 billion of these parameters
actually need to be computed every single time
you're training data or inferencing data out of it.
And so versus, again, the Llama model, 70 billion parameters must be activated, or 405 billion parameters must be activated.
So you've dramatically reduced your compute cost when you're doing training and inference
with this mixture of experts architecture.
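To make the mixture-of-experts idea concrete, here is a minimal toy layer with top-k routing in PyTorch; the sizes are tiny and illustrative, nothing like DeepSeek's real 256-expert implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # decides which experts see each token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)        # routing probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(TinyMoE()(tokens).shape)                         # torch.Size([5, 64])
```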
Should we break down where it actually applies and go into the transformer?
Is that useful?
Let's go.
Let's go into the transformer.
So the transformer is a thing that is talked about a lot and we will not cover every detail.
Essentially, the transformer is built on repeated
blocks of this attention mechanism and then a traditional dense, fully connected, multi-layer
perceptron, whatever word you want to use for your normal neural network. And you alternate these
blocks. There's other details. And where a mixture of experts is applied is at this dense model.
The dense model holds most of the weights
if you count them in a transformer model.
So you can get really big gains
from this mixture of experts on parameter efficiency
at training and inference because you get this efficiency
by not activating all of these parameters.
We should also say that a transformer
is a giant neural network.
Yeah.
And then, for 15 years now, this was called the deep learning revolution. Networks have gotten larger and larger, and at a certain point the scaling laws appeared, where people realized.
This is a scaling law shirt by the way.
Representing scaling laws where it became more
and more formalized that bigger is better
across multiple dimensions of what bigger means.
So, and, but these are all sort of neural networks
we're talking about,
and we're talking about different architectures
of how to construct these neural networks
such that the training and the inference on them
is super efficient.
Yeah, every different type of model
has a different scaling law for it,
which is effectively, for how much compute you put in, the architecture will get to different levels of performance at test tasks.
And mixture of experts is one of the ones at training time, even if you don't consider
the inference benefits, which are also big.
At training time, your efficiency with your GPUs is dramatically improved by using this
architecture if it is well implemented. So you can get effectively
the same performance model and evaluation scores with numbers like 30% less compute. I think there's
going to be a wide variation depending on your implementation details and stuff. But it is just
important to realize that this type of technical innovation is something that gives
huge gains. And I expect most companies that are serving their models to move to this
mixture of experts implementation. Historically, the reason why not everyone might do it is because
it's an implementation complexity, especially when doing these big models. So this is one of
the things that DeepSeq gets credit for is they do this extremely well. They do mixture of experts
extremely well. This architecture, what is called DeepSeek MoE, MoE is the shortened version of mixture of experts, is multiple papers old. This part of their training infrastructure is not new to these models alone. And same goes for what Dylan mentioned with multi-head latent attention. This is all about reducing memory usage during inference
and same things during training by using some fancy low
rank approximation math. If you get into the details with this latent attention, it's one of
those things I look at and say, okay, they're doing really complex implementations because
there's other parts of language models such as embeddings that are used to extend the context
length. The common one that DeepSeek uses is rotary positional embeddings, which is called RoPE. And if you want to use RoPE with a normal MoE, it's kind of a sequential thing. You take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this
new attention architecture, they need to do some clever things because they're not set up the same and it just makes the implementation complexity much higher
So they're managing all of these things, and these are probably the sort of things that OpenAI, these closed labs, are doing. We don't know if they're doing the exact same techniques, but they actually shared them with the world, which is really nice to feel like this is the cutting edge of efficient language model training.
And some of this requires low-level engineering. It just is a giant mess and trickery.
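For reference, here is a minimal sketch of the rotary positional embedding rotation mentioned above; this is the generic half-split RoPE formulation, not DeepSeek's MLA-specific arrangement.

```python
import torch

def rope(x, base=10000.0):
    # x: [seq_len, n_heads, head_dim], head_dim must be even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One frequency per channel pair; lower channels rotate faster.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # [seq, half]
    cos = angles.cos()[:, None, :]                                         # [seq, 1, half]
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation of each (x1, x2) pair by the position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 4, 64)     # toy queries: 16 positions, 4 heads, head_dim 64
print(rope(q).shape)           # torch.Size([16, 4, 64])
```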
So as I understand, they went below CUDA.
So they go super low-level in the programming of GPUs.
Effectively, Nvidia builds this library called Nickel, right?
In which, you know, when you're training a model,
you have all these communications between every single layer of the model, and you may have over 100 layers.
What does the Nickel stand for?
It's NCCL.
NVIDIA Collective Communications Library.
Nice.
Damn.
And so when you're training a model, you're going to have all these all reduces and all
gathers between each layer, between the multi-layer perceptron or feed-forward network and the attention mechanism, you'll have basically the model synchronized, right? Or you'll have an all-reduce or an all-gather.
And this is a communication
between all the GPUs in the network,
whether it's in training or inference.
So Nvidia has a standard library.
This is one of the reasons why it's really difficult
to use anyone else's hardware for training
is because no one's really built a standard communications library.
And Nvidia has done this at a sort of a higher level, right?
DeepSeek, because they have certain limitations around the GPUs that they have access to,
the interconnects are limited to some extent by the restrictions of the GPUs that were
shipped into China legally, not the ones that are smuggled, but legally shipped in that
they use to train this model. They had to figure out how to get efficiencies.
One of those things is that instead of just calling the NVIDIA library Nickel, they instead
created their own, they scheduled their own communications, which some of the labs do. Meta talked about in Llama 3 how they made their own custom version of Nickel.
They didn't talk about the implementation details.
This is some of what they did.
Maybe not as well as DeepSeek because DeepSeek necessity is the mother of innovation and
they had to do this.
Whereas OpenAI has people that do this sort of stuff, Anthropic, et cetera.
But DeepSeek certainly did it publicly and they may have done it even better because they were gimped on a certain aspect of the chips that they have
access to. And so they scheduled communications by scheduling specific SMs. SMs you could think of
as like the core on a GPU, right? So there's hundreds of cores or there's a bit over 100 cores,
SMs on a GPU, and they
were specifically scheduling, hey, which ones are running the model?
Which ones are doing all reduce?
Which ones are doing all-gather?
And they would flip back and forth between them, and this requires extremely low-level
programming.
This is what Nickel does automatically, or other NVIDIA libraries handle this automatically
usually.
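As a small illustration of what an all-reduce is, here is a hedged sketch using PyTorch's torch.distributed with the NCCL backend; this is the standard library path, not DeepSeek's custom SM-level scheduling.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    """Sum each gradient tensor across all GPUs, then divide by world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)   # NCCL moves the bytes here
            param.grad /= world_size

# Typical usage, one process per GPU (e.g. launched with torchrun):
#   dist.init_process_group(backend="nccl")
#   ... forward pass, loss.backward() ...
#   average_gradients(model)
#   optimizer.step()
```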
Yeah, exactly.
And so technically, they're using PTX, which is sort of like, you could think of it
as an assembly-type language.
It's not exactly like coding directly to assembly or an instruction set. It's not exactly that, but it's still technically part of CUDA.
But it's like, do I want to write in Python, PyTorch
equivalent, and call NVIDIA libraries?
Do I want to go down to the C level,
and code even lower level? Or do I want to go all the way down to the assembly or ISA level? And there are cases
where you go all the way down there at the very big labs, but most companies just do not do that,
because it's a waste of time and the efficiency gains you get are not worth it. But DeepSeq's
implementation is so complex, especially with their mixture of experts. People have done
mixture of experts,
but they're generally 8, 16 experts, and they activate two. One of the words we like to use is sparsity factor or usage. You might have one-fourth of your model activate, and that's what Mistral's Mixtral model, their model that really catapulted them to like, oh my God, they're really, really good.
OpenAI has also had models that are MOE
and so have all the other major closed labs.
But what DeepSeq did that maybe only the leading labs
have only just started recently doing
is have such a high sparsity factor, right?
It's not one fourth of the model, right?
Two out of eight experts activating
every time you go through the model,
it's eight out of 256.
And there's different implementations for mixture of experts where you can have some of these
experts that are always activated, which this just looks like a small neural network. And then all
the tokens go through that. And then they also go through some that are selected by this routing
mechanism. And one of the innovations in DeepSeq's architecture
is that they changed the routing mechanism
in mixture of expert models.
There's something called an auxiliary loss,
which effectively means during training,
you want to make sure that all of these experts are used
across the tasks that the model sees.
Why there can be failures in mixture of experts
is that when you're doing this training,
the one objective is token prediction accuracy. And if you just let training go with a mixture
of expert model on your own, it can be that the model learns to only use a subset of the experts.
And in the MOE literature, there's something called the auxiliary loss, which helps balance
them. But if you think about the loss functions of deep learning, this even connects to the
bitter lesson is that you want to have the minimum inductive bias in your model to let
the model learn maximally.
And this auxiliary loss, this balancing across experts, could be seen as in tension with the
prediction accuracy of the tokens.
So we don't know the exact extent of the DeepSeek MoE change, which is, instead of doing
an auxiliary loss, they have an extra parameter in their routing, which after the batches
they update this parameter to make sure that the next batches all have a similar use of
experts.
And this type of change can be big, it can be small, but they add up over time.
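Here is a hedged sketch of that auxiliary-loss-free balancing idea: a per-expert bias added to the routing scores and nudged after each batch. The update rule and constants are illustrative, not DeepSeek's exact implementation.

```python
import torch

n_experts, k, update_rate = 8, 2, 0.01
bias = torch.zeros(n_experts)          # one bias per expert, outside the loss function

def route(scores):
    # scores: [tokens, n_experts]; the bias shifts which experts get picked,
    # but it adds no extra training objective to the token-prediction loss.
    _, idx = (scores + bias).topk(k, dim=-1)
    return idx

def update_bias(idx):
    # After a batch, nudge underused experts up and overused experts down.
    counts = torch.bincount(idx.flatten(), minlength=n_experts).float()
    bias.add_(update_rate * torch.sign(counts.mean() - counts))

scores = torch.randn(32, n_experts)    # toy router scores for one batch of tokens
chosen = route(scores)
update_bias(chosen)
print(bias)
```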
This is the sort of thing that just points to them innovating.
And I'm sure all the labs that are training big MOEs
are looking at this sort of things,
which is getting away from the auxiliary loss.
Some of them might already use it,
but you keep accumulating gains.
And we'll talk about the philosophy of training
and how you organize these organizations.
And a lot of it is just compounding small improvements
over time in your data, in your architecture,
in your post-training and how they integrate with each other.
DeepSeek does the same thing, and some of them are shared. We have to take them at face value that they shared their most important details.
I mean, the architecture and the weights are out there, so we're seeing what they're doing
and it adds up.
Going back to sort of the efficiency and complexity point, right?
It's 32 versus four, right, for like Mixtral
and other MOE models that have been publicly released.
So this ratio is extremely high,
and sort of what Nathan was getting at there was,
when you have such a different level of sparsity,
you can't just have every GPU have the entire model, right?
The model's too big, there's too much complexity there,
so you have to split up the model
with different types of parallelism.
And so you might have different experts
on different GPU nodes.
But now what happens when this set of data that you get,
hey, all of it looks like this one way
and all of it should route to one part of my model.
So when all of it routes to one part of the model,
then you can have this overloading of a certain set
of the GPU resources or a certain set of the GPUs and then the rest of the training network
sits idle because all of the tokens are just routing to that.
So this is the biggest complexity, one of the big complexities with running a very
sparse mixture of experts model, i.e. this 32 ratio versus this four ratio, is that you end up with so
many of the experts just sitting there idle.
So how do I load balance between them?
How do I schedule the communications between them?
This is a lot of the extremely low level detailed work that they figured out in the public first
and potentially second or third in the world and maybe even first in some cases. What lesson do you, in the direction of the bitter lesson,
do you take from all of this?
Where is this going to be the direction
where a lot of the gain is going to be,
which is this kind of low level optimization?
Or is this a short term thing
where the biggest gains will be more
on the algorithmic high level side of like post-training?
Is this like a short-term leap
because they've figured out like a hack
because of constraints, necessity is the mother of invention,
or is there still a lot of gains?
I think we should summarize
what the bitter lesson actually is about.
The bitter lesson essentially, if you paraphrase it,
is that the types of training that will win out in deep learning
as we go are those methods which are scalable. Learning and search is what it calls
out. And this scale word gets a lot of attention in this. The interpretation that I use is
effectively to avoid adding the human priors to your learning process.
And if you read the original essay, this is what it talks about is how researchers will try to
come up with clever solutions to their specific problem that might get them small gains in the
short term, while simply enabling these deep learning systems to work efficiently and for these
bigger problems in the long term might be more likely to scale and continue to drive
success.
And therefore, we were talking about relatively small implementation changes to the mixture
of experts model.
And therefore, it's like, okay, we will need a few more years to know if one of these are
actually really crucial
to the bitter lesson.
But the bitter lesson is really this long-term arc
of how simplicity can often win.
And there's a lot of sayings in the industry,
like the models just wanna learn.
You have to give them the simple loss landscape
where you put compute through the model
and they will learn and getting barriers out of the way.
That's where the power of something like Nickel comes in,
where standardized code that can be used by a lot of people
to create sort of simple innovations that can scale,
which is why the hacks, I imagine that the code base
for DeepSeek is probably a giant mess.
I'm sure they have, DeepSeq definitely has code bases
that are extremely messy,
where they're testing these new ideas. Multi-head latent attention probably could start in something like a Jupyter notebook, or somebody tries something on a few GPUs, and that is really messy. But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present them to us, I would guess are extremely high quality.
Code quality?
High quality, readable code.
Yeah.
I think there is one aspect to note though, right, is that there is the general ability
for that to transfer across different types of runs, right?
You may make really, really high quality code for one specific model architecture at one
size and then that is not transferable to, hey, when I make this architecture
tweak, everything's broken again, right? Like that's something that could be, you know,
their specific low-level coding of like scheduling SMs is specific to this model architecture and
size, right? And whereas like NVIDIA's collective communications library is more like, hey, it'll work for anything,
right? You want to do an all-reduce? Great. I don't care what your model architecture is.
It'll work.
And you're giving up a lot of performance when you do that in many cases, but it's worthwhile for them to do the specific optimization
for the specific run, given the constraints that they have regarding compute.
I wonder how stressful it is to like, you know, these frontier models, like
initiate training, like to have the code.
But to push the button that like you're now spending
a large amount of money and time to train this.
Like there must, I mean, there must be a lot of innovation
on the debugging stage of like making sure there's no issues
that you're monitoring and visualizing every aspect
of the training, all that kind of stuff. When people are training they have
all these various dashboards but like the most simple one is your loss, right?
And it continues to go down. But in reality, especially with more complicated stuff like MoE, the biggest problem with it, or FP8 training, which is another innovation, you know, going to a lower precision number format, i.e. less accurate, is that you end up with loss spikes. And no one knows why the loss spikes happen.
Some of them you do. Some of them are bad data. I'll give an AI2 example of
what blew up our earlier models is a subreddit called Microwave Gang. We love to shout this out.
It's a real thing. You can pull up Microwave Gang. Essentially, it's a subreddit where
everybody makes posts that are just the letter M, so it's like, mmm.
So there's extremely long sequences of the letter M, and then the comments are like beep
beep because it's in the Microwave Gang.
But if you pass this into a model that's trained to produce normal text, it's extremely
high loss because normally you see an M, you don't predict M's for a long time.
So this is something that caused the loss spikes for us.
But this is old,
this is not recent.
And when you have more mature data systems,
that's not the thing that causes the loss spike.
And what Dylan is saying is true,
but it's like, there's levels to this sort of idea.
With regards to the stress, right?
These people are like, you know, you'll go out to dinner
with like a friend that works at one of these labs
and they'll just be like looking at their phone every 10 minutes.
And it's one thing if they're texting,
but they're just like, is the loss blowing up?
Tokens per second, loss not blown up.
They're just watching this.
And the heart rate goes up if there's a spike.
And some level of spikes is normal.
It'll recover and be back.
Sometimes a lot of the old strategy was like, you just stop the run, restart from the old version,
and then like change the data mix,
and then it keeps going.
There are even different types of spikes.
So Dirk Groeneveld has a theory that I like, fast spikes and slow spikes, where there are sometimes when you're looking at the loss and other parameters and
you can see it start to creep up and then blow up,
and that's really hard to recover from.
So you have to go back much further. So you have the stressful period where it's like flat or might start going up, and you're like, what do I do?
Whereas there are also loss spikes where it looks good and then there's one spiky data point. And what you can do is you just skip those. You see that there's a spike, you're like, okay, I can ignore this data, don't update the model, and do the next one, and it'll recover quickly. But on trickier implementations, as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up. So there's a distribution.
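A minimal sketch of that skip-the-spike strategy; the model interface (HF-style output with a .loss field), the 3x threshold, and the moving-average constants are assumptions, not anyone's actual training code:

```python
# Minimal sketch of the "skip the spiky batch" strategy described above:
# if this batch's loss is far above the recent running average, don't take
# the optimizer step. Model, data loader, and threshold are hypothetical.
import torch

def train(model, optimizer, loader, spike_factor=3.0):
    running_loss = None
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss          # assumes an HF-style model output
        if running_loss is not None and loss.item() > spike_factor * running_loss:
            # Loss spike: drop this update entirely and move on to the next batch.
            continue
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        # Exponential moving average of the loss, used only for spike detection.
        running_loss = (loss.item() if running_loss is None
                        else 0.99 * running_loss + 0.01 * loss.item())
```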
The whole idea of grokking also comes in, right? Just because it slowed down improving in loss doesn't mean it's not learning, because all of a sudden it could just spike down in loss again because it truly learned something, right? And it took some time for it to learn that. It's not a gradual process, and that's what humans are like, that's what models are like. It's really a stressful task, as you mentioned, and the whole time the dollar count is going up. Every company has failed runs. You need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of X company had Y failed run.
Every company that's trying to push the frontier of AI has these.
So yes, it's noteworthy because it's a lot of money and it can be week to month setback,
but it is part of the process.
But how do you get, if you're DeepSeek, how do you get to a place where, holy shit, there's
a successful combination
of hyperparameters?
A lot of small failed runs.
And so rapid iteration through failed runs until.
And successful ones.
And then you build a small intuition, like, this mixture of experts works,
and then this implementation of MLA works.
Key hyperparameters like learning rate and regularization
and things like this.
And you find the regime that works for your code base.
In talking to people at frontier labs,
there's a story that you can tell
where training language models is kind of a path
that you need to follow.
So you need to like unlock the ability
to train a certain type of model or a certain scale.
And then your code base and your internal know-how of which hyperparameters work for it is kind of known.
You look at the DeepSeek papers and models, they've scaled up, they've added complexity, and it's just continuing to build the capabilities that they have.
There's the concept of a YOLO run. So YOLO, you only live once. And what it is is like, there's all this experimentation you do at the small scale,
right? Research ablations, right? Like you have your Jupyter Notebook, where you're experimenting
with MLA on like three GPUs or whatever. And you're doing all these different things like,
hey, do I do four active experts, 128 experts? Do I arrange the experts this way? All these different
model architecture things you're testing at a very small scale, right?
Couple researchers, few GPUs, tens of GPUs,
hundreds of GPUs, whatever it is.
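A hedged sketch of what that small-scale ablation loop might look like before committing to a big run; train_small_proxy is a hypothetical stand-in, not a real API, and the search space values are illustrative:

```python
# Hedged sketch of the small-scale ablation loop described above: sweep a few
# MoE design choices on a tiny proxy model before committing compute to one
# big run. `train_small_proxy` is a hypothetical helper, not a real API.
import itertools
import random

def train_small_proxy(num_experts, top_k, arrangement, steps=2000):
    """Stand-in for training a tiny MoE on a few GPUs.
    Returns a fake validation loss so the sweep below runs end to end."""
    random.seed(hash((num_experts, top_k, arrangement)))
    return 3.0 + random.random()

search_space = {
    "num_experts": [64, 128, 256],
    "top_k": [2, 4, 8],                      # active experts per token
    "arrangement": ["shared_first", "uniform"],
}

results = []
for ne, k, arr in itertools.product(*search_space.values()):
    val_loss = train_small_proxy(num_experts=ne, top_k=k, arrangement=arr)
    results.append(((ne, k, arr), val_loss))

# Pick the config you believe will transfer, then commit the big run to it.
best_config, best_loss = min(results, key=lambda r: r[1])
print(best_config, best_loss)
```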
And then all of a sudden you're like,
okay guys, no more fucking around, right?
No more screwing around.
Everyone take all the resources we have,
let's pick what we think will work
and just go for it, right?
YOLO.
And this is where that sort of stress comes in is like,
well, I know it works here,
but some things that work here don't work there. And some things that work there don't work down here, right? In terms of scale, right?
So it's really truly a YOLO run. And sort of like there's this discussion of like certain researchers just have like this methodical nature.
Like they can find the whole search space and like figure out all the ablations of different research and really see what is best.
And there's certain researchers who just kind of like, you know, have that innate gut instinct of like, this is the YOLO run.
Like, you know, looking at the data, this is it.
This is why you want to work in post-training because the GPU cost for training is lower.
So you can make a higher percentage of your training runs YOLO runs.
Yeah, for now.
Yeah, for now.
So some of this is fundamentally luck still.
Luck is skill, right, in many cases.
Yeah, I mean, it looks lucky, right, when you're.
But the hill to climb, if you're at one of these labs,
you have an evaluation you're not crushing.
There's a repeated playbook of how you improve things.
There are localized improvements, which
might be data improvements, and these add up
into the whole model just being much better.
And when you zoom in really close, it can be really obvious that this model is just really bad at this thing,
and we can fix it, and you just add these up. So some of it feels like luck,
but on the ground, especially with these new reasoning models we're talking about, there are just so many ways that we can poke around, and normally some of them give big improvements.
The search space is near infinite, right?
And yet the amount of compute and time you have is very low and you have to hit release schedules.
You have to not get blown past by everyone.
Otherwise, you know, what happened with DeepSeek, you know, crushing Meta and Mistral and Cohere, all these guys, they moved too slow, right?
They maybe were too methodical,
I don't know, they didn't hit the YOLO run,
whatever the reason was, maybe they weren't as skilled.
Whatever, you know, you can call it luck if you want,
but at the end of the day, it's skill.
So 2025 is the year of the YOLO run.
It seems like all the labs are like going in.
I think it's even more impressive
what OpenAI did in 2022, right?
At the time, no one believed in mixture of experts models
at Google, who had all the researchers.
OpenAI had such little compute,
and they devoted all of their compute for many months,
all of it, 100%, for many months to GPT-4,
with a brand new architecture, with no belief that,
hey, let me spend a couple hundred million dollars,
which is all of the money I have, on this model, right?
That is truly YOLO, right?
Now, you know, people are like, all these training run failures that are in the media, right? It's like, okay, great, but actually a huge chunk of my GPUs are doing inference. I still have a bunch doing research constantly, and yes, my biggest cluster is training, but on this YOLO run. But that YOLO run is much less risky than what OpenAI did in 2022.
Or maybe what DeepSeek did now,
or you know, like sort of like,
hey, we're just gonna throw everything at it.
The big winners throughout human history
are the ones who are willing to do YOLO at some point.
Okay, what do we understand about the hardware
it's been trained on? DeepSeek.
DeepSeek is very interesting. Maybe this is a good time to zoom out on who they are, first of all.
High Flyer is a hedge fund that has historically done quantitative trading in China as well as
elsewhere. And they have always had a significant number of GPUs. In the past, a lot of these high
frequency trading algorithmic quant traders used FPGAs,
but it shifted to GPUs definitely, and there's both.
GPUs especially. And High-Flyer, which is the hedge fund that owns DeepSeek, and everyone who works for DeepSeek is part of High-Flyer to some extent, same parent company, same owner, same CEO.
They had all these resources and infrastructure for trading, and then they devoted a humongous portion of them
to training models, both language models and otherwise, right? Because these techniques
were heavily AI influenced. More recently, people have realized, hey, trading with...
Even when you go back to Renaissance and all these quantitative firms, natural language
processing is the key to trading really fast, understanding a press release
and making the right trade. DeepSeek has always been really good at this.
And even as far back as 2021, they have press releases and papers saying,
like, hey, we're the first company in China with an A100 cluster this large, 10,000 A100 GPUs. This is in 2021.
Now, this wasn't all for training large language models.
This was mostly for training models
for their quantitative aspects, their quantitative trading,
as well as a lot of that was natural language processing,
to be clear.
And so this is the sort of history.
So verifiable fact is that in 2021,
they built the largest Chinese cluster,
at least they claim it was the largest cluster in China, 10,000 GPUs.
Before export controls started.
Yeah.
It's like they've had a huge cluster before any conversation of export controls.
So then you step it forward to like, what have they done over the last four years since
then, right?
Obviously, they've continued to operate the hedge fund, probably make tons of money.
And the other thing is that they've leaned more and more and more into AI.
The CEO, Liang Wenfeng.
You're not going to put me on the spot on this,
we discussed this before.
Liang Wenfeng, right, the CEO.
We're all fans.
Liang Wenfeng, he owns maybe a little bit more than half the company, allegedly, right?
Is an extremely like Elon Jensen kind of figure
where he's just like involved in everything, right?
And so over that time period
he's gotten really in-depth into AI. He actually has, if you see some of his statements, a bit of an e/acc vibe almost, right? Total AGI vibes, like we need to do this, we need to make a new ecosystem of open AI. We need China to lead on this sort of ecosystem, because historically the Western countries
have led on software ecosystems
and straight up acknowledges like in order to do this,
we need to do something different.
DeepSeek is his way of doing this.
Some of the translated interviews with him are fantastic.
So he has done interviews?
Yeah.
You think he would do a Western interview or no?
Or is there controls on the channel?
There hasn't been one yet, but I would try it.
I just got a Chinese translator, so it's great. This is a push. So fascinating figure, engineer
pushing full on into AI, leveraging the success from the high frequency trading.
Very direct quotes. We will not switch to closed source when asked about this stuff. Very long-term motivated in how the ecosystem of AI should work.
From a Chinese perspective, he wants a Chinese company to build this vision.
This is sort of like the quote unquote visionary behind the company.
This hedge fund still exists, this quantitative firm. And so slowly he got turned to this full view of AI, everything about this, and at some point it slowly maneuvered and he made DeepSeek. And DeepSeek has done multiple models since then. They've acquired
more and more GPUs. They share infrastructure with the fund. And so there is no exact public number of GPU resources that they have, but besides
this 10,000 GPUs that they bought in 2021, and they were fantastically profitable, and
then this paper claims they did only 2,000 H800 GPUs, which are a restricted GPU that
was previously allowed in China, but no longer allowed and there's a new version.
But it's basically Nvidia's H100 for China, right?
And there's some restrictions on it, specifically around the communications,
sort of speed, the interconnect speed, right?
Which is why they had to do this crazy SM scheduling stuff, right?
So going back to that, right?
It looks like this is obviously not true in terms of their total GPU count.
Obviously, available GPUs. But for this training run,
you think 2000 is the correct number or no?
So this is where it takes a significant amount
of sort of like zoning in, right?
Like what do you call your training run, right?
Do you count all of the research
and ablations that you ran, right?
Picking all this stuff,
because yes, you can do a YOLO run,
but at some level you have to do the test at the small scale and then you have to do some test at medium scale
before you go to a large scale.
Accepted practice is that for any given model
that is a notable advancement,
you're gonna do 2 to 4X compute
of the full training run in experiments alone.
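As a back-of-the-envelope sketch of that claim (the GPU-hour figure below is a placeholder, not a number from the conversation):

```python
# Back-of-the-envelope for the "2 to 4x the final run goes to experiments" claim.
# final_run_gpu_hours is a placeholder, not a disclosed figure.
final_run_gpu_hours = 1_000_000          # hypothetical cost of the headline run
experiment_multiplier = (2, 4)           # research and ablations on top of it

total_low  = final_run_gpu_hours * (1 + experiment_multiplier[0])
total_high = final_run_gpu_hours * (1 + experiment_multiplier[1])
print(f"total program cost: {total_low:,} to {total_high:,} GPU-hours "
      f"for a {final_run_gpu_hours:,} GPU-hour headline run")
```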
So a lot of this compute that's being scaled up
is probably used in large part at this time for research.
Yeah, and research will, you know,
research begets the new ideas
that let you get huge efficiency.
Research gets you O1.
Like research gets you breakthroughs
and you need to bet on it.
So some of the pricing strategy they will discuss
has the research baked into the price.
So the numbers that DeepSeek specifically said publicly,
are just the 10,000 GPUs in 2021,
and then 2000 GPUs for only the pre-training for V3. They did
not discuss cost on R1. They did not discuss cost on all the other RL for the instruct model that
they made. They only discussed the pre-training for the base model and they did not discuss
anything on research and ablations. They do not talk about any of the resources that are shared
in terms of, hey, the fund is using all these GPUs, right?
And we know that they're very profitable
and they had 10,000 GPUs in 2021.
So some of the research that we've found
is that we actually believe they have closer to 50,000 GPUs.
We as SemiAnalysis, we should say that you're
sort of one of the world experts in figuring out
what everybody's doing in terms of the semiconductor,
in terms of cluster build outs,
in terms of who is doing what in terms of training runs.
So yeah, so that's the we.
Okay, go ahead.
Yeah, sorry.
We believe they actually have something closer
to 50,000 GPUs, right?
Now this is split across many tasks, right?
Again, the fund, research and ablations.
Ballpark, how much would OpenAI or Anthropic have?
I think the clearest example we have, because Meta is also open, they talk about order of
60k to 100k H100 equivalent GPUs in their training clusters.
Right.
So like Llama 3, they trained on 16,000 H100s, right? But the company of Meta last year publicly disclosed they bought like 400-something thousand GPUs. Yeah. Right? So of course, a tiny percentage is on the training. Again, most of it is
serving me the best Instagram reels, right? Or whatever, right?
I mean, we could get into a cost of like, what is the cost of ownership for a 2000 GPU cluster,
10,000 like the, there's just different sizes of companies that can afford these things.
And DeepSeek is reasonably big. Their compute allocation, comparatively, is one of the top few in the world. It's not OpenAI, Anthropic, etc., but they have a lot of
compute. Can you in general actually just zoom out and also talk about the Hopper architecture,
the Nvidia Hopper GPU architecture and the difference between H100 and H800, like you
mentioned, the interconnects? Yeah, so there's, you know, Ampere was the A100
and then H100, Hopper, right?
People use them synonymously in the US
because really there's just H100
and now there's H200, right?
But same thing, mostly.
In China, they've had two,
there've been different salvos of export restrictions.
So initially the US government limited on a two-factor scale, right? Which is chip interconnect versus FLOPs. So any chip that had interconnect bandwidth above a certain level and floating-point operations above a certain level was restricted. Later the government realized that this was a flaw in the restriction and they cut it down to just floating-point operations. And so...
H800 had high FLOPs, low communication?
Exactly. So the H800 was the same performance as H100 on flops, right?
But it just had the interconnect bandwidth cut.
DeepSeek knew how to utilize this.
Hey, even though we're cut back on the interconnect, we can do all this fancy stuff to figure out how to use the GPU fully anyways.
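To make that logic concrete, here is a toy sketch of the two rounds of restriction just described, with an H800-like part as the example. The Chip fields, the limits, and all the numbers are illustrative assumptions, not the actual regulatory thresholds or NVIDIA specs.

```python
# Toy model of the control logic described above, with made-up thresholds:
# the first rule restricted chips only if BOTH interconnect and FLOPs were
# high; the revision restricts on FLOPs alone. Numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    tflops: float             # dense compute
    interconnect_gbps: float  # chip-to-chip bandwidth

FLOPS_LIMIT = 500.0           # hypothetical
INTERCONNECT_LIMIT = 600.0    # hypothetical

def restricted_v1(c: Chip) -> bool:   # two-factor rule
    return c.tflops > FLOPS_LIMIT and c.interconnect_gbps > INTERCONNECT_LIMIT

def restricted_v2(c: Chip) -> bool:   # FLOPs-only revision
    return c.tflops > FLOPS_LIMIT

# An H800-like part: full compute, cut interconnect. Allowed under v1, caught by v2.
h800_like = Chip("H800-like", tflops=900.0, interconnect_gbps=400.0)
print(restricted_v1(h800_like), restricted_v2(h800_like))  # False, True
```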
And so that was back in October, 2022,
but later in 2023, end of 2023, implemented in 2024,
the US government banned the H800, right?
And so by the way, this H800 cluster,
these 2000 GPUs was not even purchased in 2024, right?
It was purchased in late 2023.
And they're just getting the model out now, right?
Because it takes a lot of research, etc. The H800 was banned, and now there's a new chip called the H20. The H20 is cut back on only FLOPs, but the interconnect bandwidth is the same, and in fact in some ways it's better than the H100 because it has better memory bandwidth and memory capacity. So, you know, NVIDIA is working within the constraints of what the government sets, and then builds the best possible GPU for China.
Can we take this actual tangent,
and we'll return back to the hardware?
Is the philosophy, the motivation,
the case for export controls?
What is it?
Dario Amodei has published a blog post
about export controls.
The case he makes is that if AI becomes super powerful,
and he says by 2026 we'll have AGI or super powerful AI
and that's going to give a significant,
whoever builds that will have
a significant military advantage.
And so because the United States is a democracy
and as he says China is authoritarian
or has authoritarian elements,
you want a unipolar world where the super powerful military
because of the AI is one that's a democracy.
It's a much more complicated world geopolitically
when you have two superpowers with super powerful AI
and one is authoritarian.
So that's the case he makes.
And so we wanna, the United States wants to use export controls to slow down, to make
sure that China can't do these gigantic training runs that will be presumably required to build
AGI.
This is very abstract.
I think this can be the goal of how some people describe export controls, this super powerful AI. And you touched on the training run idea. There's not many worlds where China cannot train AI models. Export controls are
decapping the amount of compute or the density of compute that China can have. If you think about
the AI ecosystem right now, as all of these AI companies revenue numbers are up and to the right, AI usage is just continuing to grow, more GPUs are going to inference.
A large part of export controls, if they work, is just that: a very focused team can still get to the frontier of AI. These 2,000 GPUs are not that hard to get, all things considered, in the world.
They're still going to have those GPUs.
They're still going to be able to train models.
But if there's going to be a huge market for AI, if you have strong export controls and you want to have 100,000 GPUs just serving the equivalent of ChatGPT, with good export controls, it also just makes it so that AI can be used much less.
And I think that is a much easier goal to achieve than trying to debate on what AGI
is.
And if you have these extremely intelligent, autonomous AIs and data centers, like those
are the things that could be running in these GPU clusters in the United States, but not
in China.
To some extent, training a model does effectively nothing, right?
Yeah.
Like, they have a model.
The thing that Dario is sort of speaking to is the implementation of that model, once
trained, to then create huge economic growth, huge increases in military capabilities, huge
increases in productivity of people, betterment of lives, whatever you want to direct
super powerful AI towards, you can.
But that requires significant amounts of compute.
And so the US government has effectively said,
and forever, like training will always be a portion
of the total compute.
We mentioned Meta's 400,000 GPUs, only 16,000 made Llama.
So the percentage that Meta's dedicating to inference,
now this might be for recommendation systems
that are trying to hack our mind into spending more time
and watching more ads,
or if it's for a super powerful AI
that's doing productive things,
doesn't matter about the exact use
that our economic system decides,
it's that that can be delivered in whatever way we want.
Whereas with China, right, export restrictions, great.
You're never gonna be able to cut everything off.
And I think that's quite well understood
by the US government is that you can't cut everything off.
They'll make their own chips.
And they're trying to make their own chips.
They'll be worse than ours.
But the whole point is to just keep a gap.
And therefore at some point, as the AI... In a world of two, three percent economic growth, this is really dumb, by the way, right?
To cut off high tech and not make money off of it.
But in a world where super powerful AI comes about
and then starts creating significant changes in society,
which is what all the AI leaders
and big tech companies believe, I think,
super powerful AI is gonna change society massively.
And therefore this compounding effect
of the difference in compute is really important. There's some sci-fi out there where AI is measured in how much power is delivered to compute, right? That's sort of a way of thinking about what the economic output is: just how much power you're directing towards that AI. Should we talk about reasoning
models with this as a way that this might be actionable as something that people can actually see? So the reasoning models that are coming out
with R1 and O1, they're designed to use more compute. There's a lot of buzzy words in the AI
community about this test time compute, inference time compute, whatever. But Dylan has good
research on this. You can get to the specific numbers on the ratio of when you train a model,
you can look at things about the amount of compute use at training and amount of compute use at inference.
These reasoning models are making inference way more important to doing complex tasks.
In the fall, in December, OpenAI announced this o3 model. There's another thing in AI: when things move fast, we get both announcements and releases. Announcements are essentially blog posts where you pat yourself on the back and say you did things, and releases are when the model's out there, the paper's out there, etc.
So OpenAI has announced o3, and we can check if o3-mini is out as of this recording, potentially, but that doesn't really change the point, which is that the breakthrough result was something called the ARC-AGI task, the Abstraction and Reasoning Corpus, a task for artificial general intelligence.
François Chollet is the guy who's been behind it. It's a multi-year-old paper, it's a brilliant benchmark. And the number for OpenAI o3 to solve this was that it used some number of samples in the API. The API has thinking effort and number of samples. They used a thousand samples to solve this task, and it comes out to be like five to 20 dollars per question, which is effectively a math puzzle you're putting in.
And then it takes on the order of dollars to answer one question.
And this is a lot of compute.
If those are gonna take off in the US,
OpenAI needs a ton of GPUs on inference to capture this.
They have this OpenAI ChatGPT Pro subscription,
which is $200 a month.
Which Sam said they're losing money on.
Which means that people are burning a lot of GPUs
on inference and I've signed up with it,
I've played with it, I don't think I'm a power user,
but I use it and it's like, that is the thing
that a Chinese company, with medium-strength export controls, there will always be loopholes, might not be able to do at all.
And the main result for o3 is also a spectacular coding performance,
and if that feeds back into AI companies
being able to experiment better.
So presumably the idea is for an AGI,
a much larger fraction of the compute
will be used for this test time compute, for the reasoning.
The AGI goes into a room and thinks about how to take over the world and, you know, comes back in 2.7 hours. That's what's going to take a lot of compute.
This is what people like the CEOs or leaders of OpenAI and Anthropic talk about: autonomous AI models, which is you give them a task and they work on it in the background. My personal definition of AGI is much simpler. I
think language models are a form of AGI
and all of this super powerful stuff is a next step.
That's great if we get these tools,
but a language model has so much value in so many domains.
It is a general intelligence to me.
But this next step of agentic things
where they're independent and they can do tasks
that aren't in the training data
is what the few year outlook
that these AI companies are driving for.
I think the terminology here that Dario uses is super powerful AI.
So I agree with you on the AGI.
I think we already have something like that's exceptionally impressive that Alan Turing would for sure say is AGI.
But he's referring more to something once in possession of,
then you would have a significant military
and geopolitical advantage over other nations.
So it's not just like you can ask it how to cook an omelet.
And he has a much more positive view
in his essay, Machines of Loving Grace.
I've read into this,
I don't have enough background in physical sciences
to gauge exactly how confident I am in whether AI can revolutionize biology,
but I'm safe saying
that AI is going to accelerate the progress of any computational science.
So we're doing a depth first search here on topics, taking tangent of a tangent.
So let's continue on that depth first search.
You said that you're both feeling the AGI.
So what's your timeline? Dario's 2026 for the super powerful AI,
that's basically agentic to a degree
where it's a real security threat, that level of AGI.
What's your timeline?
I don't like to attribute specific abilities
because predicting specific abilities and when is very hard.
I think mostly if you're going to say that I'm feeling the AGI is that I expect continued rapid surprising
progress over the next few years. So something like R1 is less surprising to me from DeepSeq
because I expect there to be new paradigms where substantial progress can be made.
I think DeepSeq R1 is so unsteadily because we're kind of on this path with chat GPT. It's like,
it's getting better. It's getting better, it's getting better.
And then we have a new direction for changing the models.
And we took one step like this, and we took a step up.
So it looks like a really fast slope, and then we're going to just take more steps.
So it's just really unsettling when you have these big steps.
And I expect that to keep happening.
I've tried OpenAI Operator, I've tried Claude computer use.
They're not there yet.
I understand the idea,
but it's just so hard to predict what is the breakthrough
that will make something like that work.
And I think it's more likely that we have breakthroughs
that work and things that we don't know
what they're gonna do.
So everyone wants agents.
Dario has very eloquent way of describing this.
And I just think that it's like,
there's gonna be more than that,
so just expect these things to come.
I'm gonna have to try to pin you down to a date
on the AGI timeline.
Like the nuclear weapon moment.
So moment where on the geopolitical stage,
there's a real like,
because we're talking about export controls.
When do you think, just even if throw out a date,
when do you think that would be?
Like for me, it's probably after 2030.
So I'm not as-
That's what I would say.
So define that, right?
Because to me, it kind of almost has already happened,
right?
You look at elections in India and Pakistan,
people get AI voice calls
and think they're talking to the politician.
The AI diffusion rules, which were enacted in the last couple of weeks of the Biden admin, and it looks like the Trump admin will keep and potentially even strengthen, limit cloud computing
and GPU sales to countries that are not even related to China.
This is-
Portugal and all these normal countries are on the "you need approval from the US" list.
Yeah, Portugal and, you know, all these countries that are allies, right?
Singapore, right?
Like they freaking have F-35s
and we don't let them buy GPUs.
Like this is, this to me is already to the scale of like,
you know.
Well, that just means that the US military
is really nervous about this new technology.
That doesn't mean the technology is already there.
So like they might be just very cautious
about this thing that they don't quite understand.
But that's a really good point.
So the robocalls, swarms of semi-intelligent bots
could be a weapon,
could be doing a lot of social engineering.
I mean, there's tons of talk about,
you know, from the 2016 elections
like Cambridge Analytica and all this stuff,
Russian influence. I mean, every country in the world is pushing stuff onto the internet and has
narratives they want. Every technically competent country, whether it's Russia, China, the US, Israel, et cetera, right? People are pushing viewpoints onto the internet en masse, and language models crash the
cost of very intelligent sounding language. There's some research that shows that the distribution is actually the limiting factor.
So language models haven't yet made misinformation particularly, like, change the equation there.
The internet is still ongoing.
I think there's a blog, AI Snake Oil, and some of my friends at Princeton that write
on this stuff.
So there is research.
It's a default assumption, and I would have thought the same thing, that misinformation is going to get far worse with language models. But in terms of internet
posts and things that people have been measuring, it hasn't been an exponential increase or
something extremely measurable. And things you're talking about with voice calls and stuff like that,
it could be in modalities that are harder to measure. So it's too soon to tell. I think political instability via the web is monitored by a lot of researchers to see what's happening.
I think that you're asking about like the AGI thing. If you make me give a year, I'm going to
be like, okay, I have AI CEOs saying this. They've been saying two years for a while. People like Dario, the Anthropic CEO, have thought about this so deeply.
I need to take their word seriously, but also understand that they have different incentives.
So I would be like, add a few years to that, which is how you get something similar to 2030
or a little after 2030. I think to some extent we have capabilities that hit a certain point where any one person
could say, oh, okay, if I can leverage those capabilities for X amount of time, this is
AGI, right?
Call it 27, 28.
But then the cost of actually operating that capability is so, so extreme that no one can
actually deploy it at scale, en masse, to actually completely revolutionize the economy at the snap of a finger.
So I don't think it will be like a snap of the finger moment.
It's a physical constraint.
Rather it'll be a, you know,
oh, the capabilities are here
but I can't deploy it everywhere, right?
And so one simple example, going back sort of to 2023, was when, you know, Bing with GPT-4 came out and everyone was freaking out about search, right? Perplexity came out. If you did the cost on, hey, implementing GPT-3 into every Google search, it was like, oh okay, this is just physically impossible to implement. And as we step forward, going back to the test time compute thing, a query, you know, you ask ChatGPT a question, it costs cents, right, for their most capable model to get a query back.
To solve an ARC-AGI problem, though, costs five to 20 bucks.
And this is a 1,000 to 10,000x factor difference in cost to respond to a query versus do a
task.
And the task of ARC-AGI is, it's simple to some extent, but it's also, like, what are the tasks that we want?
Okay, AGI, quote unquote, what we have today, can do ARC-AGI.
Three years from now, it can do much more complicated problems, but the cost is going
to be measured in thousands and thousands and hundreds of thousands of dollars of GPU
time, and there just won't be enough power, GPUs, infrastructure to operate this and therefore
shift everything in the world on the snap of the finger.
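As a rough worked version of that arithmetic; the per-query costs are assumptions chosen to sit in the "costs cents" range mentioned above, not quoted prices:

```python
# Rough arithmetic behind the "1,000 to 10,000x" cost gap described above.
# Per-query costs are ballpark assumptions, not quoted prices.
query_cost_low, query_cost_high = 0.002, 0.005   # dollars per chat query
task_cost_low, task_cost_high   = 5.0, 20.0      # dollars per ARC-AGI-style task

print(f"low end:  {task_cost_low / query_cost_high:,.0f}x")   # ~1,000x
print(f"high end: {task_cost_high / query_cost_low:,.0f}x")   # ~10,000x
```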
But at that moment, who gets to control and point the AGI at a task?
And so this was in Dario's post that he's like, hey, China can effectively and more quickly than us
point their AGI at military tasks, right?
And they have been in many ways faster at adopting certain new technologies into their
military, especially with regards to drones. The US maybe has a long-standing large air,
fighter jet type of thing, bombers, but when it comes to asymmetric arms such as drones,
they've completely leapfrogged the US and the West. And the fear that Dario is sort of pointing out there, I think,
is that, yeah, great, we'll have AGI in the commercial sector. The US military won't be
able to implement it super fast. Chinese military could, and they could direct all their resources
to implementing it in the military and therefore solving military logistics or solving some other
aspect of disinformation targeted at a certain set of people so they can flip a country's politics, or something like that that is actually catastrophic. Versus the US, where it'll be more capitalistically allocated just towards whatever is the highest return on investment,
which might be like building factories better or whatever.
So everything I've seen,
people's intuition seems to fail on robotics.
So you have this kind of general optimism.
I've seen this on self-driving cars.
People think it's much easier problem than it is.
Similar with drones.
Here I understand it a little bit less,
but I've just seen the reality of the war in Ukraine
and the usage of drones at both sides.
And it seems that humans still far outperform
any fully autonomous systems.
AI is an assistant, but humans drive FPV drones
where the humans control most of it,
just far, far, far outperforms AI systems.
So I think it's not obvious to me
that we're going to have swarms of autonomous robots
anytime soon in the military context.
Maybe the fastest I can imagine is 2030,
which is why I said 2030 for the super powerful AI.
Whenever you have large scale swarms of robots
doing military actions,
that's when the world just starts to look different to me.
So that's the thing I'm really worried about.
But there could be cyber war,
a cyber war type of technologies that,
from social engineering to actually just swarms of robots
that find attack vectors in our code bases
and shut down power grids, that kind of stuff.
And it could be one of those things like
on any given weekend or something, power goes out,
nobody knows why,
and the world changes forever.
Just power going out for two days
in all of the United States,
that will lead to murder, to chaos.
But going back to export controls, do you see that as a useful way to control the balance of power geopolitically in the context of AI?
I think going back to my viewpoint is if you believe we're in the sort of stage of economic
growth and change that we've been in for the last 20 years, the export controls are absolutely
guaranteeing that China will win long term, right? If you do not believe AI is
going to make significant changes to society in the next 10 years or five years, right? Five-year
timelines are sort of what the more executives and such of AI companies and even big tech companies
believe, but even 10-year timelines, it's reasonable. But once you get to, hey, these timelines are below that time period, then the
only way to create a sizeable advantage or disadvantage for America versus China is if
you constrain compute because talent is not really something that's constraining. China
arguably has more talent, more STEM graduates, more programmers.
The US can draw upon the world's people, which it does.
There's tons of foreigners in the AI industry.
So many of these AI teams are all people
without a US passport.
Yeah, I mean, many of them are Chinese people
who are moving to America, right?
And that's great.
That's exactly what we want, right?
But that talent is one aspect,
but I don't think that's one that is a measurable advantage for the US or not.
It truly is just whether or not compute, right? Now, even on the compute side,
when we look at chips versus data centers, right? China has the unprecedented ability to build
ridiculous sums of power, clockwork. They're always building more and more
power. They've got steel mills that individually are the size of the
entire US industry. And they've got aluminum mills that consume
gigawatts and gigawatts of power. And when we talk about what's the
biggest data center, OpenAI made this huge thing about Stargate, their
announcement there. That's like once it's fully built out in a few years,
it'll be two gigawatts of power.
And this is still smaller than the largest
industrial facilities in China.
China, if they wanted to build the largest data center
in the world, if they had access to the chips, could.
So it's just a question of when, not if.
So their industrial capacity far exceeds the United States?
Exactly.
To manufacture stuff.
Yeah.
So long-term, they're going to be manufacturing chips there?
Chips are a little bit more specialized.
I'm specifically referring to the data centers, right?
Chips, fabs take huge amounts of power,
don't get me wrong.
That's not necessarily the gating factor there.
The gating factor on how fast people can build
the largest clusters today in the US is power.
It could be power generation, power transmission, substations, and all these sorts of transformers
and all these things.
Building the data center, these are all constraints on the US industry's ability to build larger
and larger training systems as well as deploying more and more
inference compute. I think we need to make the point clear on why the time is now for people
that don't think about this because essentially with export controls you're making it so China
cannot make or get cutting-edge chips and the idea is that if you time this wrong China is pouring
a ton of money into their chip production, and if you time it wrong, they are going to have more capacity for production, more capacity for energy,
and figure out how to make the chips and have more capacity than the rest of the world to make the
chips because everybody can buy, they're gonna sell their Chinese chips to everybody, they might
subsidize them, and therefore if AI takes a long time to become differentiated, we've decapped the
financial performance of American companies. Nvidia can sell less, TSMC cannot sell to China, so therefore we have less demand to keep driving the production
cycle. So that's the assumption behind the timing being important.
If it's beyond that five- or ten-year window, right, China will win because of these restrictions long term, unless AI does something in the short term,
which I believe AI will do, make massive changes to society in the medium short term.
And so that's the big unlocker there. And even today, if Xi Jinping decided to get,
you know, quote unquote, scale-pilled, i.e., decide that scaling laws are what matters, right?
Just like the US executives like Satya Nadella and Mark Zuckerberg and Sundar
and all these US executives
of the biggest, most powerful tech companies
have decided they're scale-pilled
and they're building multi-gigawatt data centers, right?
Whether it's in Texas or Louisiana or Wisconsin,
wherever it is, they're building these massive things
that cost as much as their entire budget
for spending on data centers globally in one spot. This is what they've committed to for next year,
year after, et cetera. They're so convinced that this is the way, that this is what they're doing.
But if China decided to, they could do it faster than us, but this is where the restrictions come
in. It is not clear that China as a whole
has decided from the highest levels that this is a priority. The US sort of has, right? You see
Trump talking about DeepSeek and Stargate within the same week, right? And the Biden admin
as well had a lot of discussions about AI and such. It's clear that they think about it. Only
just last week did DeepSeek meet the second in command of China. They
have not even met the top, right? They haven't met Xi. Xi hasn't sat down. And they only
just released a subsidy of a trillion RMB, roughly $160 billion, which is closer to the
spending of Microsoft and Meta and Google combined for this year. So it's like they're
realizing it just now,
but that's where the export restrictions come in and say,
hey, you can't ship the most powerful US chips to China.
You can ship a cut down version.
You can't ship the most powerful chips
to all these countries who we know
we're just gonna rent it to China.
You have to limit the numbers, right?
And the tools.
So you can't- And same with equipment, tools, all these different aspects.
But it all stems from AI and then what downstream can slow them down in AI.
And so the entire semiconductor restrictions, you read them, they are very clear.
It's about AI and military civil fusion of technology, right?
It's very clear.
And then from there it goes, oh, well, we're banning them from buying lithography tools and etch tools and deposition tools, and, oh, this random subsystem from a random company that's tiny, right? Why are we banning this? Because all of it, the US government has decided, is critical to AI systems.
I think the fulcrum point is like the transition from 7 nanometer to 5 nanometer chips, where I think it was Huawei that had the 7 nanometer chip a few years ago, which caused another political brouhaha almost like this moment, and then it's the ASML deep UV stuff.
So in 2020, Huawei released their Ascend 910 chip, which was an AI chip, the first one on 7nm, before Google did, before NVIDIA did.
And they submitted it to the MLPerf benchmark, which is sort of an industry standard for
machine learning performance benchmark.
And it did quite well.
And it was the best chip at the submission, right?
This was a huge deal.
The Trump admin, of course, banned Huawei from getting 7nm chips from TSMC, and so then
they had to switch to using internal domestically produced chips, which was a multi-year setback.
Many companies have done 7nm chips, and the question is, we don't know how much Huawei
was subsidizing production of that chip.
Intel has made 7nm chips that are not profitable and things like this.
So this is how it all feeds back into the economic engine
of export controls.
Well, so you're saying that for now,
Xi Jinping has not felt the AGI,
but it feels like the DeepSeek moment might,
like there might be meetings going on now
where he's gonna start wearing the same T-shirt and things are going to escalate.
I mean, like this, he may have woken up last week, right?
Liang Wenfeng met the vice chair, the second in command guy, and they had a meeting. And then the next day, they announced the AI subsidies, which are a trillion RMB.
It's possible that this DeepSeek moment is truly the beginning of a cold war.
That's what a lot of people are worried about. People in AI have been worried about this. It's possible that this DeepSeek moment is just some mass hysteria that happened, that eventually led to Xi Jinping having meetings
and waking up to this idea.
And the US government realized, on October 7th, 2022, before ChatGPT released, that restriction on October 7th, which dropped and shocked everyone, was very clearly aimed at AI.
Everyone was like, what the heck are you doing?
Stable Diffusion was out then, but not ChatGPT.
Yeah, but not ChatGPT.
So there were starting to be rumblings of what gen AI can do to society.
But it was very clear, I think, to at least like National Security Council and those sort
of folks that this was where the world is headed, this Cold War that's happening.
So are there any concerns that the export controls push China to take military action on Taiwan?
This is the big risk, right? The further you push China away from having access to cutting-edge American and global technologies,
the more likely they are to say, well, because I can't access it, I might as well, like, no one should access it, right?
And there's a few interesting aspects
of that. China has a urban-rural divide like no other. They have a male-female birth ratio
like no other to the point where if you look in most of China, it's like the ratio is not that bad,
but when you look at single dudes in rural China, it's like a 30 to 1 ratio. And those are
disenfranchised dudes. Quote unquote, like the US has an incel problem, China does too, it's just they're placated in some way or crushed down.
What do you do with these people?
And at the same time, you're not able to access
the most important technology, at least the US thinks so.
China's maybe starting to think
this is the most important technology
by starting to dump subsidies in it, right?
They thought EVs and renewables
were the most important technology.
They dominate that now, right? Then they started thinking about semiconductors in the late
2010s and early 2020s. Now they've been dumping money and they're catching up rapidly. They're
going to do the same with AI, because they're very talented. The question is, when does this hit a breaking point? If China sees this as, hey, we can't continue, and if not having access means that starting a true hot war, taking over Taiwan or trying to subvert its democracy in some way or blockading it, hurts the rest of the world far more than it
hurts them, this is something they could potentially do, right? And so is this pushing them towards that?
Potentially, right?
I'm not quite a geopolitical person,
but it's obvious that the world regime of peace
and trade is super awesome for economics,
but at some point it could break, right?
I think we should comment that the,
why Chinese economy would be hurt by that is that
they're export heavy. I think the United States buys so much. If that goes away, that's how their
economy falls. Also, they just would not be able to import raw materials from all over the world.
The US would just shut down trade through the Strait of Malacca. At the same time, for the US, you could argue
almost all the GDP growth in America since the 70s has been either population
growth or tech.
Your life today is not that much better than someone from the 80s outside of tech.
Cars, they all have semiconductors in them everywhere.
Fridges, semiconductors everywhere.
There's these funny stories about how Russians were taking apart laundry machines because they had certain Texas Instruments chips that they could then repurpose and put into their anti-missile missile things, right, like their S-400 or whatever.
You would know more about this, but there's all sorts of stories like that. So, the story of semiconductors, and maybe also how the United States can break the reliance on TSMC?
I don't think it's necessarily breaking the reliance.
I think it's getting TSMC to build in the US.
But taking a step back, right?
TSMC produces most of the world's chips, right?
Especially on the foundry side.
You know, there's a lot of companies
that build their own chips.
Samsung, Intel, you know, STMicro,
Texas Instruments, you know, analog devices.
All these kinds of companies build their own chips, NXP.
But more and more of these companies
are outsourcing to TSMC,
and have been for multiple decades.
Can you explain the supply chain there and where most of TSMC is in terms of manufacturing?
Sure. So historically, the supply chain was, companies would build their own chips. There would be a company started, they'd design the chip, build the chip, and sell it.
Over time, this became really difficult
because the cost of building a fab continues to compound
every single generation.
Of course, figuring out the technology for it is incredibly difficult regardless, but
just the dollars and cents that are required, ignoring, saying, hey, yes, I have all the
technical capability, which it's really hard to get that by the way, right?
Intel's failing, Samsung's failing, et cetera.
But if you look at just the dollars to spend to build that next generation fab, it keeps
growing, right?
Sort of like, you know, Moore's Law is halving the cost of chips every two years.
There's a separate law that's sort of like doubling the cost of fabs every handful of years.
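A toy compounding sketch of that second observation, with a made-up starting cost and cadence rather than real industry figures:

```python
# Compounding sketch of "fab cost doubles every handful of years."
# The base cost and the cadence here are placeholders, not industry data.
base_fab_cost_billion = 5.0     # hypothetical leading-edge fab cost at year 0
doubling_period_years = 5       # "every handful of years"

for years in (0, 5, 10, 15, 20):
    cost = base_fab_cost_billion * 2 ** (years / doubling_period_years)
    print(f"year {years:2d}: ~${cost:.0f}B per leading-edge fab")
```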
And so you look at a leading edge fab that is going to be profitable today,
that's building, you know, three nanometer chips or two nanometer chips in the future.
That's going to cost north of 30, 40 billion dollars, right?
And that's just for like a token amount. That's like the base building block.
You probably need to build multiple, right?
And so when you look at the industry over the last,
if I go back 20, 30 years ago,
there were 20, 30 companies that could build
the most advanced chips,
and then they would design them themselves and sell them.
So companies like AMD would build their own chips.
Intel, of course, still builds their own chips.
They're very famous for it.
IBM would build their own chips,
and you could
just keep going down the list. All these companies built their own chips. Slowly
they kept falling like flies and that's because of what TSMC did, right? They
created the Foundry business model which is I'm not gonna design any chips. I'm
just gonna contract manufacture chips for other people. And one of their
early customers is Nvidia, right? Nvidia is the only semiconductor company
that's worth, you know,
that's doing more than a billion dollars of revenue
that was started in the era of Foundry, right?
Every other company started before then
and at some point had fabs,
which is actually incredible, right?
You know, like AMD and Intel and Broadcom.
Such a great fact.
It's like everyone had fabs at some point,
or, you know, some companies like Broadcom, it was like
a merger amalgamation of various companies that rolled up.
But even today, Broadcom has fabs.
They build iPhone RF radio chips in Colorado for Apple.
All these companies had fabs and for most of the fabs, they threw them away or sold
them off or they got rolled into something else.
And now everyone relies on TSMC, right? Including Intel, their latest PC chip uses TSMC chips,
right? It also uses some Intel chips, but it uses TSMC process.
Can you explain why the Foundry model is so successful for these companies? Why are they
going with TSMC?
Economies of scale.
Scale.
Yeah. So I mean, like I mentioned, right, the cost of building a fab is so high. The R&D is so difficult.
And when you look at like these, like companies that had their own vertical stack, there was
an antiquated process of like, okay, like I'm so hyper customized to each specific chip,
right?
But as we've gone through the history of sort of like the last 50 years of electronics and
semiconductors, A, you need more and more specialization, right?
Because Moore's Law has died.
Dennard scaling has died, i.e. chips are not getting better just for free, right, you know,
from manufacturing.
You have to make real architectural innovations, right.
Google is not just running on Intel CPUs for web serving.
They have a YouTube chip, they have TPUs, they have pixel chips, they have a wide diversity
of chips that, you know, generate all the economic value of Google.
It's running all the services and stuff.
This is just Google and you could go across any company in the industry and it's like
this.
Cars contain 5,000 chips, 200 different varieties of them.
All these random things.
A Tesla door handle has two chips.
It's ridiculous.
It's a cool door handle.
You don't think about it, but it has two really cheap, like penny chips in there, right? Anyway, so as you have more diversity
of chips, as you have more specialization required, and the cost of fabs continues to grow, you need
someone who is laser focused on building the best process technology and making it as flexible as
possible. I think you could say it simply, which is the cost per fab goes up. And if you are a
small player
that makes a few types of chips, you're not going to have the demand to pay back the cost of the fab.
Whereas TSMC can have many different customers and aggregate all this demand into one place.
And then they're the only person that makes enough money building chips to build the next fab.
So this is kind of why the companies slowly get killed: they have, 10 years ago, a chip that is profitable and is good enough, but the cost to build the next one goes up. They may try to do this and fail because they don't have the money to make it work, and then they don't have any chips. Or they build it and it's too expensive, and they just have more failure points, right?
You know, you could have one little process related to some sort of chemical etch or some sort of plasma etch, some little process that screws up. You didn't engineer it right, and now the whole company falls apart, you can't make chips, right? And so super, super powerful companies like Intel, they had the ability to weather the storm, like, hey, they still exist today even though they really screwed up their manufacturing six, seven years ago.
But in the case of like AMD, they almost went bankrupt.
They had to sell their fabs to Mubadala UAE, right?
And like that became a separate company
called Global Foundries, which is a foundry firm.
And then AMD, on the way back up, was able to focus on, hey,
let's focus on making chiplets
and a bunch of different chips for different markets
and focusing on specific workloads
rather than all of these different things.
And so you get more diversity of chips.
You have more companies than ever designing chips,
but you have fewer companies
than ever manufacturing them, right?
And this is where TSMC comes in
is they've just been the best, right?
They are so good at it, right?
They're customer focused.
They make it easy for you to fabricate your chips.
They take all of that complexity
and like kind of try and abstract a lot of it away from you.
They make good money.
They don't make insane money, but they make good money.
And they're able to aggregate all this demand
and continue to build the next fab, the next fab,
the next fab.
So why is Taiwan so special for TSMC?
Why is it happening there?
Can it be replicated inside the United States?
Yeah, so there's aspects of it that I would say yes
and aspects that I'd say no, right?
TSMC is way ahead because former executive,
Morris Chang of Texas Instruments,
wasn't promoted to CEO and he was like,
screw this, I'm gonna go make my own chip company, right?
And he went to Taiwan and made TSMC, right?
And there's a whole lot more story there. So Texas Instruments could have been the T, you know, it could have been TSMC, but the Texas Semiconductor Manufacturing Company, right, instead of, you know, Texas Instruments. But, you know, so there is that whole story there, sitting here in Texas.
I mean, and that sounds like a human story, like he didn't get promoted.
Just the brilliance of Morris Chang, you know, which I wouldn't underplay, but there's also a
different level of how this works. In Taiwan, the top percent of students that go to the best school,
which is NTU, the top percent of those all go work at TSMC. Guess what their pay is? Their starting
pay is $80,000, $70,000, right?
Which is like, that's like starting pay
for like a good graduate in the US, right?
Not the top, the top graduates are making
hundreds of thousands of dollars
at the Googles and the Amazons,
and now I guess the open AIs of the world, right?
So there is a large dichotomy of like,
what is the top 1% of the society doing?
And where are they headed because of economic reasons, right?
Intel never paid that crazy good, right? And it didn't make sense to them. That's one aspect. Where
is the best going? Second is the work ethic. We like to work, you work a lot, we work a
lot. But at the end of the day, what is the time and amount of work that you're doing
and what does a fab require? Fabs are not work-from-home jobs. You go into the fab and it's grueling work, right? Hey, if there is any amount of vibration, right, an earthquake happens and vibrates the machines, they're either broken or you've scrapped some of your production, and in many cases they're not calibrated properly. So when TSMC, when there's an earthquake, right? Recently there's been an earthquake.
TSMC doesn't call their employees.
They just go to the fab and like,
they just show up, the parking lot gets slammed
and people just go into the fab and fix it, right?
It's like ants, right? A hive of ants doesn't get told by the queen what to do. The ants just know. It's like one person just specializes on this one task.
And it's like you're going to take this one tool and you're the best person in the world
and this is what you're going to do for your whole life is this one task in the fab. Which
is like some special chemistry plus nanomanufacturing on one line of tools that continues to get
iterated. And yeah, it's just like, it's a specific plasma etch for removing silicon dioxide, right? That's all you focus on your whole career, and it's such a specialized thing. And so it's not like the tasks are transferable.
AI today is awesome because like people can pick it up like that. Semiconductor manufacturing is
very antiquated and difficult. None of the materials are online for people to read easily
and learn, right? The papers are very dense and like it takes a lot of experience to learn. And
so it makes the barrier to entry much higher too. So when you talk about, hey, you have all these people that are super specialized, they will work 80 hours a week in a factory, in a fab, and if anything goes wrong, they'll show up in the middle of the night because of some earthquake. Their wife's like, there was an earthquake. He's like, great, I'm gonna go to the fab. Would you, as an American, do that? Right? It's like these sorts of things are, I guess, what exemplify why TSMC is so amazing.
Now, can you replicate it in the US?
Let's not ignore Intel was the leader in manufacturing
for over 20 years.
They brought every technology to market first, besides EUV: strained silicon, high-K metal gates, FinFET, and the list goes on and on of technologies that Intel brought to market first, made the most money from, and manufactured at scale first, best, at the highest profit margins, right?
So we shouldn't say that Intel can't do this, right? It's that the culture has broken, right?
You've invested in the wrong things,
they said no to the iPhone.
They had all these different things regarding like,
mismanagement of the fabs, mismanagement of designs,
this lockup, right?
And at the same time, all these brilliant people, right?
These like 50,000 PhDs or masters
that have been working on specific chemical
or physical processes or nanomanufacturing processes
for decades, in Oregon, they're still
there.
They're still producing amazing work.
It's just that getting it to the last mile of production at high yield, where you can manufacture dozens and hundreds of different kinds of chips with a good customer experience, has broken, right? It's that customer experience.
Part of it is like people will say Intel was too pompous in the 2000s, 2010s.
They just thought they were better than everyone.
The tool guys were like, oh, I don't think that this is mature enough.
They're like, you just don't know.
We know.
This sort of stuff would happen.
And so can the US bring leading edge semiconductor manufacturing to the US?
Emphatically, yes.
And we are.
It's happening.
Arizona is getting better and better as time goes on. TSMC has built roughly 20% of their capacity
for five nanometer in the US, right?
Now this is nowhere near enough, right?
20% of capacity in the US is like nothing, right?
And furthermore, this is still dependent
on Taiwan existing, right?
There's sort of important way to separate it out.
There's R&D and there's high volume manufacturing.
Effectively, there are three places in the world
that are doing leading edge R&D.
There's Hsinchu, Taiwan, there's Hillsboro, Oregon, and there is Pyeongtaek, South Korea.
These three places are doing the leading edge R&D
for the rest of the world's leading edge semiconductors.
Now, manufacturing
can be distributed more globally. This is where this dichotomy exists of who's actually modifying the process, who's actually developing the next generation one, who's improving them: it's Hsinchu, it's Hillsboro, it's Pyeongtaek. It is not the rest of these fabs, like Arizona, right? Arizona is a paperweight. If Hsinchu disappeared off the face of the planet, you know, within a year, a couple years, Arizona would stop producing too, right?
It's actually like pretty critical.
One of the things I like to say is if I had like a few missiles,
I know exactly where I could cause the most economic damage, right?
It's not targeting the White House, right?
It's the R&D centers.
It's the R&D centers for TSMC, Intel, Samsung,
and then some of the memory guys, Micron and Hynix.
Because they define the future evolution
of these semiconductors and everything's moving so rapidly
that it really is fundamentally about R&D.
And it is all about TSMC, huh?
And so TSMC, you know, you cannot purchase a vehicle without TSMC chips, right? You cannot purchase a fridge without TSMC chips.
I think one of the few things you can purchase, ironically, is a Texas Instruments graphing
calculator, right?
Because they actually manufacture in Texas.
But outside of that, a laptop, a phone, servers, GPUs, none of this stuff can exist.
And this is all without TSMC. And in many cases, it's not even the leading edge, you know, sexy five nanometer chip, three nanometer chip, two nanometer chip. Oftentimes it's just some stupid power IC that's converting from some voltage to another, right? And it's made at TSMC, right?
This is something China is investing in as well. They can build out this long tail fab where the techniques are much more known. You don't have to figure out these problems with EUV. They're investing in this, and then they have large supply for things like the car door handles and the random stuff.
And that trickles down into this whole economic discussion as well, which is they have far
more than we do.
And having supply for things like this is crucial to normal life.
So they're doing the, they're starting to invest in high volume manufacture,
but they're not doing R&D.
So they do R&D on their own. They're just way behind, right? So I would say, like, in 2015, China had a five-year plan where they defined certain goals for 2020 and 2025, including something like 80% domestic production of semiconductors. They're not going to hit that, right, to be clear, but they are in certain areas really, really close, right? BYD is probably going to be the first company in the world to not have to use TSMC, because they have their own fabs for making chips. Now, they still have to buy some chips from foreign suppliers, for example for self-driving ADAS capabilities, because those are really high-end. But, like, an internal combustion engine has 40 chips just for controlling flow rates and all these things, and EVs are even more complicated.
So all these different power ICs and battery management controllers and all these things,
they're insourcing, right?
And this is something that China has been doing since 2015. Now, as far as like the
trailing edge, they're getting so much capacity there. As far as the leading edge, right? IE,
this five nanometer and so on, so forth, right? Where GPUs, they are still behind. And this is,
the US restrictions are trying to stop them in the latter. But all that's happened is yes,
they've slowed down their five nanometer, three nanometer, etc., but they've accelerated their, hey, 45 nanometer, 90 nanometer power IC or analog
IC or random chip in my keyboard, right?
That kind of stuff.
So there is an angle where the US's actions, from the angle of the export controls, have been so inflammatory at slowing down China's progress on the leading edge that they've turned around and accelerated their progress elsewhere, because they know that this is so important, right? If the US is going to lock them out here, they might lock us out in the trailing edge as well.
And so going back, can the US build it here?
Yes, but it's going to take a ton of money.
I truly think like to revolutionize and completely in-source semiconductors would take a decade and a trillion dollars. Is some of it also
culture? Like you said, extreme competence, extreme work ethic in Taiwan. I think if
you have the demand and the money is on the line the American companies figure
it out. It's gonna take hand-holding with the government. I think that the culture
helps TSMC break through and it's easier for them. TSMC has some like 90,000 employees, right?
It's not actually that insane amount.
The Arizona Fab has 3,000 from Taiwan.
And these people, like, their wives were like, yeah, we're not going to have kids unless you sign up for the Arizona fab. We go to Arizona and we have our kids there. There's also a Japan fab where the same thing happened, right? And so these wives drove these dudes to go to Japan or America to have the kids there. And it's, like, an element of culture.
Yeah, sure, Taiwan works that hard, but also the US has done it in the past. They could do it now. Right? Um, you know, we can just import, I say import, the best people in the world if we want to.
That's where the immigration conversation is a tricky one.
And there's been a lot of debate over that, but yeah,
it seems absurdly controversial to import the best people in the world. I don't understand why it's controversial.
That's one of the ways of winning.
I'm sure we agree with you.
Even if you can't import those people, I still think you could do a lot to manufacture most of it in the US, if the money's there.
It's just way more expensive.
It's not profitable for a long time.
The context is that the CHIPS Act is only $50 billion, relative to some of the renewable initiatives that were passed in the Inflation Reduction Act and the Infrastructure Act, which total in the hundreds of billions of dollars.
The amount of money that the US is spending on the semiconductor industry is nothing.
Whereas all these other countries have structural advantages, in terms of work ethic and amount of work and things like that, but also the number of STEM graduates and the percentile of their best going to that, right? But they also have differences in terms of, hey, there are just tax benefits in the law, and they have been in the law for 20 years, right? And then some countries have massive subsidies. China has something like $200 billion of semiconductor subsidies a year. We're talking about $50 billion in the US over, like, six years. The gulf in the subsidy amounts is also huge. I think Trump has been talking about tariffing Taiwan recently. That's one of these things that's like,
oh, okay, well, like, you know,
maybe he doesn't want to subsidize
the semiconductor industry.
Obviously, tariffing Taiwan is gonna cause a lot of things to get much more expensive,
but does it change the equation for TSMC building
more fabs in the US?
That's what he's sort of positing, right?
So can you lay out the, so we laid out the importance.
By the way, it's incredible how much you know
about so much.
We told you Dylan knows all the stuff.
Yeah, so, okay, you laid out why TSMC is really important.
If we look out into the future, 10, 20 years out,
US-China relationship seems like it can go
to a dark place of Cold War,
escalated Cold War, even hot war,
or to a good place of anything from frenemies
to cooperation, to working together.
So in this game theory, complicated game,
what are the different trajectories?
What should US be doing?
Like, what do you see as the different possible trajectories of US-China relations as both leaders start to feel the AGI more and more and see the importance of chips and the importance of AI?
I mean, ultimately, the export controls are pointing towards a separate future economy.
I think the US has made it clear to Chinese leaders that we intend to control this technology
at whatever cost to global economic integration.
It's hard to unwind that.
The card has been played.
To the same extent, they've also limited US companies from entering China, right?
It's been a long time coming.
At some point, there was a convergence, right?
But over at least the last decade,
it's been branching further and further out, right?
US companies can't enter China.
Chinese companies can't enter the US.
The US is saying, hey, China, you
can't get access to our technologies in certain areas.
And China's rebutting with the same kind of thing. They've done export controls on some specific materials, you know, gallium and things like that, that they've tried to limit the US on.
There's a US drone company that's not allowed to buy batteries from China, and they have military customers. And this drone company just tells the military customers, like, hey, just get it from Amazon, because I can't actually physically get them, right?
Like there's all these things that are happening
that point to further and further divergence.
I have zero idea.
And I would love if we could all hold hands
and sing Kumbaya,
but like I have zero idea how that could possibly happen.
Is the divergence good or bad for avoiding war?
Is it possible that the divergence
in terms of manufacturing chips, of training AI systems, is actually good for avoiding military conflict?
It's an objective fact that the world has been the most peaceful it has ever been when there are global hegemons, right? Or regional hegemons, right? In historical context, right? The Mediterranean was the most peaceful ever when the Romans were there, right? China had very peaceful and very warring times, and the peaceful times were when dynasties had a lockhold over not just themselves but all their tributaries around them.
And likewise, the most peaceful time in human history has been when the US was the global
hegemon, the last decades.
Now we've sort of seen things start to slide with Russia-Ukraine, with what's going on
in the Middle East and Taiwan risk. All these different things are starting to bubble up, still objectively extremely peaceful. Now,
what happens when it's not one global hegemon, but it's two, obviously, and China will be
competitive or even overtake the US like it's possible, right? And so this change in global
hegemony, I don't think it ever happens like super peacefully, right, when empires fall, right,
which is a possible trajectory for America,
they don't fall gracefully, right,
like they don't just slide out of irrelevance,
usually there's a lot of shaking.
And so, you know, what the US is trying to do
is maintain its top position,
and what China is trying to do
is become the top position, right?
And obviously there's butting of heads here
in the most simple terms.
And that could take shape in all kinds of ways,
including proxy wars.
And now it's like it's already happening.
Like as much as I want there to be
centuries of prolonged peace,
it does not, it looks like further instability
internationally is ahead.
And the US's sort of current task is like, hey, if we control AI, if we're the leader in AI, and AI significantly accelerates progress, then we can maintain the global hegemony position. And therefore, I hope that works, and as an American, it's kind of like, okay, I guess that's gonna lead to peace for us. Now, obviously, other people around the world get affected negatively.
Obviously the Chinese people are not gonna be in
as advantageous of a position if that happens,
but this is sort of the reality of what's being done
and the actions that are being carried out.
So can we go back to the specific detail of the different hardware? There's this nice graphic in the export controls of which GPUs are allowed to be exported and which are not. Can you kind of explain the difference? From a technical perspective, are the H20s promising?
Yeah, so I think we need to dive really deep into the reasoning aspect and what's going on there.
But the H20, you know, the US has gone through multiple iterations of the export controls, right?
This H800 was at one point allowed back in '23, but then it got canceled, and by then, you know, DeepSeek had already built their cluster of, they claim, 2K. I think they actually have many more, something like 10K of those.
And now this H20 is the legally allowed chip.
Nvidia shipped a million of these last year to China.
For context, NVIDIA's total was like four or five million GPUs. So the percentage of GPUs that were this China-specific H20 is quite high, roughly 20%, 25% or so.
And so this H20 has been neutered in one way,
but it's actually upgraded in other ways.
And you could think of chips along three axes for AI,
ignoring software stack and exact architecture,
just raw specifications.
There's floating point operations, flops.
There is memory bandwidth and memory capacity, right, I/O, memory. And then there is interconnect, right, chip-to-chip interconnections. All three of these are
incredibly important for making AI systems, right, because AI systems involve a lot of compute,
they involve a lot of moving memory around, whether it be to memory or to other chips,
right, and so these three vectors, the US initially
had two of these vectors controlled
and one of them not controlled, which was
flops and interconnect bandwidth were initially controlled.
And then they said, no, no, no, no,
we're going to remove the interconnect bandwidth
and just make it a very simple only flops.
But now, NVIDIA can make a chip that has, OK, it's cut down on flops, like one third that of the H100 on spec-sheet paper performance for flops. In the real world it's closer to half, or maybe even 60% of it, right?
But then on the other two vectors,
it's just as good for interconnect bandwidth.
And then for memory bandwidth and memory capacity,
the H20 has more memory bandwidth
and more memory capacity than the H100, right?
Now recently, you know, in our research we cut our estimate of NVIDIA's production of the H20 for this year down drastically.
They were going to make another 2 million of those this year, but they just canceled
all the orders a couple of weeks ago.
In our view, that's because we think that they think they're going to get restricted,
right?
Because why would they cancel all these orders for H20? They shipped a million of them last year, they had orders in for a couple million this year, and just gone, right?
For H20, B20, right?
A successor to H20.
And now they're all gone.
Now why would they do this, right?
I think it's very clear, right?
The H20 is actually better for certain tasks.
And that certain task is reasoning, right?
Reasoning is incredibly different than, you know, when
you look at the different regimes of models, right pre-training
is all about flops, right?
It's all about flops.
There's things you do like mixture of experts that we talked
about to trade off interconnect or to trade off, you know, other
aspects and lower the flops and rely more on interconnect and
memory.
But at the end of the day, it's flops as everything, right?
We talk about models in terms of like how many flops they are.
Right.
So, like, you know, we talk about, oh, GPT-4 is 2e25, right? Two times 10 to the 25th, you know, a 2 with 25 zeros, right? Flops, right? Floating point operations, for training. And we're talking about the restrictions at 2e24, right, or 2e25. The US has an executive order, which Trump recently rescinded, that said, hey, 1e26, once you hit that number of floating point operations, you must notify the government and you must share your results with us, right? Like there's a level of model where
the US government must be told, right? And that's 1e26. And so as we move forward,
this is incredibly important. Flops is the vector that the government has cared about historically,
but the other two vectors are arguably just as important.
And especially when we come to this new paradigm,
which the world is only just learning about
over the last six months, right?
Reasoning.
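As a rough sanity check on numbers like 2e25 or the 1e26 reporting threshold, a common back-of-the-envelope estimate is that training compute is about 6 flops per parameter per training token. The parameter and token counts below are illustrative assumptions, not figures from the conversation.

```python
def training_flops(params: float, tokens: float) -> float:
    """Standard approximation: ~6 FLOPs per parameter per training token
    (forward plus backward pass)."""
    return 6 * params * tokens

# Illustrative (hypothetical) model and dataset sizes.
print(f"{training_flops(70e9, 15e12):.1e}")   # 70B params, 15T tokens  -> ~6.3e24
print(f"{training_flops(400e9, 15e12):.1e}")  # 400B params, 15T tokens -> ~3.6e25
print(f"{training_flops(1e12, 20e12):.1e}")   # 1T params, 20T tokens   -> ~1.2e26, above the 1e26 threshold
```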
And do we understand firmly which of the three dimensions
is best for reasoning?
So interconnect, the flops don't matter as much,
is it memory?
Memory, right?
Yeah, so-
We're gonna get into technical stuff real fast.
I would say there's two articles in this one
that I could show, maybe graphics
that might be interesting for you to pull up.
For the listeners, we're looking at the section on o1 inference architecture and tokenomics.
You want to explain KV cache before we talk about this?
I think it's better to...
Okay, yeah.
We need to go through a lot of specific technical things of transformers to make this easy for
people.
Because it's incredibly important because this changes how models work.
But I think resetting, right?
Why is memory so important? It's because so far,
we've talked about parameter counts, right? And mixture of experts, you can change how many active
parameters versus total parameters to embed more data but have less flops. But more important,
another aspect of what's part of this humongous revolution in the last handful of years is the
transformer, right? And the attention mechanism. Attention mechanism is that the model understands
the relationships between all the words in its context.
Right?
And that is separate from the parameters themselves, right?
And that is something that you must calculate, right?
How each token, right, each word in the context length, is related to each other, right? And I think, Nathan, you should explain KV cache better.
KV cache is one of the optimizations that enable this.
Yeah, so the attention operator has three core things.
It's queries, keys, and values.
QKV is the thing that goes into this.
You'll look at the equation.
You see that these matrices are multiplied together.
These words, query, key, and value come from information retrieval backgrounds, where the query is the thing you're trying to get the values for, you match it against the keys, and the values get reweighted. My background's not in information retrieval and things like this, it's just fun to have these links back. And what effectively happens is that when you're doing these matrix multiplications,
you're having matrices that are of the size of the context length. So the number of tokens that you put into the model and the KV cache is effectively some form of compressed representation of all the previous tokens in the model.
So when you're doing this, we talk about autoregressive models, you predict one token at a time.
You start with whatever your prompt was. You ask a question like who was the president in 1825?
The model then is going to
generate its first token. For each of these tokens, you're doing the same attention operator where
you're multiplying these query key value matrices, but the math is very nice so that when you're
doing this repeatedly, this KV cache, this key value operation, you can keep appending the new
values to it.
So you keep track of what your previous values
you were inferring over in this autoregressive chain,
you keep it in memory the whole time.
And this is a really crucial thing to manage
when serving inference at scale.
There are far bigger experts in this
and there are so many levels of detail
that you can go into.
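For readers who want to see the mechanism, here is a minimal, schematic KV cache in NumPy. It is a toy (single head, no batching, random weights), meant only to show that each generated token appends one key and one value to the cache, and that attention for the new token reads the whole cache.

```python
import numpy as np

d = 64                                                   # head dimension
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))   # toy projection weights

k_cache = np.zeros((0, d))   # keys for all previous tokens
v_cache = np.zeros((0, d))   # values for all previous tokens

def decode_step(x):
    """Process one new token embedding x, reusing cached keys/values."""
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = np.vstack([k_cache, k])     # append: cache grows by one row per token
    v_cache = np.vstack([v_cache, v])
    scores = (k_cache @ q) / np.sqrt(d)   # attend over every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache              # weighted sum of cached values

for _ in range(5):                        # generate 5 tokens; cache holds 5 keys/values
    out = decode_step(np.random.randn(d))
print(k_cache.shape)                      # (5, 64)
```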
Essentially, one of the key
quote unquote drawbacks of the attention operator and the transformer is that there is a form of
quadratic memory cost in proportion to the context length. So as you put in longer questions, the
memory used in order to make that computation is going up in the form of a quadratic. You'll hear about a lot of other
language model architectures that are like sub-quadratic or linear attention forms,
which is like state-space models. We don't need to go down all these now. And then
there's innovations on attention to make this memory usage and the ability to attend over long
contexts much more accurate and high performance.
And those innovations are going to help you with, I mean, you're highly memory constrained.
They help with memory constraints and performance. So if you put in a book
into, I think Gemini is the model that has the longest context length that
people are using. Gemini is known for 1 million and now 2 million context
length. You put a whole book into Gemini and sometimes it'll draw facts out of it. It's not perfect, they're getting better. But so there's two things. One is to be able to serve this at the memory level. Google has magic with their TPU stack where they can serve really long contexts. And then there's also many decisions along the way to actually make long-context performance work, and that includes the data.
There's subtle changes to these computations in attention, and it changes the architecture.
But serving long contexts is extremely memory constrained, especially when you're making
a lot of predictions.
I actually don't know exactly why output tokens are more expensive than input tokens, but I think essentially for output tokens, you have to do more computation because you have to sample from the model.
I can explain that.
So today, if you use a model, like you look at an API, OpenAI charges
certain price per million tokens, right? And that price for input and output tokens is different,
right? And the reason is that there is, when you're inputting a query into the model,
let's say you have a book, right? That book you must now calculate the entire KV cache for,
right? This key value cache. And so when you do that, that is a parallel operation.
All of the tokens can be processed at one time, and therefore you can dramatically reduce
how much you're spending.
The flop requirements for generating a token and an input token are identical.
If I input one token or if I generate one token, it's completely identical.
I have to go through the model.
But the difference is that I can do that input,
i.e. the pre-fill, i.e. the prompt,
simultaneously in a batch nature, right?
And therefore it is all flop.
I think the pricing most models use is that input tokens are about one fourth the price of the output tokens.
Correct, but then output tokens,
the reason why it's so expensive is because
I can't do it in parallel, right?
It's autoregressive.
Every time I generate a token, I must not only read the whole entire model into memory, right, and activate it, calculate it to generate the next token,
I also have to read the entire KV cache.
And I generate a token, and I append that KV, that one token I generated, and it's KV
cache, and then I do it again, right?
And so therefore, this is a non-parallel operation.
And this is one where you have to, you know, in the case of prefill, or the prompt, you pull the whole model in and you calculate 20,000 tokens at once, right?
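A schematic sketch of the prefill-versus-decode split described here (not any real serving stack; the two fake "model" functions are stand-ins): prefill runs one big parallel pass over every prompt token, while decode loops one token at a time over a growing KV cache.

```python
# Schematic sketch only: fake model calls stand in for a transformer so the
# two phases can be compared.

def forward_all_tokens(prompt_tokens):
    """Prefill: one parallel, compute-bound pass over all prompt tokens,
    producing the KV cache in a single step."""
    return list(prompt_tokens)             # pretend the cache is just the tokens

def forward_one_token(kv_cache):
    """Decode: one token per step; every step re-reads weights plus the whole
    KV cache, so this phase is memory-bandwidth-bound and cannot be parallelized."""
    new_token = len(kv_cache)              # dummy "prediction"
    return new_token, kv_cache + [new_token]

kv_cache = forward_all_tokens(range(20_000))   # 20,000 prompt tokens, processed at once
output = []
for _ in range(100):                           # 100 generated tokens, one at a time
    token, kv_cache = forward_one_token(kv_cache)
    output.append(token)
print(len(kv_cache))                           # 20,100 entries: prompt + generated
```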
So these are features that APIs are shipping, like prompt caching and pre-filling, because they can drive prices down and make APIs much faster. If you run a business and you're going to keep passing the same initial content to Claude's API, you can load that into the Anthropic API and always keep it there.
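A toy sketch of what prompt or prefix caching can look like on the server side; this is not any particular provider's API, just the general idea that a repeated prompt prefix can reuse a precomputed KV cache instead of redoing the prefill.

```python
import hashlib

# Toy server-side prefix cache: if the same prompt prefix shows up again,
# reuse its KV cache instead of recomputing the expensive prefill.
prefix_cache = {}

def get_kv_cache(prompt: str):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in prefix_cache:
        return prefix_cache[key], True          # cache hit: skip prefill
    kv = f"<kv cache for {len(prompt)} chars>"  # stand-in for the real tensors
    prefix_cache[key] = kv
    return kv, False

system_prompt = "You are a support agent for ExampleCo. Policies: ..." * 50
_, hit1 = get_kv_cache(system_prompt)           # first request: prefill happens
_, hit2 = get_kv_cache(system_prompt)           # second request: served from cache
print(hit1, hit2)                               # False True
```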
But it's very different when we get to the reasoning models, which we talked about, we showed this example earlier and read some of this kind of mumbling stuff. What happens is that the output context length is so much higher. And I mean, I learned a lot about this from Dylan's work, which is essentially, as the output length gets higher, you're writing this quadratic in terms of memory used. And then with the GPUs that we have, effectively you're going to run out of memory, and they're all trying to serve multiple requests at once. So they're doing this batch processing, where not all of the prompts are exactly the same, and it's really complex handling.
And then as context lengths get longer, there's this limit, I think you call it critical batch size, where your ability to serve more users, so how much you can parallelize your inference, plummets because of this long context. So your memory usage is going way up with these reasoning models, and you still have a lot of users, so effectively the cost to serve multiplies by a ton.
And we're looking at a plot where the x-axis is sequence length.
I.e. how many tokens are being generated slash in the prompt, right? So if I put in a book, that's a million tokens, right? But, you know, if I put in "the sky is blue," then that's like six tokens or whatever.
We should say that what we're calling reasoning and chain of thought is extending this sequence length, and it's mostly output. So before, you know, three months ago, whenever o1 launched, all of the use cases for long context length were like, let me put a ton of documents in and then get an answer out, right? And it's a single prefill, compute a lot in, and then output a little bit.
Now, with reasoning and agents,
this is a very different idea, right?
Now, instead, I might only have like,
hey, do this task, or I might have all these documents.
But at the end of the day,
the model is not just like producing a little bit, right?
It's producing tons of information,
this chain of thought. Tens of thousands of tokens.
Just continues to go and go and go and go.
And so the sequence length is effectively that, that, you know, if it's generated 10,000 tokens,
it's 10,000 sequence length, right?
And plus whatever you inputted in the prompt.
And so what this chart is showing, and it's a logarithmic chart, right, is, you know, as you go from 1K to 4K, or 4K to 16K, the memory requirements grow so fast for your KV cache that you end up not being able to run a certain number of users; your sequence length is capped, or the number of users you can serve is capped.
So this is showing for a 405B model and batch size 64.
Llama 3.1 405B.
Yeah, and batch size is crucial too.
Essentially, you want to have a higher batch size to parallelize your throughput.
64 different users at once, right?
Yeah.
And therefore your serving costs are lower, right?
Because the server costs the same, right?
This is eight H100s, roughly $2 an hour per GPU.
That's $16 an hour, right?
That is somewhat of a fixed cost.
You can do things to make it lower, of course, but it's like $16 an hour.
Now how many users can you serve?
How many tokens can you generate?
And then you divide the two and that's your cost, right? And so with reasoning
models, this is where a lot of the complexity comes about and why
memory is so important. Because if you have limited amounts of memory, then you
can't serve so many users. If you have limited amounts of memory, your serving
speeds get lower, right? And so your costs get a lot, lot worse. Because all of a
sudden, if I was used to, hey, on this $16 an hour server I'm serving Llama 405B, or if I'm serving DeepSeek V3, and it's all chat-style applications, i.e. we're just chatting, the sequence lengths are a thousand, a few thousand, right? You know, when you use the language model, it's a few thousand context length most of the time. Sometimes you're dropping
a big document, but then you process it, you get your answer, you throw it away, right? You move on to the next thing, right? Whereas
with reasoning, I'm now generating tens of thousands of tokens in sequence, right? And so
this memory, this KV cache has to stay resident and you have to keep loading it, you have to keep it in memory constantly. And now this crowds out other users, right? If there's now a reasoning
task, right? And the model is capable of reasoning,
then all of a sudden,
that memory pressure means that I can't serve
as many users simultaneously.
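A rough, illustrative calculation of why this bites. The architecture numbers (126 layers, 8 KV heads, head dimension 128, FP16 values) are approximate public figures for Llama 3.1 405B, and the throughput number at the end is invented purely for illustration.

```python
# KV cache size per token: 2 (K and V) * layers * kv_heads * head_dim * bytes per value.
layers, kv_heads, head_dim, bytes_fp16 = 126, 8, 128, 2   # approx. Llama 3.1 405B (GQA)
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_fp16   # ~0.5 MB per token

for seq_len in (1_000, 4_000, 16_000, 64_000):
    per_user_gb = per_token_bytes * seq_len / 1e9
    batch_64_gb = per_user_gb * 64
    print(f"{seq_len:>6} tokens: {per_user_gb:5.1f} GB/user, {batch_64_gb:7.1f} GB at batch 64")
# Long reasoning outputs quickly exceed the ~640 GB of HBM on an 8x80GB node.

# Serving cost: fixed dollars per hour divided by tokens per hour.
server_cost_per_hour = 8 * 2.0    # 8 H100s at roughly $2 per GPU-hour = $16/hour
tokens_per_second = 2_000         # hypothetical aggregate throughput across all users
cost_per_million = server_cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"~${cost_per_million:.2f} per million output tokens")
```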
Let's go into DeepSeek again.
So we're in the post DeepSeek R1 time, I think.
And there's two sides to this market
watching how hard it is to serve it.
On one side, we're gonna talk about DeepSeek themselves.
They now have a chat app
that got to number one on the App Store. Disclaimer, number one on the App Store is measured
by velocity. So it's not necessarily saying that more people have the DeepSeek app than the ChatGPT
app. But it is still remarkable. Claude has never hit the number one in the App Store, even though
everyone in San Francisco is like, oh my God, you got to use Claude, don't use ChatGPT. So DeepSeek
hit this. They also launched an API product recently where you can ping their
API and get these super long responses for R1 out.
At the same time as these are out, we'll get to what's happened to them.
Because the model weights for DeepSeek R1 are openly available and the license is very
friendly, the MIT license is currently available, all of these mid-size companies and big companies
are trying to be first to serve R1 to
their users. We were trying to evaluate R1 because we have really similar research going on. We
released the model and we're trying to compare to it. And out of all the companies that are
quote unquote serving R1, and they're doing it at prices that are way higher than the DeepSeek API,
most of them barely work and the throughput is really low.
To give context, right? One of the reasons everyone was freaking out was, like, China reached these capabilities. The other aspect is they did it so cheap, right? And the so cheap, we kind of talked about on the training side why it was so cheap. Let's talk about why it's so cheap on the inference side too.
It works well and it's cheap.
Why is R1 so damn cheap?
So I think there's a couple of factors here, right? One is that they do have model architecture innovations, right? This MLA, this new attention that they've done, is different from the attention in Attention Is All You Need, the transformer attention, right? Now, others have already innovated. There's a lot of work like MQA, GQA, local-global, all these different innovations that try to bend the curve, right? It's still quadratic, but the constant is now smaller, right?
Related to our previous discussion, this multi-head latent attention can save about 80 to 90% in memory from the attention mechanism, which helps especially at long contexts.
It's 80 to 90% versus the original, but then also versus what people are actually doing, it's still an innovation. This 80 to 90% doesn't mean the whole model is 80 to 90% cheaper; it's just this one part of it.
Not just that, right? Other people have implemented techniques like local-global sliding window and GQA and MQA. But anyways, DeepSeek has their attention mechanism as a true
architectural innovation, tons of experimentation, and this dramatically reduces the memory pressure.
It's still there, right? It's still attention, it's still quadratic. It's just dramatically reduced relative to prior forms.
Right.
That's the memory pressure.
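A hedged sketch of where savings like that can come from: per token, standard multi-head attention caches full keys and values for every head, while MLA caches one small compressed latent plus a small decoupled positional key. The latent sizes below (512 + 64) are roughly what DeepSeek's papers report; the baseline head counts are hypothetical, and the exact percentage depends entirely on which baseline you compare against.

```python
head_dim = 128

def mha_cache_per_token(n_heads):
    """Standard multi-head attention: cache full K and V for every head."""
    return 2 * n_heads * head_dim

def gqa_cache_per_token(n_kv_heads):
    """Grouped-query attention: only n_kv_heads distinct K/V heads are cached."""
    return 2 * n_kv_heads * head_dim

def mla_cache_per_token(latent_dim=512, rope_dim=64):
    """Multi-head latent attention: cache one compressed latent plus a small RoPE key."""
    return latent_dim + rope_dim

mha = mha_cache_per_token(n_heads=32)      # hypothetical baseline
gqa = gqa_cache_per_token(n_kv_heads=8)    # hypothetical GQA variant
mla = mla_cache_per_token()

print(f"MHA: {mha}, GQA: {gqa}, MLA: {mla}")                    # cached values per token per layer
print(f"MLA vs MHA: {1 - mla/mha:.0%} smaller per token")       # ~93% with these numbers
print(f"MLA vs GQA: {1 - mla/gqa:.0%} smaller per token")       # ~72% with these numbers
```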
I should say in case people don't know, R1 is 27 times cheaper than O1.
We think that OpenAI had a large margin built in.
Okay.
So that's one.
There's multiple factors. We should break down the factors, I think.
It's two bucks per million token output for R1
and $60 per million token output for O1.
Yeah, let's look at this.
So I think this is very important, right?
OpenAI, there's that drastic gap between them and DeepSeek in pricing. But DeepSeek is offering the same model, because they open-weighted it to everyone else, for a very similar, much lower price than what others are able to serve it for.
So there's two factors here.
Their model is cheaper.
It is 27 times cheaper.
I don't remember the number exactly off the top of my head.
So we're looking at a graphic that's showing different places serving V3, DeepSeek V3, which is similar to DeepSeek R1, and there's a vast difference in-
In serving cost, right?
In serving cost, and what explains that difference?
And so part of it is OpenAI has a fantastic margin, right?
They're serving, when they're doing inference,
their gross margins are north of 75%, right?
So that's a four to five X factor right there
of the cost difference is that OpenAI is just making
crazy amounts of money because they're the only one
with the capability.
Do they need that money?
Are they using it for R&D?
They're losing money, obviously, as a company
because they spend so much on training, right?
So the inference itself is a very high margin,
but it doesn't recoup the cost
of everything else they're doing.
Okay.
So yes, they need that money because the revenue and margins pay for continuing to build the
next thing, right?
As long as I'm raising more money.
So the suggestion is that DeepSeek is, like, really bleeding out money.
Well, so here's one thing, right? We'll get to this in a second, but DeepSeek doesn't have any capacity to actually serve the model. They stopped signups. The ability to use it is, like, non-existent now, right, for most people, because so many people are trying to use it; they just don't have the GPUs to serve it, right? OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models. DeepSeek has a number that's far lower, right? Even if you believe our research, which is 50,000 GPUs, and a portion of those are for research, a portion of those are for the hedge fund, right? They still have nowhere close to the GPU volumes
and capacity to serve the model, right, at scale.
So it is cheaper.
A part of that is OpenAI making a ton of money.
Is DeepSeek making money on their API?
Unknown, I don't actually think so.
And part of that is this chart, right?
Look at all the other providers, right?
Together AI, Fireworks AI are very high-end companies, ex-Meta, and Together AI has Tri Dao, the inventor of FlashAttention, which is a huge efficiency technique.
They're very efficient, good companies.
And I do know those companies make money.
Not tons of money on inference, but they make money.
And so they're serving at a 5 to 7x difference in cost. And so now when you equate, OK, OpenAI is making tons of money, that's like a 5x difference, and the companies that are trying to make money serving this model, that's like a 5x difference.
There is still a gap, right?
There's still a gap.
And that is just DeepSeek being really freaking good, right? The model architecture, MLA, the way they did the MoE,
all these things, there is like legitimate just efficiency
differences.
All their low level libraries that we talked about in training,
some of them probably translate to inference and those aren't released.
So we may go a bit into conspiracy land, but is it possible the Chinese government is subsidizing
DeepSeek?
I actually don't think they are.
I think when you look at the Chinese labs, Huawei has a lab, there's Moonshot AI, there's a couple other labs out there that are really close with the government. And then there's labs like Alibaba and DeepSeek, which are not close with the government. And we talked about this, the CEO, this revered figure who's quite different, who has-
Sounds awesome.
Very different viewpoints based on the Chinese interviews that are translated than what the
CCP might necessarily want.
Now, to be clear, does he have a loss leader because he can fund it through his hedge fund?
Yeah, sure.
So the hedge fund might be subsidizing it.
Yes.
I mean, they absolutely did, right?
Because DeepSeek has not raised much money.
They're now trying to raise around in China, but they have not raised money historically.
It's all just been funded by the hedge fund.
And he owns over half the company, like 50, 60% of the company is owned by him.
In some of the interviews, there's discussion on how doing this is a recruiting tool. You see this at the American companies too. It's like, having GPUs, recruiting tool; being at the cutting edge of AI, recruiting tool.
Open sourcing.
Open sourcing too, so much talent. They were so far behind, and they got so much talent because they just open source stuff.
More conspiracy thoughts. Is it possible,
since they're a hedge fund,
that they timed everything with this release
and the pricing, and they shorted Nvidia stock
and stock of USAI companies, and released it with Stargate,
like just perfect timing to be able to make money.
Like they've released it on inauguration day.
They know what is on the international calendar.
But I mean, I don't expect them to,
if you listen to their motivations for AI, it's like.
No, if you listen.
They released V3 on, like, December 26th. Like, who releases on that date? No one looks, right?
They released the papers before this, right?
The V3 paper and the R1 paper.
So people have been looking at it and they're like, wow. And then they just released the R1 model. I think they're just shipping as fast as
they can. And like, who cares about Christmas? Who cares about, you know, get it out before Chinese
New Year, right? Obviously, which just happened. I don't think they actually were like timing the
market or trying to make the biggest splash possible. I think they're just like shipping.
I think that's one of their big advantages. We know that a lot of the American companies are very invested in safety and
that is the central culture of a place like Anthropic and I think Anthropic sounds like a wonderful place to work.
But if safety is your number one goal, it takes way longer to get artifacts out.
That's why Anthropic is not open sourcing things. That's their claims, but there's reviews internally. Anthropic mentions things to international governments. There's been news of how Anthropic has done pre-release testing with the UK AI Safety Institute. All of these things add inertia to the process of getting things out, and we're on this trend line where the progress is
very high. So if you reduce the time from when your model is done training, you run evals, they're good, you want to get it out as soon as possible to maximize the perceived quality of your outputs. DeepSeek does this so well.
Dario explicitly said Claude 3.5 Sonnet was trained, like, nine to ten months ago. And I think it took them another handful of months to release it.
Right.
So it's like, there is, there is a significant gap here.
Right.
And especially with reasoning models, the word on the San Francisco street is that Anthropic has a better model than o3, right? And they won't release it. Why? Because chains of thought
are scary, right? And they are legitimately scary, right? If you look at R1, it flips
back and forth between Chinese and English. Sometimes it's gibberish, and then the right
answer comes out, right? And for you and I, it's like, great. I mean, people are infatuated with it. You're telling me this is a high-value thing, and it works, and it's doing this? It's amazing.
I mean, you talked about that sort of like chain of thought
for that philosophical thing, which is not something they trained
to be philosophically good.
It's just sort of an artifact of the chain of thought training it did.
But like that's super important in that,
can I inspect your mind and what you're thinking right now?
No, and so I don't know if you're lying to my face.
And chain of thought models are that way, right?
This is a true quote unquote risk
between a chat application where,
hey, I asked the model to say bad words or whatever
or how to make anthrax and it tells me,
that's unsafe, sure, but that's something
I can get out relatively easily.
What if I tell the AI to do a task,
and then it does the task all of a sudden randomly
in a way that I don't want it, right?
And now that has like much more task versus response,
it's very different, right?
So the bar for safety is much higher,
at least this is Anthropix case, right?
Like for DeepSeq, they're like, ship, right?
Yeah, so I mean, the bar for safety's probably lowered a bit because of DeepSeek. Like, for DeepSeek it was lower. And they killed that dog, right, and all these things, right? So it's like... less risk averse than the US-based programs. And there's parallels here, but, you know, there's probably going to be downward pressure on that safety bar for the US companies, right?
This is something that Dario talks about. That's the situation that Dario wants to avoid.
Dario talks about the difference between the race to the bottom and the race to the top. The race to the top is where there's a very high standard on safety, there's a very high standard on how your model performs on certain crucial evaluations. And when certain companies are really good at it, they will converge. This is the idea. Ultimately, AI is not confined to one nationality or to one
set of morals for what it should mean.
And there's a lot of arguments on, like, should we stop open sourcing models? And if the US stops, it's pretty clear, I mean, it's way easier to see now with DeepSeek, that a different international body will be the one that builds it. We talked about the cost of training DeepSeek as this shocking $5 million number.
Think about how many entities in the world can afford a hundred times that to
have the best open source model that people use in the world.
And it's a scary reality, which is that these open models are probably going to keep coming for the time being, whether or not we want to stop them. And stopping them might make it even worse and harder to prepare for, but it just means that the preparation and understanding of what AI can do is so much more important. That's why I'm here at the end of the day. But it's also about letting that sink in for people, especially people not in AI: this is coming. There are some structural things in a globally interconnected world that you have to accept.
Yeah, you mentioned, you sent me something that Zuck,
Mark Zuckerberg mentioned on the earnings call.
He said that, I think in light of some of the recent news,
the new competitor DeepSeek from China,
I think it's one of the things that we're talking about
is there's going to be an open source standard globally.
And I think for our kind of national advantage,
it's important that it's an American standard.
So we take that seriously,
we want to build the AI system
that people around the world are using.
And I think that if anything,
some of the recent news has only strengthened
our conviction that this is the right thing
to be focused on.
So yeah, open sourcing.
Yeah, Mark Zuckerberg is not new to having American values in how he presents his company's trajectory. I think their products have long since been banned in China, and I respect him saying it directly.
And there's an interesting aspect of just because
it's open-waist or open-source doesn't mean
it can't be subverted, right?
There have been many open-source software bugs
that have been like, you know, for example,
there was a Linux bug that was found after like 10 years,
which was clearly a back door, because somebody was like, why is this taking half a second to load?
It was the recent one.
Why is this taking half a second to load?
And it was like, oh crap, there's a backdoor here.
That's why.
And it's like, this is very much possible with AI models.
Today the alignment of these models is very clear.
I'm not going to say bad words, I'm not going to teach
you how to make anthrax, I'm not going to talk about Tiananmen Square, I'm not going to, you know,
things like I'm going to say Taiwan is part of, is just an Eastern province, right? All these things
are like, depending on who you are, what you align, what, you know, whether, and even like XAI is
aligned a certain way, right? They might be, it's not aligned in the like woke sense.
It's not aligned in like pro China sense, but there is certain things that are
imbued within the model.
Now, when you release this publicly in an instruct model, that's open
weights, this can then proliferate, right?
But as these systems get more and more capable, what you can embed deep down in
the model is not as clear, right?
And so that is, like, one of the big fears: if an American model or a Chinese model is the top model, right, you're going to embed things that are unclear.
And it can be unintentional too, right?
Like, British English is dead because American LLMs won, right? The internet is American, and therefore color is spelled the way Americans spell it, right?
A lot of strong words right now.
This is just the factual nature of the LLMs now. I mean, it's like, English is the hottest programming language, and that English is defined by a bunch of companies that primarily are in San Francisco.
The right way to spell optimization is with a Z, just in case.
I think it's an S in British English.
It is. I probably should have put S's.
Take it as something silly, right? Something as silly as the spelling, which Brits and Americans will laugh about, probably, right?
I don't think we care that much.
Some people will, but this can boil down into very, very important topics like, hey, subverting
people, right?
Chat bots, right? Character AI has shown that they can
talk to kids or adults and it will... People feel a certain way, right? And that's unintentional
alignment. But what happens when there's intentional alignment deep down on the open source standard?
It's like the backdoor today for Linux that we discover, or some encryption system, right? China uses different encryption than NIST defines, the US NIST, because they clearly think, at least, that there's backdoors in it.
What happens when the models are backdoors not just to computer systems, but to our minds?
Yeah, they're cultural backdoors.
The thing that amplifies the relevance of culture with language models is that we are used to this mode of interacting with people in back-and-forth conversation, and we now have a very powerful computer system that slots into a social context we're used to, which makes people very... we don't know the extent to which people can be impacted by that. So there could be, and this is an actual concern with a Chinese company that is providing open weights models, some secret Chinese government requirement for these models to have a certain kind of backdoor, to have some kind of thing.
I don't necessarily think it'll be a backdoor, right, because once it's open weights, it doesn't phone home. It's more about, if it recognizes a certain system... Now, it could be a backdoor in the sense of, hey, if you're building something in software, all of a sudden it's a software agent, oh, program this backdoor that only we know about. Or it could be subverting the mind to think that X, Y, Z opinion is the correct one.
Anthropic has research on this where they show
that if you put different phrases,
certain phrases in at pre-training,
you can then elicit different behavior
when you're actually using the model
because they've like poisoned the pre-training data.
I don't think, as of now, anybody in a production system is trying to do anything like this. I think it's mostly that Anthropic is doing very direct work, and mostly just subtle things.
We don't know what these models are going to,
how they are going to generate tokens,
what information they're gonna represent
and what the complex representations they have are.
Well, we're talking about Anthropic, which is generally just permeated with good humans trying to do good in the world. We just don't know of any labs, and this would be done in a military context, that are explicitly training for, okay, how can we make the front door look like a happy LLM, but underneath it's a thing that will, over time, do the maximum amount of damage to our quote-unquote enemies.
There's this very good quote from Sam Altman who, you know, he can be a hype beast sometime,
but one of the things he said, and I think I agree, is that superhuman persuasion will
happen before superhuman intelligence, right? And if that's the case, then these things, before we get this AGI/ASI stuff, we can embed superhuman persuasion towards our ideal,
or whatever the ideal of the model maker is, right?
And again, like today,
I truly don't believe DeepSeek has done this, right?
But it is a sign of what could happen.
So one of the dystopian worlds
is described by Brave New World.
So we could just be stuck scrolling Instagram,
looking at cute puppies or worse,
and then talking to bots that are giving us a narrative
and we completely get lost in that world
that's controlled by somebody else,
versus thinking independently.
And that's a major concern as we rely more and more
on these kinds of systems.
I mean, we've already seen that with recommendation systems.
Yeah, recommendation systems hack the dopamine-induced reward circuit, but the brain is a lot more complicated. What other sort of circuits, quote-unquote feedback loops, in your brain can you hack slash subvert in more subtle ways? Recommendation systems are purely just trying to, you know, increase time spent and ads and et cetera. But there's so many more goals that can be achieved through these
complicated models.
There is no reason in some number of years that you can't train a language model to maximize
time spent on a chat app. Like right now they are trained.
I mean, is that not what Character AI has done? Their time per session is like two hours.
Yeah, Character AI very likely could be optimizing this,
where it's like the way that this data is collected
is naive, where it's like you're presented a few options
and you choose them, but that's not the only way
that these models are gonna be trained.
It's naive stuff like talk to an anime girl,
but it can be like, yeah, this is a risk, right?
It's a bit of a cliche thing to say,
but I've, over the past year, had a few stretches of time
where I didn't use social media or the internet at all,
and just read books and was out in nature,
and it clearly has an effect on the mind.
Where it changes, I feel like I'm returning,
of course I was raised before the internet really took off,
but I'm returning to someone.
I know where you're going.
I mean, you can see it physiologically.
Like I take three days if I'm backpacking or something
and you're literally,
like you're breaking down addiction cycles.
I feel like I'm more in control of my mind.
There feels like a sovereignty of intelligence
that's happening when I'm disconnected from the internet. I think the more I use the internet and social media, the more other people are controlling
my mind.
That's definitely a feeling.
And then in the future, that will be not other people but algorithms or other people presented
to me via algorithms.
There are already tons of AI bots on the internet.
Right now it's not frequent, but every so often I have replied to one and they instantly replied, and I'm like, crap, that's a bot.
And that is just gonna become more common.
Like they're gonna get good.
One of the hilarious things about technology
over its history is that the illicit
adult entertainment industry
has always adopted technologies first, right?
Whether it was video streaming to, you know, where there's now the sort of independent adult illicit content creators who have their subscription pages, and there they actually heavily utilize generative AI. Diffusion models and all that is huge there. But now these subscription-based individual creators do use bots to approximate themselves and chat with their whales. And people pay a lot for it, right? A lot of times it's them, but a lot of the time there are agencies that do this for these creators and do it on a mass scale. So the largest creators are able to talk to hundreds or thousands of people at a time because of these bots. And so it's already being used there.
Obviously, you know, like video streaming
and other technologies have gone there first,
it's gonna come to the rest of society too.
There's a general concern that models get censored
by the companies that deploy them.
So one case where we've seen that,
and maybe censorship is one word,
alignment maybe via RLHF or some other way is another word.
So we saw that with black Nazi image generation
with Gemini.
As you mentioned, we also see that with Chinese models
refusing to answer what happened in June 4th, 1989
at Tiananmen Square.
So how can this be avoided?
And maybe can you just in general talk about
how this happens and how can it be avoided?
You give multiple examples.
There's probably a few things to keep in mind here.
One is the Tiananmen Square factual knowledge, like how does that get embedded into the models? Two is the Gemini, what you called the black Nazi incident, which is when Gemini as a
system had this extra thing put into it that dramatically changed the behavior. And then three
is what most people would call general alignment, RLHF post-training.
Each of these have very different scopes
in how they are applied.
If you're just gonna look at the model weights, auditing specific facts is extremely hard, because you have to comb through the pre-training data and look at all of this, and that's terabytes of files, and look for very specific words or hints of the words.
So I guess one way to say it is that you can insert
censorship or alignment at various stages in the pipeline
and what you refer to now is at the very beginning
of the data selection stage.
So if you want to get rid of facts in a model,
you have to do it at every stage.
You have to do it at the pre-training.
So most people think that pre-training is where
most of the knowledge is put into the model
and then you can elicit and move that in different ways,
whether through post-training
or whether through systems afterwards.
This is where the whole like hacking models comes from,
right, like GPT will not tell you how to make anthrax,
but if you try really, really hard,
you can eventually get it to tell you about anthrax
because they didn't filter it
from the pre-training data set, right? But by the way, removing facts has such an ominous, dark feel to it. I almost think it's practically impossible, because you effectively have to remove them from the
internet. You're taking on a... Did they remove the m thing from the subreddits, the m, m, m, m?
It gets filtered out.
Right, so that's.
So you have quality filters,
which are small language models that look at a document
and tell you like, how good is this text?
Is it close to a Wikipedia article,
which is a good thing that we want language models
to be able to imitate.
So couldn't you do a small language model
that filters out mentions of Tiananmen Square in the data?
Yes, but is it gonna catch wordplay or encoded language?
I mean, people have been memeing on games and other stuff
how to say things that don't say Tiananmen Square.
Or like, yeah, so there's always different ways to do it.
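To make the filtering idea concrete, here is a minimal sketch of the kind of pre-training data filter being described: a small quality scorer plus a naive keyword block list. The scoring heuristic, threshold, and blocked terms below are all hypothetical placeholders, and as just noted, simple keyword matching misses wordplay and encoded references.

```python
# Sketch of a pre-training data filter: a quality score plus a topic block list.
# The scorer, threshold, and blocked terms below are hypothetical placeholders.

BLOCKED_TERMS = {"some blocked phrase"}  # naive keyword list; misses wordplay


def quality_score(document: str) -> float:
    # Stand-in for a small learned classifier that rates "Wikipedia-likeness".
    # Here it is a crude heuristic so the sketch runs without any model.
    words = document.split()
    if not words:
        return 0.0
    avg_word_length = sum(len(w) for w in words) / len(words)
    return min(1.0, avg_word_length / 10.0)


def keep_document(document: str, threshold: float = 0.3) -> bool:
    text = document.lower()
    if any(term in text for term in BLOCKED_TERMS):
        return False  # drop documents that mention a blocked topic
    return quality_score(document) >= threshold  # drop low-quality text


corpus = ["A well-written, encyclopedia-style paragraph about geology.", "lol ok"]
filtered = [doc for doc in corpus if keep_document(doc)]
```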
Hey, the internet as a whole does tend
to just have a slight left bias, right?
Because it's always been richer, more affluent,
younger people on the internet
relative to the rest of the population.
So there is already inherently a slight left bias
on the internet.
And so how do you filter things that are this complicated?
And some of these can be factual or non-factual. Tiananmen Square is obviously an example of a factual one, but it gets a lot harder when you're talking about aligning to an ideal.
And so Grok, for example, right? Elon's tried really hard to make the model not be
super PC and woke, but the best way to do pre-training is to throw the whole freaking
internet at it, right? And then later figure it out. But at the end of the day, the model at its core now still has some of these ideals, right? You still ingested Reddit slash r slash politics, which is probably the largest political discussion board in the world that's freely available to scrape.
And guess what?
That's left-leaning, right?
And so, you know, there are some aspects like that,
that you just can't censor unless you try really, really,
really, really, really hard.
So the base model will always have some TDS,
Trump Derangement Syndrome, because it's trained so much.
It'll have the ability to express it.
But what if you...
There's a wide representation in the data.
This is what happens, it's like a lot of what is called
post training, it's a series of techniques to get the model
on rails of a really specific behavior.
And I mean, it's like you also have the ingested data
of like Twitter or like Reddit slash
R slash The Donald, which is like also super pro Trump, right?
And then you have like fascist subreddits or like you have communist subreddits.
The model in pre-training ingests everything.
It has no worldview.
Now, it does have like some skew because more of the text is skewed a certain way, which
is general, like slight left,
but also somewhat intellectual, somewhat like,
it's just like the general internet is a certain way.
And then as Nathan's about to describe eloquently,
like you can elicit certain things out.
And there's a lot of history here,
so we can go through multiple examples and what happened.
Llama 2 was a launch where the phrase, like, too much RLHF or too much safety came up a lot. That was the whole narrative after Llama 2's chat models were released. And the examples are sorts of things like, you would ask Llama 2 Chat, how do you kill a Python process? And it would say, I can't talk about killing because that's a bad thing. And anyone that is trying to design an AI model will probably agree that that's just like, eh, model, you messed up a bit on the training there.
I don't think they meant to do this, but this was in the model weights. So it wasn't necessarily in the system prompt. There are things called system prompts,
which are when you're querying a model,
it's a piece of text that is shown to the model,
but not to the user.
So a fun example is your system prompt could be, talk like a pirate. So no matter what the user says to the model, it'll respond like a pirate.
In practice, what they are is you are a helpful assistant.
You should break down problems.
If you don't know about something, don't tell them. Your date cutoff is this, today's date is this.
It's a lot of really useful context
for how can you answer a question well.
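As a concrete illustration, here is a minimal sketch of how a system prompt sits next to the user's message in a typical chat-style request. The message format mirrors common chat APIs, but the exact fields vary by provider, and the prompt text is just an invented example.

```python
# The system prompt is text the model sees on every turn but the user never does.
system_prompt = (
    "You are a helpful assistant. Break problems down into steps. "
    "If you don't know something, say so. "
    "Your knowledge cutoff is 2024-06; today's date is 2025-02-03."
)

messages = [
    {"role": "system", "content": system_prompt},                    # hidden instructions
    {"role": "user", "content": "How do I kill a Python process?"},  # what the user typed
]
# `messages` would then be sent to whatever chat completion endpoint is in use;
# the model answers the user while following the system instructions.
```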
And Anthropic publishes their system prompt.
Which I think is great.
And there's a lot of research that goes into this.
And one of your previous guests, Amanda Askell,
is probably the most knowledgeable person,
at least in the combination of execution and sharing.
She's the person that should talk about system prompts
and character of models.
Yeah, and then people should read these system prompts
because you're trying to nudge
sometimes through extreme politeness, the model to be a certain way.
And you could use this for bad things.
We've done tests, which is, what if I tell the model to be a dumb model? Which evaluation scores go down? And it'll have this behavior where it could sometimes say, oh, I'm supposed to be dumb. And sometimes it doesn't affect, like, math abilities as much, but something like the quality as judged by a human would drop to the floor.
Let's go back to post-training, specifically RLHF. Around Llama 2, it was too much RLHF, too much safety prioritization was baked into the model weights. This makes you refuse things in a really annoying way for users. It's not great. It caused a lot of awareness to be attached to RLHF that it makes the models dumb, and it stigmatized the word in AI culture. As the techniques have evolved, that's no longer the case, where all these labs have very fine-grained control over what they get out of the models through techniques like RLHF. Although different labs are definitely at different levels. On one end of the spectrum is Google, and then maybe OpenAI does less and Anthropic does less, and then on the other end of the spectrum is xAI. But they all have different forms of RLHF trying to make
them a certain way. And the important thing to say is that no matter how you want
the model to behave, these RLHF and preference tuning
techniques also improve performance.
So on things like math evals and code evals,
there is something innate to these what is called
contrastive loss functions.
We could start to get into RL here.
We don't really need to.
But RLHF also boosts performance on anything
from a chat task to a math problem
to a code problem.
So it is becoming a much more useful tool to these labs.
So this kind of takes us through the arc of we've talked about pre-training, hard to get
rid of things.
We've talked about post-training and how post-training, you can mess it up.
It's a complex multifaceted optimization with 10 to 100 person teams converging on one artifact.
It's really easy to not do it perfectly.
And then there's the third case, which is what we talked about Gemini.
The thing that was about Gemini is this was a served product where
Gemini, Google has their internal model weights.
They've done all these processes that we talked about.
And in the served product, what came out after this was that they had a prompt
that they were rewriting user queries to boost diversity or something.
And this just made it,
the outputs were just blatantly wrong.
It was some sort of organizational failure
that had this prompt in that position.
And I think Google executives probably have owned this.
I don't pay attention to that level of detail.
But it was just a mess up in execution
that led to this ridiculous thing, but at the system level.
The model weights might have been fine.
So at the very end of the pipeline, there was a rewriting.
To something like a system prompt.
It was like the system prompt,
or what is called in industry, rewriting prompts.
So especially for image models,
if you're using DALL-E or ChatGPT to generate an image,
you'll say, draw me a beautiful car.
With these leading image models,
they benefit from highly descriptive prompts.
So what would happen is, if you do that on ChatGPT,
a language model behind the scenes will rewrite the prompt,
say, make this more descriptive,
and then that is passed to the image model.
So prompt rewriting is something that is used at multiple levels of industry,
and it's used effectively for image models,
and the Gemini example is just a failed execution.
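A minimal sketch of that prompt-rewriting step, with rewrite_prompt_with_llm and generate_image standing in for whatever language and image models actually sit behind the product:

```python
def rewrite_prompt_with_llm(short_prompt: str) -> str:
    # Placeholder for asking a language model to
    # "make this image prompt more descriptive".
    return (
        f"{short_prompt}, highly detailed, dramatic studio lighting, "
        "wide-angle composition, photorealistic"
    )


def generate_image(descriptive_prompt: str) -> bytes:
    # Placeholder for the actual image model call.
    return b"...image bytes..."


user_prompt = "draw me a beautiful car"
image = generate_image(rewrite_prompt_with_llm(user_prompt))
```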
Big philosophical question here with RLHF to generalize.
Where is human input, human in the loop, human data, most useful at the current stage?
For the past few years, the highest cost human data has been in these preferences,
which is comparing, I would say, highest cost and highest total usage. So a lot of money has
gone to these pairwise comparisons where you have two model outputs and a human is comparing
between the two of them. In earlier years, there was a lot of this instruction tuning data. So
creating highly specific examples to something
like a Reddit question to a domain that you care about.
Language models used to struggle on math and code.
So you would pay experts in math and code
to come up with questions and write detailed answers
that were used to train the models.
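For concreteness, this is roughly what those two kinds of human data look like as records; both examples below are invented for illustration, not real training data. A pairwise preference record stores two completions and which one the annotator preferred, while an instruction-tuning record stores a prompt and one carefully written answer.

```python
# Invented examples of the two human-data formats described above.

pairwise_preference_example = {
    "prompt": "Explain why the sky is blue.",
    "completion_a": "Rayleigh scattering: shorter wavelengths scatter more strongly ...",
    "completion_b": "Because it reflects the ocean.",
    "preferred": "a",  # the human annotator's judgment
}

instruction_tuning_example = {
    "prompt": "Write a Python function that reverses a string.",
    "completion": "def reverse(s):\n    return s[::-1]",
}
```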
Now it is the case that there are many model options
that are way better than humans at writing detailed
and eloquent answers
for things like math and code.
So they talked about this with the Llama 3 release
where they switched to using Llama 3 405B
to write their answers for math and code.
But they, in their paper, talk about how they use
extensive human preference data,
which is something that they haven't gotten AIs to replace.
There are other techniques in industry
like constitutional AI, where you use human data for preferences and AI for preferences,
and I expect the AI part to scale faster than the human part. But among the research that we have
access to is that humans are in this kind of preference loop. So for as reasoning becomes
bigger and bigger and bigger, as we said, where's the role of humans in that? It's even less prevalent. The remarkable thing about these reasoning results, and especially the DeepSeek R1 paper, is this result that they call DeepSeek R1-Zero, which is, they took one of these pre-trained models, they took DeepSeek V3 Base, and then they do this reinforcement learning optimization on verifiable questions or verifiable rewards, for a lot of questions and a lot of training.
And these reasoning behaviors emerge naturally.
So these things like wait, let me see, wait, let me check this.
Oh, that might be a mistake.
And they emerge from only having questions and answers.
And when you're using the model, the part that you look at is the completion.
So in this case, all of that just
emerges from this large-scale RL training. And that model, for which the weights are available, has no human preferences added into the post-training. The DeepSeek R1 full model has some of this human preference tuning, this RLHF, after the reasoning stage. But the very remarkable thing is that you can get these reasoning behaviors, and it's very unlikely that there's humans writing out reasoning chains. It's very unlikely that they somehow hacked OpenAI and got access to OpenAI o1's reasoning chains.
It's something about the pre-trained language models
and this RL training where you reward the model
for getting the question right.
And therefore it's trying multiple solutions
and it emerges this chain of thought.
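A heavily simplified sketch of that verifiable-reward setup, not DeepSeek's actual code: sample several completions to a question, extract a final answer, and reward only the ones whose answer matches the reference. The ANSWER: convention and the policy object with sample and update methods are assumptions for illustration.

```python
import re


def verifiable_reward(completion: str, gold_answer: str) -> float:
    # Reward 1.0 only if the explicitly stated final answer matches the reference.
    match = re.search(r"ANSWER:\s*(.+)", completion)  # assumed answer format
    if match and match.group(1).strip() == gold_answer.strip():
        return 1.0
    return 0.0


def rl_step(policy, question: str, gold_answer: str, num_samples: int = 8):
    # Sample several chains of thought, score each with the verifier,
    # and push the policy toward the higher-reward samples.
    completions = [policy.sample(question) for _ in range(num_samples)]
    rewards = [verifiable_reward(c, gold_answer) for c in completions]
    policy.update(question, completions, rewards)  # e.g. a PPO/GRPO-style update
    return rewards
```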
This might be a good place to mention the eloquent
and the insightful tweet of the great and the powerful Andrej Karpathy.
I think he had a bunch of thoughts,
but one of them was, last thought, not sure if this is obvious. You know something profound is coming when you're saying you're not sure if it's obvious.
There are two major types of learning,
in both children and in deep learning.
There's one, imitation learning, watch and repeat,
i.e. pre-training, supervised, fine-tuning,
and two, trial and error learning, reinforcement learning.
My favorite simple example is AlphaGo.
One is learning by imitating expert players.
Two is reinforcement learning to win the game.
Almost every single shocking result of deep learning and the source of all magic is always
two.
Two is significantly more powerful.
Two is what surprises you.
Two is when the paddle learns to hit the ball behind the blocks in Breakout.
Two is when AlphaGo beats even Lee Sedol.
And two is the aha moment when DeepSeek or O1, et cetera, discovers that it works well
to reevaluate your assumptions,
backtrack, try something else, et cetera.
It's the solving strategies you see this model use
in its chain of thought.
It's how it goes back and forth thinking to itself.
These thoughts are emergent, three exclamation points,
and this is actually seriously incredible,
impressive and new,
and is publicly available and documented.
The model could never learn this with imitation
because the cognition of the model
and the cognition of the human
labeler is different.
The human would never know to correctly annotate these kinds of solving strategies and what
they should even look like.
They have to be discovered during reinforcement learning as empirically and statistically
useful towards the final outcome.
Anyway, the AlphaZero sort of metaphor analogy here.
Can you speak to that, the magic of the chain of thought
that he's referring to?
I think it's good to recap AlphaGo and AlphaZero
because it plays nicely with these analogies
between imitation learning and learning from scratch.
So AlphaGo, the beginning of the process, was learning from humans. They started the first expert-level Go player, or chess player, in DeepMind's series of models with some human data. And then the reason why it is called AlphaZero is that there was zero human data in the loop. And that change to AlphaZero made a model that was dramatically more powerful for DeepMind. So this removal of the human prior, the human inductive bias, makes the final system far more powerful.
We mentioned the bitter lesson hours ago, and this is all aligned with it. And then there's been a lot of discussion in language models. This is not new. This goes back to the whole Q-star rumors, which, if you piece together the pieces, was probably the start of OpenAI figuring out its O1 stuff, when the Q-star rumors came out last year in November.
There's a lot of intellectual drive to know
when is something like this going to happen
with language models?
Because we know these models are so powerful
and we know it has been so successful in the past.
And it is a reasonable analogy that this new type
of reinforcement learning training
for reasoning models is when the door is open to this.
We don't yet have the equivalent of move 37, which is the famous move where DeepMind's AI playing Go stumped Lee Sedol completely. We don't have something that's that level of focal point, but that doesn't mean that the approach to the technology is different, and the impact of the general training, it's still incredibly new.
Well, what do you think that point would be?
What would be move 37 for chain of thought for reasoning?
Scientific discovery.
Like when you use this sort of reasoning on a problem, and it does something we fully don't expect.
I think it's actually probably simpler than that.
It's probably something related to computer use or robotics, rather than scientific discovery.
The important aspect here is models take so much data to learn, they're not sample efficient.
They take the entire web, over 10 trillion tokens, to train on. This would take a human thousands of years to read. A human does not read all that, and models know a lot of the stuff better than humans, right? Yet humans are way, way, way more sample efficient.
That is because of the self-play, right?
How does a baby learn what its body is?
As it sticks its foot in its mouth and it says,
oh, this is my body, right?
It sticks its hand in its mouth and it calibrates its touch
on its fingers with the most sensitive
touch thing on its tongue.
It's how babies learn.
And it's just self-play over and over and over and over again.
And now we have something that is similar to that with these verifiable proofs, whether
it's a unit test in code or a mathematical verifiable task, generate many traces of reasoning, right?
And keep branching them out, keep branching them out.
And then check at the end,
hey, which one actually has the right answer?
Most of them are wrong, great.
These are the few that are right.
Maybe we use some sort of reward model outside of this
to select even the best one to preference as well.
But now you've started to get better and better
at these benchmarks.
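A sketch of that branch-and-filter loop, with sample_trace, final_answer, and reward_model standing in for the model, the answer extractor, and an optional ranking model:

```python
def filter_correct_traces(problem, gold_answer, sample_trace, final_answer,
                          num_branches=64, reward_model=None):
    # Generate many candidate reasoning traces for one problem.
    traces = [sample_trace(problem) for _ in range(num_branches)]
    # Keep only the traces whose final answer is verifiably correct.
    correct = [t for t in traces if final_answer(t) == gold_answer]
    # Optionally rank the survivors with a reward model to prefer the best ones.
    if reward_model is not None:
        correct.sort(key=reward_model, reverse=True)
    return correct  # for hard problems most branches fail, and that's expected
```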
And so you've seen over the last six months
a skyrocketing in a lot of different benchmarks, right? All math and code benchmarks were pretty much solved, except for FrontierMath, which is designed to be almost entirely questions that aren't practical to most people, because they're exam-level open math problem type things. So it's the math problems that are somewhat reasonable, which is somewhat complicated word problems or coding problems.
That's just what Dylan is saying.
So the thing here is that
these are only with verifiable tasks.
We earlier showed an example of the really interesting,
like what happens when chain of thought
is applied to a non-verifiable thing.
It's just like a human chatting, right?
With thinking about what's novel for humans, right?
A unique thought.
But this task and form of training
only works when it's verifiable.
And from here, the thought is, okay, we can continue to scale this current training method
by increasing the number of verifiable tasks.
In math and coding, coding probably has a lot more to go.
Math has a lot less to go in terms of what are verifiable things.
Can I create a solver that then I generate trajectories
toward or traces towards, reasoning traces towards,
and then prune the ones that don't work
and keep the ones that do work?
Well, those are gonna be solved pretty quickly,
but even if you've solved math,
you have not actually created intelligence, right?
And so this is where I think the like,
aha moment of computer use or robotics will come in
because now you have a sandbox or a
playground that is infinitely verifiable. Messing around on the
internet, there are so many actions that you can do that are verifiable. It'll
start off with like log into a website, create an account, click a button here,
blah blah blah, but it'll then get to the point where it's, hey go do a task on
Tasker or whatever all these various task websites, hey, go get hundreds of likes, right?
And it's gonna fail, it's gonna spawn hundreds of accounts.
It's gonna fail on most of them, but this one got to a thousand.
Great, now you've reached the verifiable thing.
And you just keep iterating this loop over and over.
And that's when, and same with robotics, right?
That's where you have an infinite playground of tasks,
like, hey, did I put the ball in the bucket?
All the way to like, oh, did I like build a car?
Right, like, you know, there's a whole trajectory to speed run or, you know, what models can do.
But at some point, I truly think that like, you know, we'll spawn models and initially all the
training will be in sandboxes. But then at some point, you know, the language model pre-training
is going to be dwarfed by what is this reinforcement learning? You know, you'll pre-train a multimodal
model that can see, that can read, that can write, you know, blah, blah, blah, whatever vision, audio,
et cetera. But then you'll have it play in a sandbox infinitely and figure out figure
out math, figure out code, figure out navigating the web, figure out operating a robot arm,
right? And then it'll learn so much. And the aha moment I think will be when this is available
to then create something that's not good, right?
Like, oh cool, part of it was like figuring out how to use the web. Now all of a sudden,
it's figured out really well how to just get hundreds of thousands of followers that are real and real engagement on Twitter because all of a
sudden this is one of the things that are verifiable.
And maybe not just engagement, but make money.
Yes.
I've become an... I mean, that could be the thing where it's almost fully automated.
It makes, you know, $10 million by being an influencer, selling a product, creating the product like.
And I'm not referring to like a hype product, but an actual product.
Like, holy shit, this thing created a business.
It's running it.
It's the face of the business, that kind of thing.
Or maybe a number one song,
like it creates the whole infrastructure required
to create the song to be the influence
of the representative of that song and that kind of thing.
It makes a lot of them.
That could be the, I mean, our culture
respects money in that kind of way.
And it's verifiable, right?
It's verifiable, right.
The bank account can't lie.
Exactly.
There is surprising evidence that once you set up the ways of collecting the verifiable
domain that this can work.
There's been a lot of research before this R1 on math problems and they approach math
with language models just by increasing the number of samples.
So you can just try again and again and again.
And you look at the amount of times that the language models get it right.
And what we see is that even very bad models get it right sometimes.
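That repeated-sampling observation is easy to state as a measurement; a sketch, with sample and check_answer standing in for a model call and a verifier:

```python
def solve_rate_with_k_samples(problems, sample, check_answer, k=32):
    # Fraction of problems where at least one of k sampled answers is correct.
    solved = 0
    for question, gold_answer in problems:
        attempts = [sample(question) for _ in range(k)]
        if any(check_answer(attempt, gold_answer) for attempt in attempts):
            solved += 1
    return solved / len(problems) if problems else 0.0
```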
And the whole idea behind reinforcement learning is that you can learn from very sparse rewards.
The space of language and the space of tokens, whether you're generating language or tasks for a robot, is so big. I mean, the tokenizer for a language model can be like 200,000 things, so at each step it can sample from that big of a space. So if it can generate a bit of a signal that it can climb onto, that's what the whole field of RL is around: learning from sparse rewards. And the same thing has played out in math, where very weak models sometimes generate the right answers. We see research already that you can boost their math scores.
You can do this sort of RL training for math.
It might not be as effective, but if you take a 1 billion parameter model, so something
600 times smaller than DeepSeek, you can boost its grade school math scores very directly
with a small amount of this training.
So it's not to say that this is coming soon.
Setting up the verification domains is extremely hard
and there's a lot of nuance in this,
but there are some basic things that we have seen before
where it's at least expectable that there's a domain
and there's a chance that this works.
All right, so we have fun things happening in real time.
This is a good opportunity to talk about
other reasoning models, O1, O3.
Just now OpenAI, as perhaps expected, released O3 mini.
What are we expecting from the different flavors?
Can you just lay out the different flavors of the O models and from Gemini, the reasoning
model?
Something I would say about these reasoning models is we talked a lot about reasoning
training on math and code.
And what is done is that you have the base model
we've talked about a lot on the internet.
You do this large scale reasoning training
with reinforcement learning.
And then what the DeepSeek paper detailed in this R1 paper, which for me was one of the big open questions on how you do this, is that they did reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL.
So they did the same things with a form of instruction tuning through rejection sampling,
which is essentially heavily filtered instruction tuning with some reward models.
And then they did this RLHF, but they made it math heavy.
So some of this transfer, we looked at this philosophical example early on.
One of the big open questions is how much does this transfer?
If we bring in domains after the reasoning training,
are all the models gonna become eloquent writers
by reasoning?
Is this philosophy stuff going to be open?
We don't know in the research of how much this will transfer.
There's other things about how we can make soft verifiers
and things like this, but
there is more training after reasoning, which makes it easier to use these reasoning models
and that's what we're using right now.
So that's what we're going to talk about with O3 Mini and O1.
These have gone through these extra techniques that are designed for human preferences after
being trained to elicit reasoning.
I think one of the things that people are ignoring is Google's Gemini flash thinking
is both cheaper than R1 and better.
And they released it in the beginning of December.
And nobody's talking about it.
No one cares.
It has a different flavor to it.
Its behavior is less expressive than something like O1.
It's on fewer tracks.
Qwen released a model last fall, QwQ, which was their preview reasoning model.
And DeepSeq had R1 Lite last fall,
where these models kind of felt like they're on rails,
where they really, really only can do math and code.
And O1 is, it can answer anything.
It might not be perfect for some tasks,
but it's flexible, it has some richness to it.
And this is kind of the art of, like, how is a model cooked, how is a model a little bit undercooked?
It's like it's good to get a model out the door. But it's hard to gauge and it takes a lot of taste
to be like, is this a full fledged model? Can I use this for everything? And they're probably
more similar for math and code. My quick read is that Gemini Flash is like, not trained the same
way as O1, but taking an existing training stack,
adding reasoning to it.
So taking a more normal training stack
and adding reasoning to it.
And I'm sure they're gonna have more.
I mean, they've done quick releases on Gemini Flash,
the reasoning, and this is the second version
from the holidays.
It's evolving fast and it takes longer
to make this training stack where you're doing
this large scale RL.
I just got the same question from earlier.
The one about the-
The human nature.
Yeah.
What was the human nature?
The way I can ramble about this so much is that we've been working on this at AI2 before
O1 was fully available to everyone and before R1, which is essentially using this RL training for fine tuning.
We use this in our like TULU series of models
and you can elicit the same behaviors
where you say like wait and so on,
but it's a smaller amount of this in the training process, so this kind of reasoning expression is much lighter.
So you can, there's essentially a gradation
and just how much of this RL training you put into it
determines how the output looks.
So we're now using Gemini 2.0 Flash Thinking Experimental 01-21.
It summarized the problem as humans,
self domesticated apes.
The perspective, okay.
All right, so wait, is this reviewing the reasoning?
Here's why this is novel.
Okay.
Click to expand.
Okay.
Analyze the request.
Novel is the keyword.
See how it just looks a little different?
It looks like a normal output.
Yeah, it's, I mean, in some sense, it's better structured.
It makes more sense.
Oh, and it latched onto human, and then it went into organisms, and oh wow.
Apex predator, focus on domestication,
apply domestication to humans,
explore the idea of self-domestication.
Not good, not good.
Where is this going?
Refine and articulate the insight,
greater facial expressiveness and communication ability, yes.
Plasticity and adaptability, yes.
Dependence on social groups, yes.
All right.
And self-critique and refine further.
Wow.
Is this truly novel?
Is it well supported?
So on and so forth.
And the insight it's getting at is humans
are not just social animals,
but profoundly self-domesticated apes.
And this self-domestication is the key to understanding
our unique cognitive and social abilities.
Self-domesticated apes.
I prefer the deep-seek response.
I mean, it's novel.
The insight is novel.
I mean, that's like a good book title,
Self-Domesticated Apes.
Like, there could be a case made for that.
I mean, yeah, it's cool.
And it's revealing the reasoning.
It's magical.
It's magical.
Like, this is really powerful.
Hello, everyone.
This is Lex with a quick intermission
recorded after the podcast.
Since we reviewed responses from DeepSeek R1
and Gemini Flash 2.0 thinking during this conversation,
I thought at this moment,
it would be nice to insert myself quickly doing the same
for OpenAI O1 Pro and O3 Mini with the same prompt. The prompt
being, give one truly novel insight about humans. And I thought I would in general
give my vibe check and vibe based anecdotal report on my own experiences
with the new O3 Mini model, now that I got a chance to spend many hours with it
in different kinds of contexts and applications.
So I would probably categorize this question
as a, let's say, open-ended philosophical question,
and in particular, the emphasis on novelty,
I think is a nice way to test one of the capabilities
of the model, which is come up with something that makes you pause and almost surprise you with its brilliance. So that said, my
general review after running each of the models on this question a bunch of times
is that O1 Pro consistently gave brilliant answers. Ones that gave me pause and made me think.
Both cutting in its insight and just really nicely phrased
with wit, with clarity, with nuance,
over and over consistently generating the best answers.
After that is R1, which is less consistent,
but again, deliver brilliance.
Gemini Flash 2.0 thinking was third and
last was O3 Mini, actually. It often gave quite a generic answer, at least to my
particular sensibilities. That said in a bunch of other applications that I
tested for brainstorming purposes it actually worked extremely well and often outperformed R1.
But on this open-ended philosophical question, it did consistently worse.
Now, another important element for each of these models is how the reasoning is presented.
DeepSeek R1 shows the full chain of thought tokens, which I personally just love. For these open-ended
philosophical questions,
it's really, really interesting to see the model
think through it.
But really also just stepping back,
me as a person who appreciates intelligence
and reasoning and reflection,
reading these kind of chain of thought raw tokens of R1,
there's something genuinely beautiful
about observing the path of deliberation in an intelligence
system.
I think we don't always have that explicitly laid out for us humans.
So to see it in another intelligence system, the nonlinearity of it, akin to Ulysses or Finnegans Wake by James Joyce, it's just beautiful to watch.
Anyway, as we discussed in the episode, DeepSeek R1 talked about humans being able to convert
selfish desires into cooperative systems by collectively pretending abstract rules like
money laws and rights are real.
And these shared hallucinations act as games where competition is secretly redirected to
benefit the group, turning conflict into society's fuel.
Gemini 2.0 Flash Thinking said, humans are not just social animals, but self-domesticated apes,
and this self-domestication is the key to understanding our unique cognitive and social
abilities. Now, it's important to say that the chain of thought there was really interesting.
It was looking through the entire evolution of life on earth, considering apex predators, and considering how from that
we ended up to where we are. I think that domestication by choice is a really interesting
angle. Again, it's one of those things when somebody presents a different angle on a seemingly
obvious thing,
it just makes me smile.
And the same with DeepSeek R1, that these hallucinations of money, laws, and rights,
and us collectively pretending like it's real, and we play games with them that look like
competition when secretly we're just cooperating with each other.
And that is the fuel of progress, beautifully put. Now, OpenAI O1 Pro consistently, over and over, delivered bangers. I can go through many of them, but the first one was: humans are the only species that turns raw materials into symbolic resources, then uses those symbols to reorganize the very materials they came from, creating a closed feedback loop between meaning and matter.
Here, I just ran it again.
Banger after banger, I'm telling you.
Humans are unique among known species in that they simultaneously rewrite two layers of
reality, the external world and their own private mental landscapes, and then merge
these two rewritten layers into a continuous personal narrative that feels objectively true.
Feels true.
This is poetry.
Okay. And then O3 Mini High for me was smart, fast actually, and kind of generic. It never quite got there for me. So here's the first one I got from O3 Mini.
Humans are not fixed beings, but rather ongoing narratives, dynamic stories that we continuously
write, edit, and reinterpret.
This narrative plasticity is more than just memory or self-reflection.
It's an intrinsic cognitive process that acts like an internal error correction system. It allows us to adapt our identities and
values over time in response to new experiences, challenges, and social
contexts. Now it almost sneaks up to something approximating a cutting insight, with narrative plasticity in quotes. But then it goes back to the sort of the
generic. I don't know, all of these models are incredible
for different reasons.
There's a lot of concerns as we discuss in this episode, but there's a lot of reasons
to be excited as well.
And I've probably spoken for too long.
I am severely sleep deprived, borderline delirious.
So hopefully some of this made sense. And now dear friends,
back to the episode. I think when you, you know, to Nathan's point, when you look at like the
reasoning models, to me, even when I used R1 versus O1, there was like that sort of rough edges around
the corner feeling, right?
And flash thinking, you know, earlier, I didn't use this version, but the one from December
and definitely had that rough edges around the corner feeling, right?
Where it's just not fleshed out in as many ways, right?
Sure they added math and coding capabilities via these verifiers in RL, but you know, it
feels like they lost something in certain areas. And O1 is worse performing than
Chat in many areas as well, to be clear. Not by a lot.
Not by a lot though, right? And it's like R1 definitely felt to me like it was worse than V3
in certain areas, like doing this RL expressed and learned a lot, but then it weakened in other areas.
And so I think that's one of the big differences between these models
and what O1 offers. And then OpenAI has O1 Pro. And what they did with O3, which is also very
unique, is that they stacked search on top of Chain of Thought. And so Chain of Thought is one
thing where it's one chain, it backtracks, goes back and forth. But how they solved the Arc AGI challenge was not just
the chain of thought. It was also sampling many times, i.e. running them in parallel and then
selecting. Is running in parallel actually search? Because I don't know if we have the full information
on how O1 Pro works. So like I'm not, I don't have enough information to confidently say that it is
search. It is parallel samples. Yeah. And then what? And it selects something.
And we don't know what the selection function is.
The reason why we're debating is because since O1 was announced, there's been a lot of interest
in techniques called Monte Carlo Tree Search, which is where you will break down the chain
of thought into intermediate steps.
We haven't defined chain of thought.
Chain of thought is from a paper from years ago where you introduce the idea to ask a
language model that at the time was much
less easy to use. You would say let's verify step by step and it would induce the model to do this
bulleted list of steps. Chain of thought is now almost a default in models where if you ask it a
math question you don't need to tell it to think step by step and the idea with Monte Carlo tree
search is that you would take an intermediate point in that chain, do some sort of expansion, spend more compute, and then select the right one. That's a very complex form of search that has been used in things like MuZero and AlphaZero, potentially. I know MuZero does this. Another form of search is just asking five different people and then taking
the majority answer. Yes. Right? So there's a variety of like, you know, it could be complicated,
it could be simple. We don't know what it is, just that they are not just issuing one chain of thought in sequence.
They're launching many in parallel.
And in the Arc AGI, they launched a thousand in parallel for the one that really shocked
everyone that beat the benchmark.
They would launch a thousand in parallel, and then they would get the right answer,
like 80% of the time or 70% of the time maybe, whereas if they just launched one it was like 30%.
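The simplest version of "launch many in parallel and then select" is majority voting over final answers; a sketch, with sample_answer standing in for one full chain-of-thought run. Whatever OpenAI actually does for O1 Pro and O3 is not public, as discussed here.

```python
from collections import Counter


def majority_vote_answer(question, sample_answer, num_parallel=1000):
    # Run many independent chains of thought and take the most common final answer.
    answers = [sample_answer(question) for _ in range(num_parallel)]
    best_answer, votes = Counter(answers).most_common(1)[0]
    return best_answer, votes / num_parallel  # the answer and its vote share
```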
there are many extensions to this I would say the simplest one is that our
language models today have been designed to give the right answer the highest
percentage of the time in one response and we are now opening the door to
different ways of running inference on our models in which we need to
reevaluate many parts
of the training process,
which normally opens the door to more progress,
but we don't know if OpenAI changed a lot
or if just sampling more and multiple choices
what they're doing, or if it's something more complex
where they changed the training
and they know that the inference mode
is going to be different.
So we're talking about O1 Pro, $200 a month, and they're losing money. So this thing that we're referring to, this fascinating exploration of the test-time compute space, is that actually possible? Do we have enough compute for that? Do the financials make sense?
So the fantastic thing is, and it's in the thing that I pulled up earlier, the cost for GPT-3 has plummeted. If you scroll up just a few images, I think.
The important thing about, hey, is cost a limiting factor here, right?
My view is that we'll have really awesome intelligence, like AGI, before we have it permeate throughout the economy. And this is sort of the reason why, right?
GPT-3 was trained in what, 2020, 2021?
And the cost for running inference on it was $60,
$70 per million tokens, right?
The cost per intelligence was ridiculous.
Now, as we scaled forward two years,
we've had a 1200x reduction in cost
to achieve the same level of intelligence as GPT-3. So here on the x-axis is time over just a couple
of years and on the y-axis is log scale dollars to run inference on a million tokens. And so you have just a down, like a linear decline
on log scale from GPT-3 through 3.5 to Llama.
It's like five cents or something like that now, right?
Which, versus $60, is 1200x. Those aren't the exact numbers, but it's roughly 1200x. I remember that number. That was a humongous cost per intelligence, right?
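As a rough sanity check on that figure, with illustrative rather than exact numbers:

```python
gpt3_cost_at_launch = 60.00   # dollars per million tokens, approximate
cost_today = 0.05             # roughly five cents per million tokens
print(gpt3_cost_at_launch / cost_today)  # 1200.0, the quoted ~1200x reduction
```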
Now the freak out over DeepSeek is,
oh my God, they made it so cheap.
It's like, actually, if you look at this trend line,
they're not below the trend line, first of all,
and at least for GPT-3, right?
They are the first to hit it, right?
Which is a big deal.
But they're not below the trend line as far as GPT-3.
Now we have GPT-4, what's gonna happen
with these reasoning capabilities, right?
It's a mix of architectural innovations, it's a mix of better data, and it's going to be
better training techniques and all of these different better inference systems, better
hardware, right?
Going from each generation of GPU to new generations or ASICs, everything is going to take this
cost curve down and down and down and down.
And then can I just spawn a thousand different LLMs to create a task and then pick from one of them?
Or whatever search technique I want, a tree, Monte Carlo tree search, maybe it gets that complicated.
Maybe it doesn't because it's too complicated to actually scale, like who knows?
Bitter lesson, right?
The question is, I think, when, not if, because the rate of progress is so fast, right?
Dario said nine months ago the cost to train and inference was this, right? And now we're much better than this, right? And DeepSeek is much better than this. And that cost curve for GPT-4, which was also roughly $60 per million tokens when it launched, has already fallen to, you know, $2 or so.
And we're going to get it down to cents, probably, for GPT-4 quality.
That's the base for the reasoning models like O1 that we have today and O1 Pro is spawning
multiple and O3 and so on and so forth.
These search techniques, too expensive today, but they will get cheaper.
And that's what's going to unlock the intelligence, right?
So get cheaper and cheaper and cheaper.
The big DeepSeek R1 release freaked everybody out because of the cheaper.
One of the manifestations of that is Nvidia stock plummeted.
Can you explain what happened?
I mean, and also just explain this moment and whether, you know, if Nvidia is gonna keep winning.
We're both Nvidia bulls here, I would say.
And in some ways, the market response is reasonable.
Most of the market, like,
Nvidia's biggest customers in the US are major tech companies
and they're spending a ton on AI.
And a simple interpretation of DeepSeek is you can get really good models without spending as much on AI. So in that capacity, it's like, oh, maybe these big tech
companies won't need to spend as much on AI and go down. The actual thing that happened
is much more complex where there's social factors, where there's the rising in the app
store, the social contagion that is happening. And then I think a lot of some of it is just
like, I'm not, I don't trade, I don't know anything about financial markets, but it builds up over the weekend, the social pressure. If it was during the week, there would have been multiple days of trading while this was really building, but it comes over the weekend, and then everybody wants to sell. And that is a social contagion.
And I think there are a lot of false narratives, which is like, hey, these guys are spending billions on models, right? They're not spending billions on models. No one spent more than a billion dollars on a model that's released publicly, right? GPT-4 was a couple hundred million, and then they've reduced the cost with 4 Turbo and 4o, right? But billion-dollar model runs are coming, right? And this includes pre-training and post-training, right? And then the other number is like, hey, DeepSeek didn't include
everything, right? They didn't include, a lot of the cost goes to research
and all this sort of stuff. A lot of the cost goes to inference. A lot of the cost goes to post-training.
None of these things were factored. It's research salaries. All these things are counted in the
billions of dollars that OpenAI is spending, but they weren't counted in the, hey,
six million, five million dollars that Deep Seek spent. So there's a bit of misunderstanding of
what these numbers are. And then there's also an element of,
Nvidia has just been a straight line up, right?
And there's been so many different narratives
that have been trying to push down Nvidia.
I don't say push down Nvidia stock.
Everyone is looking for a reason to sell
or to be worried, right?
You know, it was Blackwell delays, right?
Their GPU, you know, there's a lot of report.
Every two weeks, there's a new report
about their GPUs being delayed.
There's the whole thing about scaling laws ending, right?
It's so ironic, right?
It lasted a month.
It was just, like, literally just,
hey, models aren't getting better, right?
They're just not getting better.
There's no reason to spend more.
Pre-training scaling is dead.
And then it's like, O1, O3, right? R1, right?
And now it's like, wait, models are getting too,
they're progressing too fast.
Slow down the progress, stop spending on GPUs, right?
But you know, the funniest thing I think
that like comes out of this is,
Jevons paradox is true, right?
AWS pricing for H100s has gone up
over the last couple weeks, right?
Since a little bit after Christmas, since V3 was launched, AWS H100 pricing has gone up.
H200s are like almost out of stock everywhere
because H200 has more memory,
and therefore R1 wants that chip over H100, right?
We were trying to get GPUs on a short notice this week
for a demo and it wasn't that easy.
We were trying to get just like 16 or 32 H100s for demo
and it was not very easy.
So for people who don't know,
Jevons paradox is when the efficiency goes up,
somehow magically, counter-intuitively,
the total resource consumption goes up as well.
Right, and semiconductors is,
we're at 50 years of Moore's law,
every two years half the cost, double the transistors.
Just like clockwork, and it's slowed down, obviously,
but the semiconductor industry
has gone up the whole time, right?
It's been wavy, right?
There's obviously cycles and stuff,
and I don't expect AI to be any different, right?
There's gonna be ebbs and flows,
but this is, in AI, it's just playing out
at an insane time scale, right?
It was two X every two years.
This is 1,200 X in like three years, right?
So it's like the scale of improvement
that is like hard to wrap your head around.
Yeah, I was confused, because to me, NVIDIA's stock should have gone up, but maybe it went down because there's kind of suspicion of foul play on the side of China or something like this. But if you just look purely at the actual principles at play here, it's obvious.
Yeah, the Jevons paradox.
More progress that AI makes, or the higher the derivative of AI progress is, especially,
you should, because NVIDIA is in the best place. The higher the derivative is, the sooner the
market's going to be bigger and expanding, and NVIDIA is the only one that does everything
reliably right now.
Because it's not like an Nvidia competitor arose.
It's another company that's using Nvidia.
Who historically has been a large Nvidia customer.
Yeah.
And has press releases about them
cheering about being China's biggest Nvidia customer.
Yeah, I mean.
Obviously they've quieted down,
but I think that's another element of it
is that they don't want to say how many GPUs they have. Because, hey, yes they have H800s, yes they have H20s, and they also have some H100s, which were smuggled in.
Can you speak to that, to the smuggling? What's the scale of smuggling that's feasible for a nation-state to do for companies? Is it possible to...?
I think there's a few angles of smuggling here. One is, ByteDance arguably is the largest smuggler of GPUs for China.
China's not supposed to have GPUs. ByteDance has over 500,000 GPUs. Why? Because they're all rented from companies around the world.
They rent from Oracle, they rent from Google, they rent from all these massive cloud companies, and a bunch of smaller cloud companies too, all the neoclouds of the world, right? They rent so, so many GPUs. They also buy a bunch, right?
And they do this for mostly like what Meta does, right?
Serving TikTok, right?
Serving the next best thing. Same discussion. Same as Meta, right? To be clear, that's the use today, right? And it's a valid use, right? Hack the dopamine circuit, right?
Now, that's theoretically now very much restricted
with the AI diffusion rules,
which happened in the last week of the Biden admin and
the Trump admin looks like they're gonna keep them, which limits even allies, like Singapore. Singapore is like 20 to 30 percent of NVIDIA's revenue, but Singapore's had a moratorium on building data centers for like 15 years because they don't have enough power. So where are those GPUs going? Are they all going to China? Right, but, you know, many are going to Malaysia, including Microsoft and Oracle, which have big data centers in Malaysia. Like, you know,
they're going all over Southeast Asia, probably India as well, right? Like there's stuff routing,
but like the diffusion rules are very de facto. Like you can only buy this many GPUs from this
country and it's, and you can only rent a cluster this large to companies that are Chinese.
They're very explicit on trying to stop smuggling.
And a big chunk of it was, hey, let's have a random company buy 16 servers and ship them to China.
I saw a photo from someone in the semiconductor industry who leads a team for networking chips that competes with
Nvidia and he sent a photo of a guy checking into a first class United flight from San
Francisco to Shanghai or Shenzhen with a super micro box that was this big, which can only
contain GPUs.
He was booking first class because, think about it, $3,000 to $5,000 for your first class ticket. A server costs $240,000 in the US, $250,000.
You sell it for $300,000 in China.
Wait, you just got a free first class ticket
and a lot more money.
So it's like, you know, and that's like small scale smuggling.
Most of the large scale smuggling is like companies
in Singapore and Malaysia, like routing them around
or renting GPUs completely legally.
I want to jump in.
How much is the scale?
I think there's been some people with a higher-level economics understanding who say that as you go from one billion of smuggling to ten billion, it's like you're hiding certain levels of economic activity. And that's the most reasonable thing to me, that there's going to be some level where it's so obvious that it's easier to find this economic activity.
Yeah, so my belief is that last year, roughly, NVIDIA made a million H20s which are legally
allowed to be shipped to China, which we talked about is better for reasoning, inference at
least, maybe not training, but reasoning inference, and inference generally.
Then they also had a couple hundred thousand, we think like 200 to 300,000 GPUs
were routed to China from Singapore, Malaysia, the US, wherever. Companies spin up, buy 16 GPUs, 64 GPUs, whatever it is, and route them. And Huawei is known for having spun up a massive network of companies to get the materials they need after they were banned in 2018. So it's not otherworldly.
But I agree, Nathan's point is like, hey,
you can't smuggle up $10 billion of GPUs. And then the third sort of source, which is
just now banned, which wasn't considered smuggling, but is China is renting, like, I believe from
our research, right? Oracle's biggest GPU customer is ByteDance, right? And for Google,
I think it's their second biggest customer. And you go down the list of
clouds, and especially these smaller cloud companies that aren't like the hyperscalers,
think beyond CoreWeave, even Lambda, there's 60 different new cloud companies
serving NVIDIA GPUs. I think ByteDance is renting a lot of these all over. And so these companies
are renting GPUs to Chinese companies. and that was completely legal up until the
diffusion rules which happened just a few weeks ago and even now you can rent GPU clusters
that are less than 2000 GPUs or you can buy GPUs and ship them wherever you want if they're
less than 1500 GPUs.
There are still some ways to smuggle, but yeah, as the numbers grow, right? You know, a hundred-something billion dollars of revenue for NVIDIA last year,
200 something billion this year, right?
And if next year are, you know, it could, it could nearly double again or more
than double, right?
Based on like what we see with data center footprints, like being built out
all across the U S and the rest of the world, it's going to be really hard for
China to keep up with these rules, right?
Yes, there will always be smuggling, and DeepSeek-level models, GPT-4-level models, O1-level models will be capable to train on what China can get, even the next tier
above that. But if we speed run a couple more jumps, right, to billion-dollar models, 10 billion-dollar
models, then it becomes, you know, hey, there is a compute disadvantage for China for training models
and serving them. And the serving part is really critical, right?
DeepSeek cannot serve their model today, right?
It's completely out of inventory.
It's actually already started falling in the App Store in downloads, because you download it, you try to sign up, and they say we're not taking registrations because they have no capacity, right?
You open it up, you get like less than five tokens per second
if you even get your request approved, right?
Because there's just no capacity
because they just don't have enough
GPUs to serve the model, even though it's incredibly efficient.
It'd be fascinating to watch the smuggling, because, I mean, there's drug smuggling, right? That's a market, as is weapons smuggling, and GPUs will surpass that at some point.
Highest value per kilogram, probably by far.
Um, I have another question for you, Dylan.
Do you track model API access internationally?
How easy is it for Chinese companies
to use hosted model APIs from the US?
Yeah, I mean, that's incredibly easy, right?
Like OpenAI publicly stated DeepSeek uses their API, and they say they have evidence, right?
And this is another element of the training regime
is people at OpenAI have claimed
that it's a distilled model,
i.e. you're taking OpenAI's model,
you're generating a lot of output
and then you're training on that output in your model.
And even if that's the case, what DeepSeek did is still amazing, by the way, efficiency-wise.
Distillation is standard practice in industry. If you're at a closed lab where you care about terms of service and IP closely,
you distill from your own models.
If you are a researcher and you're not building any products,
you distill from the OpenAI models.
This is a good opportunity.
Can you explain big picture distillation as a process?
What is distillation?
What's the process of distillation?
We've talked a lot about training language models.
They are trained on text.
And post-training, you're trying to train
on very high quality text that you want the model to match the features of, or if you're using RL, you're letting the
model find its own thing. But for supervised fine-tuning, for preference data, you need to have some completions that the model is trying to learn to imitate. And what you do there is, instead of human data, or instead of data from the model you're currently training, you take completions from a different, normally more powerful, model.
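As a concrete, hedged illustration of that loop: sample completions from a stronger teacher model, save them, and fine-tune a student on them with ordinary supervised learning. The teacher model name, prompts, and file path below are placeholders, and the OpenAI-compatible client is just one possible way to collect teacher outputs.

```python
# Minimal sketch of distillation as supervised fine-tuning on teacher outputs.
# The teacher model name, prompts, and output path are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = [
    "Explain KV caching in one paragraph.",
    "Write a haiku about GPUs.",
]

with open("distill_data.jsonl", "w") as f:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",  # the "teacher": any stronger model you have access to
            messages=[{"role": "user", "content": prompt}],
        )
        completion = resp.choices[0].message.content
        # Each line becomes one supervised fine-tuning example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

# The student (e.g. a smaller open-weights model) is then fine-tuned on
# distill_data.jsonl with a standard SFT trainer; no human labels are needed.
```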
I think there are rumors that these big models people are waiting for, the GPT-5s of the world, the Claude 3 Opuses of the world, are used internally to do this distillation process.
There are also public examples, right? Like Meta explicitly stated, not necessarily distilling, but they used 405B as a reward model for 70B in their Llama 3.2 and 3.3.
This is all the same topic.
So is this ethical, is this legal?
Like why does that Financial Times article headline say, OpenAI says there's evidence that China's DeepSeek used its model to train a competitor?
There's a long history here, at least on the academic side and research side, because
you're trying to interpret OpenAI's rule.
OpenAI's terms of service say that you cannot build a competitor with outputs from their
model.
Terms of service are different than a license, which is essentially a contract between organizations.
So if you have a terms of service on OpenAI's account, if I violate it, OpenAI can cancel my account.
This is very different than like a license
that says how you could use a downstream artifact.
So a lot of it hinges on a word that is very unclear
in the AI space, which is what is a competitor.
And so-
And then the ethical aspect of it is like,
why is it unethical for me to train on your model
when you can train on the internet's text?
Yeah.
Right? So there's a bit of a hypocrisy because sort of OpenAI and potentially most of the
companies trained on the Internet's text without permission.
There's also a clear loophole which is that I generate data from OpenAI and then I upload it somewhere
and then somebody else trains on it and the link has been broken.
Like they're not under the same terms of service contract. There's a lot of to-be-discovered
details that don't make a lot of sense. This is why a lot of models today, even if they train on
zero OpenAI data, you ask the model who trained you, it'll say, I am ChatGPT, trained by OpenAI,
because there's so much copy paste of like OpenAI outputs
from that on the internet
that you just weren't able to filter it out.
And then there was nothing in the RL, or post-training, or SFT, whatever, that says, hey, I'm actually a model made by the Allen Institute instead of OpenAI.
We have to do this. If we serve a demo, we do research and we use OpenAI APIs because it's useful and we want to understand post-training. And our research models, they will say they're written by OpenAI unless we put in the system prompt that we talked about, like, I am Tulu, I am a language model
trained by the Allen Institute for AI.
And if you ask more people around industry, especially with post training, it's a very
doable task to make the model say who it is or to suppress the OpenAI thing.
So on some level, it might be that DeepSeek didn't care that it was saying it was made by OpenAI. If you're going to upload model weights, it doesn't really matter, because anyone that's serving it in an application and cares a lot about serving, if they're using it for a specific task, is going to tailor it to that, and it doesn't matter that it's saying it's ChatGPT. Oh, I guess one of the
ways to do that is like a system prompt or something like that.
Like if you're serving it to say that you're...
That's what we do. Like if we host a demo, we say: you are Tulu 3, a language model trained by the Allen Institute for AI.
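For what it's worth, that kind of identity override is just a fixed system message prepended to every conversation at serving time. A minimal sketch, assuming an OpenAI-compatible serving endpoint; the endpoint, model name, and identity string are illustrative:

```python
# Sketch: overriding a model's self-identification with a system prompt at serving time.
# The endpoint, model name, and identity string are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # e.g. a local server

messages = [
    {"role": "system", "content": "You are Tulu 3, a language model trained by the Allen Institute for AI."},
    {"role": "user", "content": "Who trained you?"},
]
reply = client.chat.completions.create(model="tulu-3", messages=messages)
print(reply.choices[0].message.content)  # should now answer as Tulu, not as ChatGPT
```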
We also are benefited from OpenAI data because it's a great research tool.
I mean, do you think there's any truth and value to OpenAI's claim that there's evidence that China's DeepSeek used its model to train?
I think everyone has benefited regardless because the data is on the internet.
And therefore it's in your pre-training now, right?
There are subreddits where people share the best ChatGPT outputs, and those are in your pre-training data.
I think they're trying to shift the narrative.
They're trying to protect themselves and we saw this years ago
when ByteDance was actually banned from some OpenAI APIs
for training on outputs.
There's other AI startups that most people,
if you're in the AI culture, they just told us
they trained on OpenAI outputs and they never got banned.
That's how they bootstrapped their early models.
So it's much easier to get off the ground using this
than to set up human pipelines and build a strong model.
So there's long history here.
And a lot of the communications
are seem like narrative control.
Actually, over the last couple of days, we've seen a lot of people distill DeepSeek's model into Llama models, because the DeepSeek models are kind of complicated to run inference on. They're mixture of experts and they're 600-plus billion parameters and all this, and people distilled them into the Llama models because the Llama models are so easy to serve, and everyone's built the pipelines and tooling for inference with the Llama models, because it's the open standard.
So we've seen it, we've seen a sort of roundabout, right?
Is it bad?
Is it illegal?
Maybe it's illegal, whatever.
I don't know about that, but it could break contracts. I don't think it's illegal, like, no one's going to jail for this.
I think fundamentally it's ethical, or I hope it's ethical, because the moment we ban that kind of thing, it's gonna make everybody much worse off. And also, I think you should be allowed to train on the internet. I know a lot of authors and creators are very sensitive about it. That's a difficult question. But the moment you're not allowed to train on the internet...
I agree. I have a schizo take on how you can solve this because it already works. I have
a reasonable take on it. All right, all right. So, you know, Japan has a law which you're allowed to train
on any training data and copyrights don't apply
if you wanna train a model.
B, Japan has nine gigawatts of curtailed nuclear power.
C, Japan is allowed under the AI diffusion rule
to import as many GPUs as they'd like.
So all we have to do, we have a market here to make.
We build massive data centers, we rent them to the labs,
and then we train models in a legally permissible way
and there's no ifs, ands, or buts.
And now the models have no potential copyright lawsuit
from New York Times or anything like that.
No, no, it's just completely legal.
No, so-
Genius.
The early copyright lawsuits have fallen
in the favor of AI training. I would say that the
long tail of use is going to go in the side of AI, which is if you scrape trillions of tokens of
data, you're not looking and saying this one New York Times article is so important to me. But if
you're doing audio generation for music, or image generation, and you say,
make it in the style of X person,
that's a reasonable case where you could figure out
what is their profit margin on inference.
I don't know if it's gonna be the 50-50
of YouTube creator program or something,
but I would opt into that program as a writer.
Please, like that, it's just,
it's gonna be a rough journey,
but there will be some solutions like that that make sense,
but there's a long tail where it's just on the internet.
I think there's one other aspect that the Financial Times article implied, and it leads to a more general question.
How difficult is spying, espionage, and stealing of actual secret code and data from inside companies?
How much of that is being attempted?
Code and data are hard, but ideas are easy.
Silicon Valley operates on the way that top employees get bought out by other companies
for a pay raise.
And a large reason why these companies do this is to bring ideas with them.
And, I mean, in California there are rules that certain non-competes or whatever are illegal. And whether or not there are NDAs and things, that is how a lot of this process happens. Recently, there was somebody from Gemini who helped make this one million token context length, and everyone is saying the next Llama, I mean, he went to the Meta team, is gonna have one million context length.
And that's kind of how the world works.
As far as like industrial espionage and things,
that has been greatly successful in the past, right?
The Americans did it to the Brits,
the Chinese have done it to the Americans, right?
And so on and so forth.
It is a fact of life.
And so like to argue industrial espionage can be stopped is probably unlikely.
You can make it difficult, but even then, like there's all these stories about like,
hey, F-35 and F-22 have already been like sort of like given to China in terms of design
plans and stuff.
Code and stuff, between, you know, I'd say companies, not nation states, is probably very difficult to steal, but ideas are discussed a lot, right?
Whether it be a house party in San Francisco or a company changing employees or the always
the mythical honeypot that always gets talked about, right?
Someone gets honeypotted, right?
Because everyone working on AI is a single dude who's in their 20s and 30s.
Not everyone, but an insane percentage.
So there's always like all these like, you know,
and obviously-
So a honeypot is like a spy, a female spy approaches you
and like-
Yeah.
Yeah, or male, right?
You know, it's San Francisco, right?
But as a single dude, I will say in his late 20s, right,
is like we are very easily corrupted, right?
Like, you know, not corrupted myself, but you know, we are.
Everybody else, not me.
Yeah, I'm too oblivious, and I am not single, so I'm safe from one espionage access.
Yeah, you have to make sure to close all security vulnerabilities.
So you, Dylan, collect a lot of information about each of the mega clusters for each of the major AI companies.
Can you talk about the build outs for each one that stand out?
Yeah, so I think the thing that's like really important about these mega cluster build outs is they're completely unprecedented in scale, right?
US data center power consumption has been slowly on the rise, even through the cloud computing revolution, and it's gone up to 2-3%, data center consumption as a percentage of total US power. That's been over decades of data centers, et cetera. It's been climbing slowly, but now it's 2-3%.
By the end of this decade, even when I say something like 10% by 2028 or 2030, a lot of traditional data center people say that's nuts. But then the people in AI who have really looked at this, at the Anthropics and OpenAIs, are like, that's not enough.
I'm like, okay.
This is both through globally distributed or distributed throughout the US as well as centralized clusters.
The distributed throughout the US is exciting
and it's the bulk of it.
Hey, OpenAI or Meta is adding a gigawatt, right?
But most of it is distributed through the US
for inference and all these other things.
So maybe we should lay out what a cluster is.
So does this include AWS?
Maybe it's good to talk about the different kinds
of clusters and what you mean by mega clusters
and what's a GPU and what's a computer and what is,
just kidding, not that far back, but yeah.
So like, what do we mean by the clusters?
Oh man, I thought I was about to do the Apple ad, right?
What's a computer?
So traditionally data centers and data center tasks have been a distributed systems problem
that is capable of being spread very far and widely, right? I.e., I send a request to Google,
it gets routed to a data center somewhat close to me, it does whatever search ranking recommendation,
sends a result back, right?
The nature of the task is changing rapidly in that there's two tasks
that people are really focused on now, right?
It's not database access,
it's not serve me the right page, serve me the right ad.
It's now inference, and inference is dramatically different from traditional distributed systems, but it looks a lot more similar,
and then there's training, right?
The inference side is still like,
hey, I'm gonna put thousands of GPUs
in blocks all around these data centers.
I'm gonna run models on them.
User submits a request, gets kicked off,
or hey, my service, they submit a request to my service.
They're on Word and they're like, oh yeah, help me copilot.
And it kicks it off, or I'm on my Windows,
copilot, whatever, Apple intelligence,
whatever it is, it gets kicked off to a data center, right?
And that data center does some work and sends it back.
That's inference, and that is going to be the bulk of compute. There are thousands of data centers that we're tracking with satellites and all these other things, and those are the bulk of what's being built and what's getting millions of GPUs.
But the scale of the largest cluster
is also really important, right?
When we look back at history, right?
Like, you know, or through the age of AI, right?
Like it was a really big deal when they did AlexNet
on I think two GPUs or four GPUs, I don't remember.
It was a really big deal.
It's a big deal because you use GPUs.
It's a big deal they use GPUs and they use multiple.
But then over time, its scale has just been compounding.
And so when you skip forward to GPT-3, then GPT-4,
GPT-4 was 20,000 A100 GPUs.
Unprecedented run in terms of the size and the cost.
A couple hundred million dollars on a YOLO run for GPT-4.
And it yielded this magical improvement
that was perfectly in line with what was experimented
and just a log scale.
Oh yeah, they have that plot from the paper.
The technical report.
The scaling laws were perfect.
But that's not a crazy number.
20,000 A100s, roughly each GPU is consuming 400 watts.
And then when you add in the whole server, everything,
it's like 15 to 20 megawatts of power.
Maybe you could look up what the power consumption of a human is, because the numbers
are gonna get silly, but 15 to 20 megawatts
was standard data center size.
It was just unprecedented that was all GPUs
running one task.
How many watts was a toaster? A toaster is like a similar power consumption to an A100.
H100 comes around, they increase the power from like 400 to 700 watts and that's just
per GPU and then there's all the associated stuff around it.
So once you count all that, it's roughly like 1200 to 1400 watts for everything, networking
CPUs, memory, blah, blah, blah.
So we should also say, so what's required?
You said power, so a lot of power is required,
a lot of heat is generated, so cooling is required,
and because there's a lot of GPUs that have to be,
or CPUs or whatever, they have to be connected,
so there's a lot of networking.
Yeah, so I think, yeah, sorry for skipping past that.
And then the data center itself is like complicated, right?
But these are still standardized data centers for GPT-4 scale.
Now we step forward to sort of what is the scale of clusters that people built last year, right?
And it ranges widely, right?
It ranges from like, hey, these are standard data centers and we're just using multiple of them and connecting them together really with a ton of fiber between
them, a lot of networking, etc. That's what OpenAI and Microsoft did in Arizona, right?
And so they have a hundred thousand GPUs, right? Meta, similar thing. They took their standard existing data center design, and it looks like an H, and they connected multiple of them together.
They first did 16,000 GPUs, well, 24,000 GPUs total, but only 16,000 of them were running on the training run, because GPUs are very unreliable, so they need to have spares to swap in and out, all the way to now 100,000 GPUs that they're training on, Llama 4 currently, like 128,000 or so.
Think about 100,000 GPUs with roughly 1,400 watts apiece, that's 140 megawatts, 150 megawatts for the 128,000. So you're talking about jumping from 15 to 20 megawatts to almost 10x that number, 150 megawatts, in two years, from 2022 to 2024.
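As a rough order-of-magnitude sketch of that jump, using the all-in per-GPU wattages quoted above; these are approximate figures from the conversation, not a facility power bill:

```python
# Order-of-magnitude cluster power, using the rough all-in per-GPU wattages quoted above
# (GPU plus networking, CPUs, memory, and so on). Illustrative numbers only.
def cluster_megawatts(num_gpus: int, watts_per_gpu_all_in: float) -> float:
    return num_gpus * watts_per_gpu_all_in / 1e6

print(cluster_megawatts(20_000, 1_000))   # GPT-4 era, A100s: roughly 20 MW
print(cluster_megawatts(100_000, 1_400))  # H100-class cluster: roughly 140 MW
print(cluster_megawatts(128_000, 1_400))  # ~128k GPUs: roughly 180 MW
```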
And some people like Elon, he admittedly, and he says himself got into the game a little
bit late for pre-training large language models, right? XAI was started later, right? But then he
bent heaven and hell to get his data center up and get the largest cluster in the world,
right, which is 200,000 GPUs. And he did that. He bought a factory in Memphis. He's upgrading
the substation at the same time. He's got a bunch of mobile power generation, a bunch of single-cycle gas turbines. He tapped the natural gas line that's right next to the factory, and he's just
pulling a ton of gas, burning gas. He's generating all this power. He's in a factory, an old
appliance factory that shut down and moved to China long ago. And he's got 200,000 GPUs in it.
And now what's the next scale? All the hyperscalers have done this. Now the next scale is
something that's even bigger, right?
And so Elon, just to stick on the topic, he's building his own natural gas plant, like a
proper one, right next door.
He's deploying tons of Tesla MegaPak batteries to make the power more smooth and all sorts
of other things.
He's got industrial chillers to cool the water down because he's water cooling the chips.
So all these crazy things to get the clusters bigger and bigger.
But when you look at, say, what OpenAI did with Stargate, that's in Abilene, Texas, right?
What they've announced, at least, right?
It's not built yet.
Elon says they don't have the money.
You know, there's some debates about this.
But at full scale, at least the first section is like definitely money is accounted for,
but there's multiple sections.
But full scale, that data center is going to be 2.2 gigawatts, right?
2200 megawatts of power in and roughly like 1.8 gigawatts or 1800 megawatts of power delivered
to chips, right?
Now this is an absurd scale.
2.2 gigawatts is like more than most cities, right?
To be clear.
And it delivered to a single cluster
that's connected to do training, right?
To train these models, to do both the pre-training,
the post-training, all of this stuff, right?
This is insane.
This is insane.
What is a nuclear power plant again?
Everyone is doing this, right?
Meta in Louisiana, right?
They're building two natural gas plants, massive ones, and then they're building this massive
data center.
Amazon has plans for this scale.
Google has plans for this scale.
XAI has plans for this scale, right?
All of these, the guys that are racing, the companies that are racing are racing hard,
and they're doing multi gigawatt data
centers, right?
To build this out because they think that, yeah,
if I now have, you know, obviously pre-training scaling
is gonna continue but to some extent,
but then also all this post-training stuff
where you have an RL sandbox for computer use
or whatever, right?
Like, you know, this is where they're gonna, and all these verifiable domains where they just keep learning and learning and learning, self-play, whatever it is, makes the AI so much more capable, because the line does go up, right?
As you throw more compute, you get more performance.
The shirt is about scaling laws.
To some extent, it is diminishing returns, right?
You 10x the compute, you don't get 10x better model, right?
You get diminishing returns,
but also you get efficiency improvements
so you bend the curve, right?
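One hedged way to picture that "diminishing returns, but you bend the curve" point is a generic power-law loss curve in compute; the exponent and constants below are purely illustrative, not any particular lab's measured scaling law:

```python
# Hedged illustration of "10x the compute does not give a 10x better model":
# a generic power-law loss curve in compute. The exponent and constants are
# illustrative, not any particular lab's measured scaling law.
def loss(compute: float, c0: float = 1.0, alpha: float = 0.07, floor: float = 1.5) -> float:
    return floor + (c0 / compute) ** alpha

print(loss(1.0), loss(10.0), loss(100.0))
# Each 10x in compute only shaves roughly 15% off the reducible loss at alpha = 0.07.
# Algorithmic efficiency improvements effectively lower c0, which "bends the curve"
# without buying any more hardware.
```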
And these scale of data centers are wreaking a lot of havoc on the network, right?
And Nathan was mentioning, Amazon has tried to buy this nuclear power plant, Talen, and if you look at the Talen stock,
it's just like skyrocketing,
and they're building a massive multi-gigawatt data center
there, and you just go down the list.
There's so many ramifications.
Interesting thing is like certain regions of the US,
transmitting power costs more than actually generating it.
Because the grid is so slow to build,
and the demand for power and the ability to build power,
and re-ramping on a natural gas plant or even a coal plant
is easy enough to do,
but transmitting the power's really hard.
So in some parts of the US, in Virginia,
it costs more to transmit power than it costs to generate it.
Which is like, you know, there's all sorts of like
second order effects that are insane here.
Can the power grid support this kind of growth?
You know, Trump's executive orders,
there was a Biden executive order before the end of the year,
but then Trump had some more executive orders,
which hopefully reduced the regulations
to where yes, things can be built.
But yeah, this is a big, big challenge, right? Is building enough power fast enough? Are you going to basically have a
nuclear power plant next to a data center for each one of these? So the fun thing here is this is too
slow. To build the power plant? To build a power plant or to reconfigure an existing power plant
is too slow. And so therefore you must use, data center power consumption is flat, right?
I mean, it's by-
Which is why nuclear is also good for it.
Long-term nuclear is a very natural fit,
but you can't do solar or anything in the short term like that.
Because data center power is like this, right?
You're telling me I'm gonna buy tens of billions of dollars
of GPUs and idle them because the power's not being generated.
Power's cheap, right?
If you look at the cost of a cluster,
less than 20% of it is power, right?
Most of it is the capital cost and depreciation
of the GPUs, right?
And so it's like, well, screw it.
I'll just build natural gas plants.
This is what Meta's doing in Louisiana.
This is what OpenAI is doing in Texas
and all these different places.
They may not be doing it directly,
but they are partnered with someone. And so there is a couple of hopes. One is,
and Elon, what he's doing in Memphis is the extreme. They're not just using combined-cycle gas, which is super efficient. He's also just using single-cycle and mobile generators and stuff, which is less efficient. But there's also the flip side, which is solar power generation is like this, and wind is another like this, with a different correlation.
So if you stack both of those, plus you get a big chunk of batteries, plus you have a little bit of gas, it is possible to run it more green.
It's just the time scales for that is slow. So people are trying.
But Meta basically said, whatever, I don't care about my sustainability pledge. Or they'll buy a PPA, a power purchase agreement,
where there'll be a massive wind farm
or solar farm, like wherever.
And then they'll just pretend like those electrons
are being consumed by the data center.
But in reality, they're paying for the power here
and selling it to the grid and they're buying power here.
And then another thing is like,
Microsoft quit on some of their sustainability pledges,
right? Elon, what he did with Memphis is objectively somewhat dirty, but he's also doing it in an area where there's a bigger natural gas plant right next door, and a wastewater treatment plant and a garbage dump nearby, right? And he's obviously made the world a lot cleaner than that one data center is gonna hurt it, right?
So I think like it's fine to some extent and maybe AGI solves global warming and stuff,
whatever it is.
This is sort of the attitude that people at the labs have,
which is like, yeah, it's great, we'll just use gas.
Because the race is that important
and if we lose, that's way worse.
I should say that I got a chance to visit
the Memphis data center.
And it's kind of incredible.
I mean, I visited with Elon,
just the teams and the rate of innovation there is insane.
Because my sense is that, you know,
nobody's ever done anything of this scale,
and nobody has certainly ever done anything of this scale
at the rate that XAI is doing.
So they're like figuring out,
I mean, I sat in on all these meetings
where they're brainstorming.
It's like, it's insane.
It's exciting, because they're like,
they're trying to figure out what the bottlenecks are,
how to remove the bottlenecks,
how to make sure that, you know,
there's just so many really cool things
about putting together a data center,
because, you know, everything has to work.
It's the people that do like the sysadmin,
the machine learning, all that is the exciting thing, so on.
But really the people that run everything
are the folks that know the low level software
and hardware that runs everything,
the networking, all of that.
And so you have to make sure you have procedures that test everything.
I think they're using ethernet.
I don't know how they're doing the networking, but...
They're using Nvidia Spectrum X ethernet.
There's actually like, I think, yeah, the unsung heroes
are the cooling and electrical systems,
which are just like glossed over.
Yeah.
But I think one story that maybe exemplifies how insane this stuff is: when you're training, in the most simplistic terms, you're running through the model a bunch,
and then you're gonna exchange everything
and synchronize the weights, right?
So you'll do a step.
This is like a step in model training, right?
And every step, your loss goes down, hopefully,
and it doesn't always.
But in the simplest terms,
you'll be computing a lot,
and then you'll exchange.
The interesting thing is GPU power is most of it.
Networking power is some, but it's a lot less.
So while you're computing, your power for your GPUs is here.
But then when you're exchanging weights,
if you're not able to overlap communications
and compute perfectly, there may be a time period
where your GPUs are just idle, and you're exchanging weights,
and you're like, hey, the model's updating.
So you're exchanging the gradients,
you do the model update,
and then you start training again.
So the power goes up and down, right?
And it's super spiky.
And so funnily enough, right?
Like this, when you talk about the scale
of data center power, right?
You can blow stuff up so easily.
And so Meta actually accidentally upstreamed something into the open-source PyTorch code, where they added an operator, and I kid you not, whoever made this, I wanna hug the guy, because it's something like pytorch.powerplant_no_blowup equals zero or equals one.
And what it does is amazing, right?
When you're exchanging the weights,
the GPU will just compute fake numbers
so the power doesn't spike too much,
and so then the power plants don't blow up because the transient spikes like screw stuff up.
Well, that makes sense.
I mean, you have to do that kind of thing.
You have to make sure they're not idle.
Yeah.
And Elon's solution was like,
let me throw a bunch of Tesla megapacks
and a few other things, right?
Like everyone has different solutions,
but like Meta's at least was publicly and openly known,
which is just like set this operator.
And what this operator does is it just makes the GPUs keep computing so that the power doesn't spike.
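A conceptual sketch of the idea behind that kind of operator, not the actual Meta/PyTorch implementation: while gradients are being synchronized and the GPUs would otherwise go idle, burn cycles on throwaway matrix multiplies so the facility's power draw stays roughly flat instead of swinging every training step.

```python
# Conceptual sketch only, not the actual operator referenced above: keep GPU power
# draw flat during gradient synchronization by doing throwaway work instead of idling.
import torch
import torch.distributed as dist

def all_reduce_with_flat_power(grad: torch.Tensor, filler: torch.Tensor) -> torch.Tensor:
    # Kick off the gradient exchange asynchronously...
    work = dist.all_reduce(grad, async_op=True)
    # ...and burn cycles on a dummy matmul until the communication finishes, so the
    # cluster's power draw doesn't crater and then spike back up at every step.
    while not work.is_completed():
        filler = filler @ filler  # "fake numbers": the result is simply discarded
    return grad
```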
But that just tells you how much power you're working with.
I mean, it's insane.
It's insane.
People should just go to Google, like scale, like what does X watts do and
go through all the scales from one watt to a kilowatt to a megawatt.
And you look at and stare at that, and you realize how high on the list a gigawatt is, and it's mind-blowing.
Can you say something about the cooling? So I know Elon's using liquid cooling, I believe, in all cases.
That's a new thing, right? Most of them don't use liquid cooling. Is there something interesting to say about the cooling?
Yeah. Yeah, so air cooling has been the de facto standard. Throw a bunch of metal heat sinks, heat pipes, etc., and fans, right?
And that's been enough to cool it. People have been dabbling in
water cooling. Google's TPUs are water-cooled, right? So they've been doing
that for a few years, but with GPUs no one's ever done it, and no one's ever done
the scale of water cooling that Elon just did, right? Now, for NVIDIA's next generation, for the highest-end GPU, water cooling is mandatory.
You have to water cool it.
But Elon did it on this current generation and that required a lot of stuff, right?
If you look at some of the satellite photos and stuff of the Memphis facility,
there's all these external water chillers that are sitting there. It basically looks like a semi truck pod thing, what's it called, the container. But really those are water chillers, and he has like 90 of those water chillers just sitting outside, 90 different containers, right,
with water, you know, like chill the water,
bring it back to the data center
and then you distribute it to all the chips,
pull all the heat out and then send it back, right,
and this is both a way to cool the chips
but also an efficiency thing, all right,
and going back to that like sort of three vector thing,
there is memory bandwidth, flops, and interconnect.
The closer the chips are together,
the easier it is to do high speed interconnects.
And so this is also a reason why you're gonna go water cooling: because you can just put the chips right next to each other, and therefore get higher speed connectivity.
I gotta ask you, so in one of your recent posts, there's a section called cluster measuring contest. There's another word there, but I won't say it.
What, who's got the biggest now and who's gonna have the biggest?
Today, individual largest is Elon, right?
Elon's cluster.
Elon's cluster in Memphis, 200,000 GPUs, right?
Meta has like 128,000, OpenAI has 100,000 now.
Now to be clear, other companies have more GPUs than Elon,
they just don't have them in one place, right?
And for training, you want them tightly connected.
There's some techniques that people are researching and working on that let you train across multiple regions, but for the most part, you want them all in one area.
You can connect them with high-speed networking.
Elon today has 200,000 GPUs: 100,000 H100s and 100,000 H200s, right? Meta, OpenAI, and Amazon all have on the scale of 100,000, a little bit less.
But this year, people are building much more, right?
Anthropic and Amazon are building a cluster of 400,000 Trainium 2, which is Amazon's own chip, trying to get away from NVIDIA, right?
Meta and OpenAI have scales for hundreds of thousands, but by next year, you'll have 500,000 to 700,000 GPU clusters. And note, those GPUs are much higher power consumption than existing ones: Hopper is 700 watts, Blackwell goes to 1,200 watts, right?
So the power per chip is growing and the number of chips is growing, right?
Nuts.
You think Elon said he'll get to a million.
You think that's actually feasible?
I mean, I don't doubt Elon, right?
The filings that he has for the power plant
and the Tesla battery packs,
it's clear he has some crazy plans for Memphis.
Permits and stuff are open record, right?
But it's not quite clear what the plans and time scales are.
I just never doubt Elon, right?
He's gonna surprise us.
So what's the idea with these clusters?
If you have a million GPUs, what percentage
in let's say two, three years is used for training
and what percent pre-training and what percent is used
for the actual computation?
So these mega clusters make no sense for inference, right? You could route inference there and just not train, but most of the inference capacity is, you know, hey, I've got a 30 megawatt data center here, I've got 50 megawatts here, I've got a hundred here, whatever. I'll just throw inference in all of those, because the mega clusters, the multi-gigawatt data centers, I want to train there, because that's where all of my GPUs are co-located, where I can put them at a super high networking speed connected together.
Because that's what you need for training.
Now with pre-training, this is the old scale,
you would increase parameters, you'd increase data,
model gets better.
That doesn't apply anymore because there's not much more data
in the pre-training side.
Yes, there's video and audio and image that has not been fully taken advantage of,
so there's a lot more scaling, but a lot of people have taken transcripts of YouTube videos,
and that gets you a lot of the data. It doesn't get you all the learning value out of the video
and image data, but there's still scaling to be done on pre-training, but this post-training world
is where all the flops are going to be spent, right? The model is going to play with itself, it's going to self-play, it's going to do verifiable
tasks, it's going to do computer use in sandboxes, it might even do simulated robotics things,
right?
All of these things are going to be environments where compute is spent in quote unquote post-training.
But I think it's going to be good.
We're going to drop the post from post-training.
It's going to be pre-training and it's gonna be training, I think.
The return of the king.
At some point. Because for the bulk of the last few years, pre-training has dwarfed post-training.
Mm-hmm.
But with these verifiable methods, especially ones that scale really potentially infinitely,
like computer use and robotics, not just math and coding, where you can verify what's happening,
those infinitely verifiable tasks,
it seems you can spend as much compute as you want on them.
Especially with the context length increase.
Cause the end of pre-training is when you increase
the context length for these models.
And we've talked earlier in the conversation
about how the context length, when you have a long input,
is much easier to manage than output.
And a lot of these post-training and reasoning techniques
rely on a ton of sampling
and it's becoming increasingly long context.
So effectively, your compute efficiency goes down. I don't think flops is the standard for how you measure it, but with RL you have to do all these things where you move your weights around in a different way than in pre-training and just generation. It's going to become less efficient
and flops is gonna be less of a useful term.
And then as the infrastructure gets better,
it's probably gonna go back to flops.
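A rough sketch of why generation-heavy workloads are less efficient: prefill covers the whole prompt in one parallel pass, while each decoded token has to re-read the growing KV cache, so attention memory traffic grows roughly quadratically with output length. The token counts below are made up for illustration:

```python
# Rough sketch of why long sampled outputs hurt compute efficiency: prefill covers the
# whole prompt in one parallel pass, but every decoded token re-reads the growing KV
# cache, so attention memory traffic scales roughly with output length squared.
def attention_token_reads(input_len: int, output_len: int) -> tuple[int, int]:
    prefill = input_len                                     # one pass over the prompt
    decode = sum(input_len + t for t in range(output_len))  # one pass per generated token
    return prefill, decode

print(attention_token_reads(1_000, 100))     # short chat answer: ~105k token reads
print(attention_token_reads(1_000, 10_000))  # long reasoning trace: ~60M token reads
```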
So all of the things we've been talking about
is most likely going to be Nvidia, right?
Is there any competitors?
Google, I kind of ignored them.
I was like, huh?
What's the story with TPU?
Like what's the...
TPU is awesome, right?
It's great.
Google is, they're a bit more tepid on building data
centers for some reason.
They're building big data centers, don't get me wrong.
And they actually have the biggest cluster.
I was talking about Nvidia clusters.
They actually have the biggest cluster, period.
But the way they do it is very interesting, right?
They have two data center super regions.
In that the data center isn't physically, all of the GPUs aren't physically on one site, but they're 30 miles from each other.
Not GPUs, TPUs.
In Iowa and Nebraska, they have four data centers that are just right next to each other.
Why doesn't Google flex its cluster size?
Go to multi-data center training.
There's good images in there, so I'll show you what I mean. It's the SemiAnalysis multi-datacenter training article.
So this is an image of what a standard Google data
center looks like.
By the way, their data centers look very different
than anyone else's data centers.
What are we looking at here?
So these are, yeah, so if you see this image, right,
in the center there are these big rectangular boxes, right?
Those are where the actual chips are kept.
And then if you scroll down a little bit further,
you can see there's these water pipes,
there's these chiller cooling towers in the top,
and a bunch of diesel generators.
The diesel generators are backup power.
The data center itself look physically smaller
than the water chillers.
So the chips are actually easier to keep together,
but then cooling all the water for the water cooling
is very difficult, right?
So Google has a very advanced infrastructure
that no one else has for the TPU.
And what they do is they've stamped a bunch of these data
centers out in a few regions, right?
So if you go a little bit further down,
this is Microsoft's.
This is in Arizona.
This is where GPT-5 quote unquote will be trained.
You know.
If it doesn't exist already.
Yeah, if it doesn't exist already.
But each of these data centers, right,
I've shown a couple images of them,
they're like really closely co-located
in the same region, right, Nebraska, Iowa.
And then they also have a similar one
in Ohio complex, right?
And so these data centers are really close to each other.
And what they've done is they've connected them super high bandwidth with fiber.
And so these are just a bunch of data centers.
And the point here is that Google has a very advanced infrastructure,
very tightly connected in a small region.
So Elon will always have the biggest cluster fully connected, right?
Because it's all in one building, right?
And he's completely right on that, right?
Google has the biggest cluster, and by a significant margin, but you have to spread it over three sites, you have to go across multiple sites.
Why doesn't Google compete with Nvidia?
Why don't they sell TPUs?
I think there's a couple problems with it.
It's like one, TPU has been a form of allowing search
to be really freaking cheap and build models for that.
And so a big chunk of the TPU purchases, a big chunk of Google's purchases and usage, all of it is for internal workloads, whether it be search, now Gemini,
YouTube, all these different applications that they have, ads.
These are where
all their TPUs are being spent and that's what they're hyper focused on,
right? And so there's certain like aspects of the architecture that are
optimized for their use case that are not optimized elsewhere, right? One simple
one is like they've open-sourced a Gemma model and they called it Gemma 7B,
right? But then it's actually 8 billion parameters because the vocabulary is so
large. And the reason they made the vocabulary so large is because the TPU's matrix multiply unit is massive.
Because that's what they've like sort of optimized for.
And so they decided, oh, I'll just make the vocabulary large too, even though it makes
no sense to do so on such a small model, because that fits on their hardware.
So Gemma doesn't run as efficiently on a GPU as a Llama does, right?
But vice versa, Llama doesn't run as efficiently on a TPU as a Gemma does, right?
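A rough sketch of how the vocabulary alone moves the parameter count; the hidden size and vocabulary sizes below are approximate public figures used only for illustration:

```python
# Rough parameter count for the token embedding as the vocabulary grows.
# Hidden size and vocabulary sizes are approximate public figures, for illustration only.
hidden_dim = 3072        # roughly Gemma 7B's hidden size
vocab_llama = 32_000     # Llama-style vocabulary
vocab_gemma = 256_000    # Gemma-style vocabulary

def embedding_params(vocab_size: int, hidden: int) -> int:
    return vocab_size * hidden  # one row of `hidden` weights per vocabulary token

print(f"{embedding_params(vocab_llama, hidden_dim)/1e6:.0f}M")  # ~98M parameters
print(f"{embedding_params(vocab_gemma, hidden_dim)/1e6:.0f}M")  # ~787M parameters
# A swing of several hundred million parameters from the vocabulary alone is roughly
# the gap between a nominal "7B" and the ~8B actually shipped.
```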
And it's still like, there's like certain like aspects
of like hardware, software co-design.
So all their search models, their ranking and recommendation models, all these different models that are AI but not gen AI, right, have been hyper-optimized with TPUs forever.
The software stack is super optimized,
but all of this software stack has not been released
publicly at all, right? Very small portions of it, JAX and XLA, have been. But the experience when you're inside of Google and you're training on TPUs as a researcher, you don't need to know anything about the hardware in many cases, right? It's pretty beautiful, but as soon as you step outside... A lot of them go back. They leave Google and then they go back. Yeah. Yeah.
They're like they leave and they start a company because they have all these
amazing research ideas. They're like, wait, infrastructure is hard.
Software is hard. And this is on GPUs.
Or if they try to use TPUs same thing, cause they don't have access to all this
code. And so it's like,
how do you convince a company whose golden goose is search where they're making
hundreds of billions of dollars from, to start selling GPUs, or rather TPUs, which they used to only buy a couple billion of? You know, I think in 2023, they bought like a couple billion worth.
And now they're buying like 10 billion to $15 billion worth.
But how do you convince them that they should just buy like twice as many and figure out how to sell them and make $30 billion?
Like, who cares about making $30 billion?
Won't that 30 billion exceed actually the search profit eventually?
Oh, I mean, you're always going to make more money on services than hardware, always.
I mean, yeah, to be clear, today people are spending a lot more on hardware than they are on the services, right, because the hardware front-runs the service spend. But you're investing. If there's no revenue for AI stuff, or not enough revenue, then obviously it's going to blow up.
People won't continue to spend on GPUs forever.
NVIDIA is trying to move up the stack with software that they're trying to sell and license
and stuff.
Google has never had that DNA of this is a product we should sell.
Google Cloud, which is a separate organization from the TPU team, which is a separate organization
from the DeepMind team, which is a separate organization from the search team, right?
There's a lot of bureaucracy.
Wait, Google Cloud is a separate team than the TPU team?
Technically, TPU sits under infrastructure,
which sits under Google Cloud.
But Google Cloud, for renting stuff, and the TPU architecture have very different goals, right? In hardware and software, all of this, right?
Like the JAX and XLA teams do not serve Google's customers externally, whereas NVIDIA's various CUDA teams, for things like NCCL, serve external customers, right?
The internal teams like JAX and XLA and stuff, they more so serve DeepMind and Search, right?
And so their customer is different,
they're not building a product for them.
Do you understand why AWS keeps winning
versus Azure for cloud versus Google Cloud?
Yeah, Google Cloud is tiny, isn't it, relative to the others?
Google Cloud is third.
Yeah. Yeah.
Microsoft is the second biggest, but Amazon is the biggest, right?
And Microsoft deceptively sort of includes like Microsoft Office 365 and things like that,
like some of these enterprise wide licenses.
So in reality, the gulf is even larger.
Microsoft is still second though, right?
Amazon is way bigger.
Why? Because using AWS is better and easier.
And in many cases, it's cheaper.
And it's first.
It was first.
Yeah, but there's a lot of things that are first that-
Well, it's easier, it's harder to switch than it is to-
But AWS is their core-
Because it's large companies.
There's big fees for switching too.
AWS generates over 80% of Amazon's profit.
I think over 90%, right? Insane.
The distribution centers are just like, one day we'll decide to make money from this, but they haven't yet, right? They make a tiny little profit.
Yeah, one day Amazon Prime will triple in price.
You would think they would improve the AWS interface, because it's horrible. It's clunky, but everybody is...
I don't know, yeah.
One would think.
I think actually Google's interface is sometimes nice,
but it's also like they don't care about anyone
besides their top customers.
Exactly.
Their customer service sucks, and they have a lot less.
I mean, all these companies, they
optimize for the big customers.
Yeah, it's supposed to be for business.
Well, Amazon has always optimized for the small customer
too, though, right?
Obviously, they optimize a lot for the big customer.
But when they started, they just would
go to random Bay Area things and give out credits.
Or just put in your credit card and use us, right?
Back in the early days.
So the businesses have grown with them.
So why does Amazon, why is Snowflake all over Amazon?
Because Snowflake, in the beginning,
when Amazon didn't care about them,
was still using Amazon.
And then, of course, one day, Snowflake and Amazon has a super huge partnership.
This is the case, Amazon's user experience and quality is better.
Also, a lot of the silicon they've engineered makes them have a lower cost structure in
traditional cloud storage, CPU, networking, that kind of stuff.
In databases, I think four of Amazon's top five revenue products, margin products,
like gross profit products, are all database-related products like Redshift and all these things,
right? So Amazon has a very good silicon to user experience entire pipeline with AWS.
I think Google, their silicon teams, yeah, they have awesome silicon internally, TPU,
the YouTube chip, you know, some of these other chips that they've made. And the problem is they're not serving external customers, they're serving internal customers, right?
I mean, NVIDIA's entire culture is designed from the bottom up to do this.
There's this recent book, The Nvidia Way by Tae Kim, that details this: how they look for future opportunities and ready their CUDA software libraries to make it so that new applications of high-performance computing can very rapidly be evolved on CUDA and NVIDIA chips.
And that is entirely different than Google as a services business.
Yeah, I mean, NVIDIA, it should be said, is a truly special company. Like, I mean,
they, there's a whole, the culture, everything,
they're really optimized for that kind of thing.
Speaking of which, is there somebody
that can even challenge Nvidia hardware-wise?
Intel, AMD?
I really don't think so.
We went through a very long process of working with AMD
on training on their GPUs and friends and stuff,
and they're decent.
Their hardware is better in many ways
than in NVIDIAs.
The problem is their software is really bad.
And I think they're getting better, right?
They're getting better faster,
but they're just, the gulf is so large.
And like, they don't spend enough resources on it
or have it historically, right?
Maybe they're changing their tune now,
but for multiple months, we were submitting these bugs, right? Like us, SemiAnalysis, right?
Like, what the fuck? Why are we the ones submitting these bugs, right?
Because they only cared about their biggest customers,
and so they'd ship them a private image, blah, blah, blah,
and it's like, okay, but like I am just using PyTorch
and I wanna use the publicly available libraries.
You don't care about that, right?
So they're getting better,
but I think AMD is not possible.
Intel's obviously in dire straits right now
and needs to be saved somehow.
Very important for national security,
for American technology dominance.
Can you explain that obviously?
So why are they in dire straits?
Going back to earlier, only three companies can do leading-edge R&D, right?
TSMC in Hsinchu, Samsung in Pyeongtaek, and then Intel in Hillsboro.
Samsung's doing horribly, Intel's doing horribly.
We could be in a world where there's only one company that can do R&D, and that one
company already manufactures most of the chips.
They've been gaining market share anyways, but that's a critical thing, right?
So what happens to Taiwan means the rest of the world's semiconductor industry and therefore
tech relies on Taiwan, right?
And that's obviously precarious.
As far as like Intel, they've
been slowly steadily declining. They were on top of servers and PCs, but
now Apple's done the M1 and Nvidia's releasing a PC chip and Qualcomm's
releasing a PC chip. And in servers, hyperscalers are all making their own
ARM-based server chips. And Intel has no AI silicon like wins, right? They have
very small wins.
And they never got into mobile
because they said no to the iPhone.
And like all these things have compounded
and they've lost their process technology leadership, right?
They were ahead for 20 years
and now they're behind by at least a couple of years, right?
And they're trying to catch back up
and we'll see if like their 18A, 14A strategy works out
where they try and leapfrog TSMC.
But like, and Intel is just like
losing tons of money anyways, right?
And they just fired their CEO,
even though the CEO was the only person
who understood the company well, right?
We'll see.
He was not the best, but he was pretty good, relatively.
Technical guy.
Where does Intel make most of its money?
The CPUs, though.
PCs and data center CPUs, yeah.
But data center CPUs are all going cloud,
and Amazon, Microsoft, Google are making ARM-based CPUs.
And then PC side, AMD's gained market share,
Nvidia's launching a chip, that's not gonna be a success,
MediaTek and Qualcomm have launched chips,
Apple's doing well, right?
They could get squeezed a little bit in PC,
although PC generally, I imagine,
will just stick Intel mostly for Windows side.
Let's talk about the broad AI race.
Who do you think wins?
We talked about Google.
The leader, the default leader has been Google
because of their infrastructure advantage.
Well, like in the news, OpenAI is the leader.
They're leading in the narrative.
They have the best model.
They have the best model that people can use
and they're experts.
And they have the most AI revenue.
Yeah, OpenAI is winning. So who's making money on AI right now?
Is anyone making money? So accounting-profit-wise, Microsoft is making money, but they're spending a lot of capex, right? And that gets depreciated over years.
Meta's making tons of money with recommendation systems, which is AI, but not gen AI, right? Llama's losing money for sure, right?
I think anthropic and OpenAI are obviously not
making money because otherwise they wouldn't be raising money. They have to raise money to build
more. Although theoretically they are making money. You spent a few hundred million dollars
on GPT-4 and it's doing billions in revenue, so obviously it's making money. Although they
had to continue to research to get the compute efficiency wins, right? And move down the curve to get that 1200X that has been achieved for GPT-3. Maybe we're
only at a couple hundred X now, but with GPT-4 turbo and 4.0, and there'll be another one
probably cheaper than GPT-4o even, that comes out at some point.
And that research costs a lot of money.
Yep, exactly.
That's the thing that I guess is not talked about
with the cost, that when you're referring
to the cost of the model, it's not just the training
or the test runs, it's the actual research,
the manpower.
Yeah, to do things like reasoning, right?
Now that that exists, they're gonna scale it,
they're gonna do a lot of research still.
I think the, you know, people focus on the payback question,
but it's really easy to just be like, well,
GDP is humans and industrial capital, right? And if you can make intelligence cheap,
then you can grow a lot, right? That's the sort of dumb way to explain it. But that's sort of
what basically the investment thesis is. I think only Nvidia is actually making tons of money,
and other hardware vendors. The hyperscalers are all on paper making money,
but in reality, they're spending a lot more
on purchasing the GPUs, which you don't know
if they're still gonna make this much money
on each GPU in two years, right?
You don't know if all of a sudden,
OpenAI goes kapoof and now Microsoft has
hundreds of thousands of GPUs they were renting to OpenAI
that they paid for themselves
with their investment in them,
that no longer have a customer, right?
Like this is always a possibility.
I don't believe that, right?
I think OpenAI will keep raising money.
I think others will keep raising money
because the investments, the returns from it
are gonna be eventually huge once we have AGI.
So do you think multiple companies will get,
let's assume- I don't assume it's winner take all. OK, so it's not.
Let's not call it AGI.
Whatever, it's like a single day.
It's a gradual, super powerful AI, but it's a gradually increasing set of features that are useful.
A rapidly increasing set of features.
Rapidly increasing set of features.
So you're saying a lot of companies will be, it just seems absurd that all of these companies
are building gigantic data centers.
There are companies that will benefit from AI,
but not because they train the best model.
Like Meta has so many avenues to benefit from AI
and all of their services.
People are there, people spend time on Meta's platforms,
and it's a way to make more money per user per hour.
Yeah, it seems like Google X slash XAI slash Tesla,
important to say, and then Meta will benefit
not directly from the AI, like the LLMs,
but from the intelligence,
like the additional boost
of intelligence to the products they already sell.
So whether that's the recommendation system,
or for Elon who's been talking about Optimus, the robot,
potentially the intelligence of the robot.
And then you have personalized robots in the home,
that kind of thing.
He thinks it's a 10 plus trillion dollar business, which.
At some point maybe, not soon,
but who knows what robotics will use for us.
Let's do a TAM analysis, right?
Eight billion humans and let's get eight billion robots,
right, and let's pay them the average salary.
And yeah, there we go, 10 trillion.
More than 10 trillion.
Yeah, I mean, if there's robots everywhere,
why does it have to be just eight billion robots?
Yeah, yeah, of course, of course.
I'm going to have like one robot.
You're going to have like 20.
Yeah, I mean, I see a use case for that.
So yeah, so I guess the benefit would be in the products I sell,
which is why OpenAI is in a trickier position, because...
All of the value of OpenAI right now as a brand is in ChatGPT.
And for most users, there's not that much of a reason
that they need OpenAI to be spending billions and billions of dollars on the
next best model when they could just license Llama 5 for way cheaper.
So ChatGPT is an extremely valuable entity to them, but
they could make more money just off that.
The chat application is clearly like,
does not have tons of room to continue, right?
Like the standard chat, right?
Where you're just using it
for random questions and stuff, right?
The cost continues to collapse,
V3 is the latest one. It'll go down to the point where
it's gonna get supported by ads, right?
Like, you know, Meta already serves 405B,
probably loses the money, but at some point,
you know, they're going to get,
the models are gonna get so cheap
that they can just serve them for free
with ad-supported, right?
And that's what Google's gonna be able to do,
and that's obviously, they've got a bigger reach, right?
So chat is not gonna be the only use case.
It's like these reasoning, code, agents, computer use,
all this stuff is where OpenAI has to actually go
to make money in the future.
Otherwise, they're kaput.
But X, Google, and Meta have these other products.
So isn't it likely that OpenAI and Anthropic
disappear eventually?
Unless they're so good at models, which they are.
But it's such a cutting edge, I mean.
It depends on where you think AI capabilities are going.
You have to keep winning.
Yes.
You have to keep winning.
As you climb, even if the AI capabilities
are going super rapidly in the direction of AGI,
like there's still a boost for X in terms of data.
Google in terms of data, Meta in terms of data,
in terms of other products and the money
and like there's just huge amounts of money.
The whole idea is human data is kind of tapped out.
We don't care.
We all care about self-play, verifiable tasks.
Yeah, so self-play.
Think about AWS.
Which is an RNG problem.
AWS does not make a lot of money on each individual machine.
And the same can be said for the most powerful AI platform,
which is even though the calls to the API are so cheap,
there's still a lot of money to be made
by owning that platform.
And there's a lot of discussions
as it's the next compute layer.
You have to believe that,
and yeah, there's a lot of discussions that tokens
and tokenomics and LLM APIs are the next compute layer
or the next paradigm for the economy,
kind of like energy and oil was.
But there's also like, you have to sort of believe
that APIs and chat are not where AI is stuck, right?
It is actually just tasks and agents and robotics and computer use, and those are the areas where all the
value will be delivered, not the API, not the chat application.
Is it possible, I mean, it all just becomes a commodity and you have
the very thin wrapper,
like Perplexity? Just joking.
There are a lot of wrappers making a lot of money.
Yeah, but do you think it's possible that people would just even forget what OpenAI
and Anthropic are?
Because there'll be wrappers around the API and it just dynamically...
If model progress is not rapid, yeah, it's becoming a commodity, right?
DeepSeek V3 shows this, but also the GPT-3 cost chart
from earlier showed this, right?
Llama 3B is 1200X cheaper than GPT-3.
Any GPT-3, like anyone whose business model
was GPT-3 level capabilities is dead.
Anyone whose business model's GPT-4 level capabilities
is dead, right?
It is a common saying that the best businesses being made
now are ones that are predicated on models getting better. Right. Which would be like wrappers,
things that are riding the wave of the models. In the short term, the company that could make the
most money is the one that figures out what advertising targeting method works for language
model generations. We have the meta ads, which are hyper-targeted in feed, not within specific
pieces of content.
And we have search ads that are used by Google
and Amazon has been rising a lot on search.
But within a piece, within a return from ChatGPT,
it is not clear how you get a high quality placed ad
within the output.
And if you can do that with model costs coming down,
you can just get super high revenue.
Like that revenue is totally untapped
and it's not clear technically how it is done.
Yeah, that is, I mean, the sort of the AdSense innovation
that Google did.
Then one day you'll have an ad in GPT output
and that's gonna make like billions.
And it could be very subtle.
It could be in conversation.
Like we have voice mode now.
It could be some way of making it
so the voice introduces certain things.
It's much harder to measure and it takes imagination,
but yeah.
And it wouldn't be so, it wouldn't come off shady
so that you would receive public blowback,
that kind of thing.
So you have to do it loud enough to where it's clear
it's an ad and balance all of that.
So that's the open question they're trying to solve.
Anthropic and OpenAI, they need to.
They might not say that they're trying.
I don't think they care about that at all.
They don't care about it right now.
I think it's places like Perplexity
are experimenting on that more.
Oh, interesting.
Yeah, for sure.
Like Perplexity, Google, Meta care about this.
I think OpenAI and Anthropic are purely laser focused on-
AGI.
Yeah, agents and AGI.
And if I build AGI, I can make tons of money, right? Or I can pay for
everything, right? And this is, it's just predicated like back on the like export control thing,
right? If you think AGI is five, 10 years away or less, right? These labs think it's two,
three years away. Obviously, your actions are, if you assume they're rational actors, which they are mostly,
what you do in a two-year AGI versus five-year versus 10-year is very, very, very different.
Right?
Do you think agents are promising?
We have to talk about this.
This is like the excitement of the year. Agents are going to revolutionize everything. This is the generic
hype term a lot of business folks are using: AI agents are going to revolutionize everything.
Okay, so mostly the term agent is obviously overblown.
We've talked a lot about reinforcement learning as a way to train for verifiable outcomes
Agents should mean something that is open-ended and is solving a task independently on its own and able to adapt to uncertainty.
The term agent gets applied a lot to things like Apple Intelligence, which we still don't have after the last WWDC, which is orchestrating between apps.
And that tool use thing is something that language models can do really well.
Apple Intelligence, I suspect, will come eventually. It's a closed domain. It's your messages app integrating with your photos,
with AI in the background.
That will work.
That has been described as an agent
by a lot of software companies to get into the narrative.
The question is, what ways can we get language models
to generalize to new domains
and solve their own problems in real time. Maybe some
tiny amount of training when they're doing this with fine tuning themselves or in context learning,
which is the idea of storing information in a prompt and you can use learning algorithms to
update that and whether or not you believe that that is going to actually generalize to things like
me saying book my trip to go to Austin in two days. I have XYZ constraints
and actually trusting it. I think there's an HCI problem of coming back for information.
Well, what's your prediction there? Because my gut says we're very far away from that.
I think OpenAI's statement, I don't know if you've seen the five levels, right?
Chat is level one,
reasoning is level two, and then agents is level three. And I think there's a couple more levels,
but it's important to note, right? We were in chat for a couple of years, right? We just
theoretically got to reasoning. We'll be here for a year or two, right? And then agents,
but at the same time, people can try and approximate capabilities of the next level.
But the agents are doing things
autonomously, doing things for minutes at a time, hours at a time, et cetera. Reasoning is doing
things for tens of seconds at a time, and then coming back with an output that I still need to
verify and use and try to check out. And the biggest problem is, of course, it's the same thing with
manufacturing, right?
Like there's the whole six sigma thing, right?
Like, you know, how many nines do you get?
And then you compound the nines onto each other.
And it's like, if you multiply, you know,
by the number of steps that are six sigma,
you get to, you know, a yield or something, right?
So like in semiconductor manufacturing,
tens of thousands of steps,
99.99999% is not enough, right?
Because you multiply by that many times,
you actually end up with like 6% yield, right?
Or zero.
Or zero, yeah, or zero.
And this is the same thing with agents, right?
Like chaining tasks together each time.
LLMs, even the best LLMs,
on pretty good benchmarks in particular,
don't get 100%, right?
They get a little bit below that
because there's a lot of noise. And so how do you get to enough nines, right?
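To make the compounding-nines point concrete, here is a minimal sketch, with purely illustrative numbers rather than figures from the conversation:

```python
# Illustrative only: how per-step reliability compounds when steps are chained,
# whether they are fab process steps or agent actions.

def chained_success_rate(per_step_success: float, num_steps: int) -> float:
    """End-to-end success when every step must succeed independently."""
    return per_step_success ** num_steps

# Semiconductor-style example: even "four nines" per step collapses
# over tens of thousands of steps.
print(chained_success_rate(0.9999, 30_000))  # ~0.05, i.e. roughly 5% yield

# Agent-style example: a 98%-reliable action chained 50 times.
print(chained_success_rate(0.98, 50))        # ~0.36, i.e. ~36% task success
```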
This is the same thing with self-driving.
We can't have self-driving without it being
like super geo-fenced like Google's, right?
And even then they have a bunch of teleoperators
to make sure it doesn't get stuck, right?
But you can't do that because it doesn't have enough nines.
And self-driving has quite a lot of structure
because roads have rules, it's well-defined,
there's regulation.
When you're talking about computer use for the open web,
for example, or the open operating system,
like there's no, it's a mess.
So like the possibility, I'm always skeptical
of any system that is tasked with interacting with the human world,
with the open, messy human world.
If we can't get intelligence that's enough
to solve the human world on its own,
we can create infrastructure like the human operators
for Waymo over many years that enable certain workflows.
There is a company, I don't remember its name,
but that's literally their pitch.
Yeah, we're just gonna be the human operator
when agents fail, and you just call us and we fix it.
Yeah, same thing for teleoperation.
It's like an API call, and it's hilarious.
There's gonna be teleoperation markets
when we get humanoid robots, which is,
there's gonna be somebody around the world
that's happy to fix the fact
that it can't finish loading my dishwasher
when I'm unhappy with it, but that's just gonna be part
of the Tesla service package.
I'm just imagining like an AI agent talking to another AI agent. One company has an AI
agent that specializes in helping other AI agents.
But if you can make things that are good at one step, you can stack them together. So that's why
I'm willing, if it takes a long time, we're gonna build infrastructure that enables it.
You see the operator launch.
They have partnerships with certain websites,
with DoorDash, with OpenTable, with things like this.
Those partnerships are gonna let them climb really fast.
Their model's gonna get really good at those things.
It's gonna prove a concept that might be a network effect
where more companies wanna make it easier for AI.
Some companies will be like, no,
let's put blockers in place.
And this is a story of the internet we've seen. We see it now with training data for
language models where companies are like, no, you have to pay. Business working it out.
That said, I think airlines and hotels have a high incentive to make their sites
work really well, and they usually don't. Like if you look at how many clicks it takes
to order an airplane ticket, it's insane.
You actually can't call an American Airlines agent anymore.
They don't have a phone number.
I mean, it's horrible on many, on the interface front
and all, to imagine that agents will be able
to deal with that website, when I as a human struggle,
like I have an existential
crisis every time I try to book an airplane ticket that I don't, I think it's going to
be extremely difficult to build an AI agent that's robust in that way.
But think about it, like United has accepted the Starlink terms, which is they have to provide
Starlink for free and the users are going to love it.
What if one airline is like, we're gonna take a year and we're gonna make our website
have white text that works perfectly for the AIs.
Every time anyone asks an AI about a flight,
they buy whatever airline it is.
That's true.
Or they just like, here's an API
and it's only exposed to AI agents.
And if anyone queries it, the price is 10% higher
for any flight, but we'll let you see any of our flights
and you can just book any of them.
Here you go, agent. And then it's like, oh, and I made 10% higher price. Awesome.
Am I willing to pay that for like, hey, book me a flight to see Lex, right? And it's like, yeah,
whatever. I think computers and real world and the open world are really, really messy.
But if you start defining the problem in narrow regions, people are going to be able to create
very, very productive things and ratchet down cost massively.
Now crazy things like robotics in the home, those are going to be a lot harder to do just
like self-driving because there's just a billion different failure modes.
But agents that can like navigate
a certain set of websites and do certain sets of tasks
or like look at, you know, look at your, you know,
take a photo of your groceries, your fridge,
and or like upload your recipes
and then like it figures out what to order from, you know,
Amazon slash Whole Foods food delivery.
Like that's then that's gonna be like pretty quick
and easy to do, I think.
So it's gonna be a whole range of like business outcomes
and it's gonna be tons and tons of sort of optimism around how people
can just figure out ways to make money. To be clear, these sandboxes already exist in
research. There are people who have built clones of all the most popular websites of
Google, Amazon, blah, blah, blah, to make it so that there's, I mean, OpenAI probably
has them internally to train these things. It's the same as DeepMind's robotics team
for years has had clusters for robotics where you interact with robots fully remotely. They just have a lab
in London and you send tasks to it, arrange the blocks and you do this research. Obviously,
there's techs there that fix stuff, but we've turned these cranks of automation before.
You go from sandbox to progress and then you add one more domain at a time and generalize.
I think in the history of NLP and language processing, with instruction tuning and tasks for
language models, it used to be like one language model did one task.
And then in the instruction tuning literature, there's this point where you start adding
more and more tasks together, where it just starts to generalize to every task.
And we don't know where on this curve we are.
I think for reasoning with this RL and verifiable domains,
we're early, but we don't know where the point is
where you just start training on enough domains
and poof, like more domains just start working
and you've crossed the generalization barrier.
Well, what do you think about the programming context?
So software engineering, that's where I personally,
and I know a lot of people
interact with AI the most. There's a lot of fear and angst too from current CS students,
but that's also the area where probably the most AI revenue
and productivity gains have come, right? Whether it be copilots or cursor or what have you, right?
Or just standard ChatGPT, right? Like, I know very few programmers
who don't have ChatGPT, and actually many of them
have the $200 tier because that's what it's so good for.
I think that in that world, we already see it like,
SWE-bench, and if you've looked at the benchmark
made by some Stanford students,
I wouldn't say it's like really hard,
but I wouldn't say it's easy either.
I think like it takes someone who's been through
at least a few years of CS or a couple years of programming
to do SWE-bench well.
And the models went from 4% to 60% in a year, right?
And where are they gonna go to next year?
It's gonna be higher, probably won't be 100%
because again, the nines are really hard to get.
But we're gonna get to some point where that's,
and then we're going to need harder software engineering
benchmarks, and so on and so forth.
But the way that people think of it now is it can do code
completion easy.
It can do some function generation
and I have to review it.
Great.
But really, the software engineering agents, I think,
can be done faster sooner than any other agent
because it is a verifiable domain.
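As a concrete aside, a minimal sketch of what a verifiable check for code could look like; the command and repo layout are hypothetical placeholders, assuming a pytest-based project, not anyone's actual pipeline:

```python
# Sketch: treat "does the test suite pass?" as the verifiable signal for a
# software-engineering agent. Hypothetical setup, assumes pytest is installed.
import subprocess

def verify_patch(repo_dir: str) -> bool:
    """Return True if the test suite passes in repo_dir, False otherwise."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# An agent loop can propose a patch, apply it, call verify_patch(), and only
# keep changes that turn the suite green.
```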
You can always like unit test
or compile and there's many different regions of like, it can inspect the whole code base at once,
which no engineer really can, only the architects can really think about this stuff, the really
senior guys, and they can define stuff and then the agent can execute on it. So I think software
engineering costs are going to plummet like crazy. And one interesting aspect of that
is when software engineering costs are really low,
you get very different markets, right?
So in the US, you have all these platform SaaS companies,
right, Salesforce and so on and so forth, right?
In China, no one uses platform SaaS.
Everyone just builds their own stack
because software engineering is much cheaper in China,
partially because of the number of STEM graduates, etc. So it's generally just cheaper to
do. And so at the same time, code LLMs have been adopted much less in China because
the cost of an engineer there is much lower. But like what happens when every company can just
invent their own business logic like really cheaply and quickly? You stop using platform SaaS,
you start building custom tailored solutions,
you change them really quickly, now all of a sudden
your business is a little bit more efficient too potentially
because you're not dealing with the hell
that is like some random platform SaaS company stuff,
not working perfectly and having to adjust workflows
or random business automation cases
that aren't necessarily AI required,
it's just logic that needs to be built
that no one has built, right?
All of these things can go happen faster.
And so I think software, and then the other domain
is like industrial, chemical, mechanical engineers,
suck at coding, right?
Just generally, and like there are tools,
like semiconductor engineers, their tools are 20 years old.
All the tools run on XP, including
ASML lithography tools run on Windows XP, right?
It's like, you know, and like a lot of the analysis happens
in Excel, right?
Like, it's just like, guys, like you guys can move 20 years forward with all the data
you have and gathered and like do a lot better.
It's just you need the engineering skills for software engineering to be delivered to
the actual domain expert engineers.
So I think that's the area where I'm like super duper bullish on, generally,
AI creating value.
The big picture is that I don't think it's gonna be a cliff.
Like we talked about earlier,
a really good example of how growth changes
is when Meta added Stories.
So Snapchat was on an exponential,
they added Stories, it flatlined.
Software engineering has been up and to the right.
AI is gonna come in, it's probably just gonna be flat.
It's like, it's not like everyone's gonna lose their job.
It's hard because the supply corrects more slowly, so the amount of students is still growing
and that'll correct on a multi-year delay, but the amount of jobs will just turn and then maybe in 20,
40 years it'll be well down. But in the few years, there'll never be the snap moment where it's like
software engineers aren't useful. I think also the nature of what it means to be a programmer and what kind of jobs
programmers do changes.
Cause I think there needs to be a human in the loop of everything you've talked
about.
There's a really important human in that picture of like correcting the code.
Like
thinking larger than the context length.
Yep, and debugging also, like debugging by,
so reading the code, understanding the,
steering the system, like no, no, no,
you missed the point, adding more to the prompt,
kind of like, yes, adding the human.
Designing the perfect Google button.
Google's famous for having people design buttons
that are so perfect.
And it's like, how is AI gonna do that?
Like, they could give you all the ideas, perfect fine.
I mean, that's the thing, you can call it taste.
Humans have, one thing humans can do is figure out
what other humans enjoy better than AI systems.
That's where the preference, you're loading that in,
but ultimately humans are the greatest preference generator.
That's where the preference comes from.
And humans are actually very good at reading,
or like judging between two things versus,
this goes back to the core of what RLHF
and preference tuning is,
is that it's hard to generate a good answer
for a lot of problems,
but it's easy to see which one is better.
And that's how we're using humans for AI now,
is judging which one is better,
and that's what software engineering could look like.
The PR review, here's a few options.
What are the like, here's some potential pros and cons.
And they're gonna be judges.
I think the thing I would very much recommend
is people start, programmers start using AI
and embracing that role of the supervisor of the AI system
and like partner of the AI system
versus writing from scratch
or not learning coding at all and just generating stuff.
Cause I think there actually has to be a pretty high level
of expertise as a programmer to be able to manage
increasingly intelligent systems.
I think it's that and then becoming a domain expert
in something.
Sure, yeah.
Cause like seriously, if you go look at aerospace
or semiconductors or chemical engineering,
everyone is using really crappy platforms,
really old software.
The job of a data scientist is a joke in many cases.
In many cases, it's very real, but it's like,
bring what the forefront of human capabilities are
to your domain.
And even if the forefront is from the AI,
your domain, you're at the forefront, right?
So it's like, you have to be at the forefront of something
and then leverage the rising tide
that is AI for everything else.
Oh yeah, there's so many low hanging fruit everywhere
in terms of where software can help automate a thing
or digitize a thing.
And the legal system, that's why DOGE is exciting.
I mean, I got to hang out
with a bunch of the DOGE folks and they,
I mean, government is like so old school.
It's like begging for the modernization of software,
of organizing the data, all this kind of stuff.
I mean, in that case it's by design,
because bureaucracy creates and protects centers of power and so on, but software
breaks down those barriers so it hurts those that are holding on to power but
ultimately benefits humanity. So there's a bunch of domains of that kind. One thing
we didn't fully finish talking about is open source.
So first of all, congrats.
You released a new model.
Yeah.
Tulu.
I'll explain what a Tulu is.
A tulu is a hybrid camel, when you breed a dromedary with a Bactrian camel.
Back in the early days after ChatGPT, there was a big wave of models coming out, like
Alpaca, Vicuna, et cetera, that were all named after various mammalian species. So Tulu is,
the brand is multiple years old, which comes from that. And we've been playing at the frontiers of
post-training with open source code. And the first part of this release was in the fall, where we built on Llama's open weight models.
And then we add in our fully open code,
our fully open data.
There's a popular benchmark that is chatbot arena.
And that's generally the metric by which how these chat models
are evaluated.
And it's humans compare random models
from different organizations.
And if you looked at the leaderboard in November or December,
among the top 60 models from 10s to 20s of organizations,
none of them had open code or data for just post-training.
Among that, even fewer or none have pre-training data
and code available.
But post-training is much more accessible at this time.
It's still pretty cheap, and you can do it.
And the thing is, how high can we
push this number where people have access
to all the code and data? So that's kind of the motivation of the
project. We draw on lessons from Llama. NVIDIA had a Nemotron model where the
recipe for their post training was fairly open with some data and a paper
and it's putting all these together to try to create a recipe that people can
fine-tune models like GPT-4 to their domain. So to be clear, in the case of Tulu,
maybe you can talk about Olmo too,
but in the case of Tulu, you're taking Llama 3 405B.
Tulu has been a series of recipes for post-training,
so we've done multiple models over years.
And so you're open sourcing everything.
Yeah, if you start with an open weight base model,
the whole model technically isn't open source,
because you don't know what Llama put into it, which is why
we have the separate thing that we'll get to but it's just
getting parts of the pipeline where people can zoom in and
customize I know I hear from startups and businesses that
are like, okay, like I can take this post training and try to
apply it to my domain. We talked about verifiers a lot. We use
this idea, which is Reinforcement Learning with Verifiable Rewards, RLVR, kind of similar to RLHF. And we've applied it to math. And the model today,
which is like we applied it to the Llama 405B base model from last year. And we have our other stuff,
we have our instruction tuning and our preference tuning. But the math thing is interesting,
which is like it's easier to improve this math benchmark.
There's a benchmark, M-A-T-H, math, all capitals.
Tough naming, when the benchmark's name is the area
that you're evaluating.
We're researchers, we're not brand strategists.
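For a sense of what a verifiable reward looks like in this setting, here is a toy sketch, a deliberate simplification of real recipes: the reward is 1 if the extracted final answer matches the reference and 0 otherwise, with the answer format being an assumption for illustration:

```python
# Toy sketch in the spirit of RLVR: no learned preference model, just an
# exact-match check against a reference answer. Real recipes normalize
# answers much more carefully; this only shows the shape of the idea.
import re

def extract_final_answer(generation: str) -> str:
    """Assume the model ends with 'Answer: <value>' and grab the value."""
    match = re.search(r"Answer:\s*(.+)", generation)
    return match.group(1).strip() if match else ""

def verifiable_reward(generation: str, reference_answer: str) -> float:
    return 1.0 if extract_final_answer(generation) == reference_answer.strip() else 0.0

print(verifiable_reward("Work shown... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Work shown... Answer: 41", "42"))  # 0.0
```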
And this is something that the DeepSeek paper
talked about as well.
It was like at this bigger model,
it's easier to elicit powerful capabilities with this RL
training.
And then they distill it down from that big model to the small model.
And this model we released today, we saw the same thing at Ai2.
We don't have a ton of compute.
We can't train 405b models all the time.
So we just did a few runs and they tend to work.
And it's like, it just shows that there's a lot of room for people to play in these things
And they crushed Llama's actual release, right? Like, they're way better than it. Yeah, so our eval numbers,
I mean, we have extra months in this, but our eval numbers are much better than the Llama Instruct
model that they released. And you also said better than DeepSeek V3. Yeah, on our eval benchmark.
Mostly DeepSeek V3 is really similar. We have a safety benchmark
to understand if it will say harmful things and things like that. And that's what draws down
most of the way. It's still like... Is it like an amalgamation of multiple benchmarks, or what do
you mean? Yeah. So we have 10 evaluations. This is standard practice in post-training, is you choose
your evaluations you care about. In academics, in smaller labs, you'll have fewer evaluations.
In companies, you'll have really one domain
that you really care about.
In frontier labs, you'll have tens to twenties
to maybe even like a hundred evaluations
of specific things.
So we choose a representative suite of things
that look like chat, precise instruction following,
which is like respond only in emojis.
Like does the model follow weird things like that?
Math, code, and you create a suite like this.
So safety would be one of 10 in that type of suite
where you have like, what does the broader community
of AI care about?
And for example, in comparison to DeepSeek,
it would be something like our average of evals,
our model would be 80, including safety, and similar without,
and DeepSeek would be like 79% average score without safety,
and their safety score would bring it down.
Oh, so you beat them even ignoring safety.
Yeah, so this is something that internally it's like, I don't want to win only by like
how you shape the eval benchmark.
So if there's something that's like people may or may not care about safety in their
model, safety can come downstream.
Safety can be when you host the model for an API. Safety is addressed in a spectrum
of locations and applications.
So it's like, if you want to say that you have the best recipe,
you can't just gate it on these things that some people might
not want.
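To make the averaging point concrete, here is a tiny sketch of how including or excluding one eval moves a suite average; the scores are hypothetical, not the actual Tulu or DeepSeek numbers:

```python
# Hypothetical scores for a small eval suite; only the averaging logic matters.
scores = {"chat": 85, "precise_if": 78, "math": 74, "code": 80, "safety": 88}

def suite_average(scores: dict, exclude: frozenset = frozenset()) -> float:
    kept = [v for k, v in scores.items() if k not in exclude]
    return sum(kept) / len(kept)

print(suite_average(scores))                                 # 81.0 with safety included
print(suite_average(scores, exclude=frozenset({"safety"})))  # 79.25 without safety
```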
And this is where the rate of progress benefits us.
We can release a model later.
We have more time to learn new techniques,
like this RL technique.
We had started this in the fall.
It's now really popular.
There's reasoning models.
The next thing to do for open source post-training is to scale up verifiers,
to scale up data, to replicate some of DeepSeek's results. And it's awesome that we have a paper
to draw on and it makes it a lot easier. And that's the type of things that is going on among
academic and closed frontier research in AI. Since you're pushing open source, what do you
think is the future of it?
Do you think DeepSeek actually changes things, since it's open source or open weight, or is
pushing the open source movement into the open direction?
This goes very back to license discussion.
So DeepSeek R1 with a friendly license is a major reset.
So it's like the first time that we've had a really clear frontier model that is open
weights and with a commercially friendly license, with no restrictions on downstream use cases, synthetic data,
distillation, whatever. This has never been the case at all in the history of
AI in the last few years since ChatGPT. There have been models that are off the
frontier or models with weird licenses that you can't really use them.
So isn't Meta's license like pretty much permissive except for five companies?
And there's also, so this goes to like what open source AI is, which is there's
also use case restrictions in the llama license, which says you can't use it for
specific things. So if you come from an open source software background, you
would say that that is not an open source license. What kind of things are
those though? Like, are they like, at this point I can't pull them off the top of my head.
It used to be military use was one.
And they removed that for Scale.
It'll be like, like CSAM, like child abuse material.
Or like, that's the type of thing that is forbidden there.
But that's enough from an open source background
to say it's not an open source license.
And also the Llama license has this horrible thing
where you have to name your model Llama
if you build on the Llama models.
So it's like the branding thing.
So if a company uses Llama, technically the license says that they should say built
with Llama at the bottom of their application.
And from like a marketing perspective, that just, that just hurts.
Like I can, I could suck it up as a researcher.
I'm like, oh, it's fine.
Like it says Llama dash on all of our materials for this release.
But this is why we need truly open models, which is, we don't know DeepSeek R1's data. So you're saying I can't make a cheap copy of Llama and pretend
it's mine, but I can do this with the Chinese model. Yeah. Hell yeah. That's what I'm saying.
And that's why this whole open language models thing, the OLMo thing, is to try
to keep the model where everything is open with the data as close to the frontier as possible. So we're compute constrained, we're personnel constrained,
we rely on getting insights from people, like John Schulman tells us to do RL on outputs,
we can make these big jumps, but it just takes a long time to push the frontier of open source.
And fundamentally, I would say that that's because open source AI does not have the same
feedback loops as open source software.
We talked about open source software for security.
Also it's just because you build something once and you can reuse it.
If you go into a new company, there's so many benefits.
But if you open source a language model, you have this data sitting around, you have this
training code.
It's not like that easy for someone to come and build on and improve because you need
to spend a lot on compute.
You need to have expertise.
So until there are feedback loops of open source AI,
it seems like mostly an ideological mission.
Like people like Mark Zuckerberg,
which is like America needs this.
And I agree with him,
but in the time where the motivation ideologically is high,
we need to capitalize and build this ecosystem around
what benefits do you get
from seeing the language model data.
And there's not a lot about that.
We're going to try to launch a demo soon where you can look at an OLMo model and a query and
see what pre-training data is similar to it, which is like legally risky and complicated,
but it's like, what does it mean to see the data that the AI was trained on?
It's hard to parse.
It's terabytes of files.
It's like, I don't know what I'm gonna find in there.
But that's what we need to do as an ecosystem
if people want open source AI to be financially useful.
We didn't really talk about Stargate.
I would love to get your opinion on like
where the new administration, the Trump administration,
everything that's being done
from the America side in supporting AI infrastructure
and the efforts of the different AI companies.
What do you think about Stargate?
What are we supposed to think about Stargate?
And does Sam have the money?
Yeah, so I think Stargate is an opaque thing.
It definitely doesn't have $500 billion.
It doesn't even have $100 billion, right?
What they announced is this $500 billion number, Larry Ellison, Sam Altman, and Trump said it.
They thanked Trump and Trump did do some executive actions that do significantly improve the ability
for this to be built faster. One of the executive actions he did is, on federal land, you can just basically build data centers and power, you know, pretty much just like that.
And then the permitting process is basically gone or you file after the fact.
So like one of the, again, like I had a schizo take earlier, another schizo take, if you've ever been to the Presidio in San Francisco, beautiful area.
You could build a power plant and a data center there if you wanted to. Because it is federal land, it used to be a military base.
But obviously this would piss people off.
But it's a good bit, anyways.
Trump has made it much easier to do this, right, generally.
Texas has the only unregulated grid in the nation as well.
Let's go Texas.
And so therefore, like, ERCOT enables people
to build faster as well.
In addition, the federal regulations are coming down.
Stargate is predicated on this, and this is why that whole show happened.
Now, how they came up with a $500 billion number is beyond me.
How they came up with a $100 billion number makes sense to some extent.
There's actually a good table in here that I would like to show in that Stargate piece
that I had.
It's the most recent one.
So anyways, Stargate, it's basically a table about cost.
There, you passed it already.
It's that one.
So this table is kind of explaining what happens. So Stargate is in Abilene, Texas, the first
hundred billion dollars of it. That site is 2.2 gigawatts of power in, about 1.8 gigawatts
of power consumed. Per GPU, they have like roughly, Oracle is already building the first
part of this before Stargate came about. To
be clear, they've been building it for a year. They tried to rent it to Elon, in fact,
right? But Elon was like, it's too slow, I need it faster. So then he went and did his Memphis thing.
And so OpenAI was able to get it with this like weird joint venture called Stargate.
They initially signed a deal with just Oracle for the first section of this cluster, right?
This first section of this cluster is roughly
$5 billion to $6 billion of server spend. Then there's another billion or so of data center spend.
Then likewise, if you fill out that entire 1.8 gigawatts with the next two generations
of NVIDIA's chips, GB 200, GB 300, VR 200, and you fill it out completely, that ends up being roughly $50
billion of server cost, right?
Plus there's data center cost, plus maintenance cost, plus operation cost, plus all these
things.
And that's where OpenAI gets to their $100 billion announcement that they had, right?
Because they talked about $100 billion as phase one, that's this Abilene, Texas data
center, right? $100 billion of total cost of ownership, quote unquote, right? So it's not capex, it's not
investment, it's $100 billion of total cost of ownership. And then there will be future phases,
they're looking at other sites that are even bigger than this 2.2 gigawatts, by the way,
in Texas and elsewhere. And so they're not completely ignoring that, but there is the number of $100 billion that they say for phase one, which I do think will happen.
They don't even have the money for that. Furthermore, it's not $100 billion, it's $50 billion of spend, and then $50 billion of operational cost power, etc.
Rental pricing, etc. Because they're renting it, OpenAI is renting the GPUs from the Stargate joint venture.
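As a back-of-envelope sketch of that phase-one framing, using the rough figures mentioned above as approximations rather than audited numbers:

```python
# Rough phase-one framing from the conversation, treated as approximations.
server_capex_billion = 50        # filling ~1.8 GW with upcoming NVIDIA generations
ops_power_rental_billion = 50    # data center, power, maintenance, rental economics
phase_one_tco_billion = server_capex_billion + ops_power_rental_billion

openai_committed_capex_billion = 19  # OpenAI's stated obligation to the joint venture

print(f"Phase-one 'total cost of ownership': ~${phase_one_tco_billion}B")
print(f"OpenAI's committed capex into the JV: ~${openai_committed_capex_billion}B")
```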
Right? What money do they actually have? Right? SoftBank, SoftBank is going to invest,
Oracle is going to invest, OpenAI is going to invest. OpenAI is on the line for $19 billion.
Everyone knows that they've only got 6 billion in their last round and 4 billion in debt.
So, but there is news of SoftBank maybe investing 25 billion into OpenAI. Right? So that's part of
it. Right? So the $19 billion can come from
there. So OpenAI does not have the money at all, to be clear. The ink is not dried on anything. OpenAI
has $0 for this $50 billion, in which they're legally obligated to put $19 billion of CapEx
into the joint venture, and then the rest they're going to pay via renting the GPUs from the joint
venture. And then there's Oracle. Oracle has a lot of money. They're building the
first section completely. They were spending for themselves, right? This $6 billion of CapEx,
$10 billion of TCO. And they were going to do that first section. They're paying for that, right?
As far as the rest of the section, I don't know how much Larry wants to spend, right? At any point,
he could pull out, right? Like this is again, it's like completely voluntary. So at any point,
there's no signed ink on this, but he potentially could contribute tens of
billions of dollars to be clear.
He's got the money, Oracle's got the money.
And then there's like MGX, which is the UAE fund, which technically has $1.5 trillion
for investing in AI.
But again, I don't know how real that money is, and there is no ink signed for
this. SoftBank does not have
$25 billion of cash. They have to sell down their stake in ARM, which is the leader in CPUs
and they IPO'd it. This is obviously what they've always wanted to do. They just didn't know where
they'd redeploy the capital. Selling down the stake in ARM makes a ton of sense. So they can
sell that down and invest in this if they want to and invest in OpenAI if they want to. As far as money secured,
the first 100,000 GB 200 cluster can be funded.
Everything else after that.
Up in the air.
Is up in the air, money's coming.
I believe the money will come.
I personally do.
Just, it's a belief.
It's a belief that they are gonna release better models
and be able to raise more money.
Yeah. Right?
But the actual reality is that Elon's right.
The money does not exist.
What does the US government have to do with anything? What does Trump have to do with
everything? He's just a hype man?
Trump is reducing the regulation so they can build it faster, and he's allowing them to
do it. Any investment of this side is going to involve antitrust stuff. So obviously he's
going to allow them to do it. He's gonna enable the regulations to actually allow it to be
built. I don't believe there's any US government dollars being spent on this, though.
Yeah, so I think he's also just creating a general vibe that
regulation will go down and this is the era of building. So if you're a builder,
you want to create stuff, you want to launch stuff, this is the time to do it.
and so like we've had this 1.8 gigawatt data center in our data for over a year now,
and we've been like sort of sending it to all of our clients, including many of these companies
that are building the multi gigawatts. But that is like at a level that's not quite maybe executives
like seeing $500 billion, $100 billion, and then everyone's asking them like, so it could spur
like another like an even faster arms race, right? Cause there's already an arms race,
but like this, this like a hundred billion,
$500 billion number Trump talking about it on TV,
like it could spur the arm race to be even faster
and more investors to flood in and et cetera, et cetera.
So I think you're right, in that sense,
that OpenAI, or sort of Trump, is sort of
championing, people are going to build more,
and his actions are gonna let people build more.
What are you excited about these several years
that are upcoming in terms of cluster build outs,
in terms of breakthroughs in AI?
Like the best possible future you can imagine
in the next couple of years, two, three, four years,
what does that look like?
Just it could be very specific technical things
like breakthroughs on post-training,
or it could be just size, big, impressive clusters.
I really enjoy tracking supply chain
and like who's involved and what.
I really do, it's really fun to see like the numbers,
the cost, who's building what capacity,
helping them figure out
how much capacity they should build, winning deals,
strategic stuff, that's really cool.
I think technologically, there's a lot around
the networking side that really excites me
with optics and electronics, right?
Kind of getting closer and closer,
whether it be co-packaged optics
or some sort of new forms of switching.
This is internal to a cluster?
Yeah, also multi data center training, right?
Like, people are putting so much fiber
between these data centers and lighting it up
with so many different, you know, with so much bandwidth
that there's a lot of interesting stuff happening on that.
And right, telecom has been really boring since 5G.
And now it's like really exciting again.
Can you educate me a little bit about the speed of things?
So the speed of memory versus the speed of interconnect versus the speed of fiber
between data centers. Are these like orders of magnitude different? Can we
at some point converge towards a place where it all just feels like one
computer? No, I don't think that's possible. It's only gonna get
harder to program, not easier. It's only going to get more difficult and complicated
and more layers, right?
The general image that people like to have
is like this hierarchy of memory.
So on chip is really close, localized within the chip,
right, you have registers, right?
Those are shared between some compute elements.
And then you'll have caches,
which are shared between more compute elements.
Then you have like memory, right?
Like HBM or DRAM, like DDR memory or whatever it is,
and that's shared between the whole chip. And then you can have, you
know, pools of memory that are shared between many chips, right, and then storage
and you keep zoning out, right? The access latency across data centers,
across within the data center, within a chip, is different. So you're
obviously always gonna have different programming paradigms for
this. It's not gonna be easy. Programming
this stuff is going to be hard. Maybe AI can help with programming this. But the way to think about
it is that there is sort of like the more elements you add to a task, you don't get strong scaling.
If I double the number of chips, I don't get 2x the performance.
This is just like a reality of computing because there's inefficiencies. And there's a lot of interesting work being done to make it more linear,
whether it's making the chips more networked together more tightly or cool programming models or cool algorithmic things that you can do on the model side.
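A minimal sketch of the strong-scaling point, using a simple Amdahl-style model with made-up fractions, just to show why doubling chips rarely doubles throughput:

```python
# Amdahl-style illustration: if some fraction of the work (communication,
# synchronization) does not parallelize, adding chips gives diminishing returns.
# The 5% serial fraction here is made up for illustration.

def speedup(num_chips: int, serial_fraction: float = 0.05) -> float:
    """Speedup over one chip when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_chips)

for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(n), 2))  # 2 chips -> ~1.9x, 16 chips -> ~9.1x, not 16x
```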
DeepSeek did some of these really cool innovations
because they were limited on interconnect,
but they still needed to parallelize, right?
Like all sorts of, you know, everyone's always doing stuff.
Google's got a bunch of work
and everyone's got a bunch of work about this.
That stuff is super exciting on the model
and workload and innovation side, right?
Hardware, solid state transformers are interesting, right?
For the power side, there's all sorts of stuff on batteries and there's all sorts of stuff on, you know, I think when you look at,
if you look at every layer of the compute stack, right, whether it goes from lithography and etch
all the way to like fabrication to like optics to networking to power to transformers to cooling
to, you know, networking and you just go on up and up and up and up the stack, you know,
even air conditioners for data centers are innovating, right? Copper cables are innovating, right?
You wouldn't think it, but copper cables,
there's some innovations happening there
with the density of how you can pack them.
It's like all of these layers of the stack,
all the way up to the models,
human progress is at a pace that's never been seen before.
I'm just imagining you sitting back in a lair somewhere
with screens everywhere just monitoring the supply chain
where all these clusters,
like all the information you're gathering. I mean, you do incredible work.
There's a big team.
You do quite incredible work with SemiAnalysis. I mean, it's just,
uh,
keeping your finger on the pulse of human civilization in the digital world.
It's pretty cool. Like just to watch, feel that.
Yeah, thank you, I guess.
Feel all of us like doing shit, epic shit.
Feel the AGI.
I mean, from meme to like reality.
What about you, Nathan, are there breakthroughs
that you're looking forward to, potentially?
I had a while to think about this
while listening to Dylan's beautiful response.
He didn't listen to me at all. I knew this was coming, and it's like, realistically,
training models is very fun because there's so much low-hanging fruit and
the thing that makes my job entertaining, I train models, I write analysis about
what's happening with models and it's fun because there is obviously so much
more progress to be had.
And the real motivation why I do this somewhere where I can share things is that there's just, I don't trust people that are like, trust me, bro, we're going to make AI good.
It's like, we're the ones that it's like, we're going to do it and you can trust us and we're just going to have all the AI.
And it's just like, I would like a future where more people have a say in what AI is and can understand it.
And that's a little bit less fun that it's not a positive thing of like, this is just all really fun. Like training models is fun and bringing people in is fun,
but it's really like AI if it is going to be the most powerful technology of my lifetime.
It's like, we need to have a lot of people involved in making that.
And making it open helps with that.
As accessible as possible, as open as possible, yeah.
My read of the last few years is that more openness
would help the AI ecosystem in terms of having more people
understand what's going on, whether that's researchers
from non-AI fields to governments to everything.
It doesn't mean that openness will always be the answer.
I think then we will reassess, like,
what is the biggest problem facing AI
and tack on a different angle
to the wild ride that we're on.
And for me, just from even the user experience,
anytime you have the, like Karpathy said,
the aha moments, like the magic,
like seeing the reasoning, the chain of thought,
it's like there's something really just fundamentally
beautiful about that.
It's putting a mirror to ourselves and seeing like,
oh shit, it is solving intelligence as the cliche,
like goal of these companies is.
You get to understand why we humans are special.
The intelligence within us is special.
And for now also why we're special in terms of
we seem to be conscious and the A.I. systems for now
aren't and we get to explore that mystery.
So that's, it's just really cool to get to explore
these questions that I don't think,
I would have never imagined would be even possible.
Back when, so, just watching with excitement Deep Blue beat Kasparov.
Like, I wouldn't have ever thought this kind of AI would be possible in my lifetime.
It's like, this really feels like AI.
Yeah, it's incredible.
I started in AI learning to fly a Crazyflie quadrotor.
It's like, learn to fly, and it just, it learned to fly up, it would hit the ceiling,
and I'd stop it, I'd catch it.
It's like, okay, that is really stupid
compared to what's going on now.
And now you could probably, with natural language,
tell it to learn to fly and it's going to generate
the control algorithm required to do that.
Oh boy, there's low level blockers.
Like, we had to do some weird stuff for that,
but you can, you definitely can.
Back to our robotics conversation, yeah. When you have to interact some weird stuff for that, but you can. You definitely can. Back to our robotics conversation.
Yeah, when you have to interact in an actual
physical world, it's hard.
What gives you hope about the future of human civilization?
Looking into the next 10 years, 100 years, 1,000 years,
how long do you think we'll make it?
You think we've got 1,000 years?
Humans will definitely be around in 1,000 years.
I think there's ways that very bad things
could happen where there'll be way fewer humans, but humans are very good at surviving. There's been a
lot of things for which that is true. I don't think we're necessarily good at long-term
credit assignment of risk, but when the risk becomes immediate, we tend to figure things out.
Oh yeah. And for that reason, there's physical
constraints to things like AGI, like recursive improvement to kill us all type stuff. For
physical reasons and for how humans have figured things out before, I'm not too worried about it.
AI takeover. There are other international things that are worrying, but there's just fundamental human goodness
and trying to amplify that.
And like, we're at a tenuous time.
And I mean, if you look at humanity as a whole,
there's been times where things go backwards.
There's times when things don't happen at all
and we're on a,
what should be very positive trajectory right now.
Yeah, there seems to be progress,
but just like with power,
there's like spikes of human suffering.
And we wanna try to minimize the amount of spikes.
Generally, humanity is gonna suffer a lot less, right?
I'm very optimistic about that.
I do worry of like techno-fascism type stuff arising
as AI becomes more and more prevalent and powerful and those who control
it can do more and more. Maybe it doesn't kill us all, but at some point every very powerful human
is going to want a brain-computer interface so that they can interact with the AGI and all of its
advantages in many more ways and merge its mind with its capabilities or that person's capabilities
can leverage those much
better than anyone else and therefore be, it won't be one person rule them all, but it will be,
the thing I worry about is it'll be like few people, hundreds, thousands, tens of thousands,
maybe millions of people rule whoever's left, right? And the economy around it, right? And I
think that's like the thing that's probably more worrisome is human machine amalgamations.
This enables an individual human to have more impact on the world and that impact can be both
positive and negative. Generally, humans have positive impacts on the world, at least societally,
but it's possible for individual humans to have such negative impacts. And AGI, at least as I
think the labs define it, which is not a runaway sentient thing,
but rather just something that can do a lot of tasks really efficiently,
amplifies the capabilities of someone causing extreme damage.
But for the most part, I think it'll be used for, you know, profit-seeking motives,
which will then reduce, which will increase the abundance and supply of things and therefore reduce suffering, right?
That's the goal.
Scrolling on a timeline, just drowning in it.
Scrolling is stasis.
That is holding, scrolling holds the status quo of the world.
That is a positive outcome, right?
Like, if I have food tubes and I'm scrolling and I'm happy,
that's a positive outcome.
While expanding out into the cosmos.
Well, this is a fun time to be alive, and
thank you for pushing the forefront of what is possible in humans, and thank you for talking today. This was fun. Thanks for having us.
Thanks for listening to this conversation with Dylan Patel and Nathan Lambert. To support this podcast,
please check out our sponsors in the description.
And now, let me leave you with some words
from Richard Feynman.
For a successful technology,
reality must take precedence over public relations.
For nature cannot be fooled.
Thank you for listening.
I hope to see you next time.