Tech Brew Ride Home - Mon. 01/27 – Why DeepSeek Has Stunned Silicon Valley (And Wall Street)
Episode Date: January 27, 2025
It's one of those days where there's only one story. Maybe you saw that tech stocks got obliterated today. I'm here to tell you why. It's solely because of DeepSeek and Chinese AI tech generally. How this tech is making people think twice about the AI boom, what DeepSeek did that is different and how this could affect all of Silicon Valley.
Sponsors: TryJoyMode.com and use code RIDE
Links:
China's DeepSeek Tops iPhone Downloads and Spurs AI Selloff (Bloomberg)
The Short Case for Nvidia Stock (Jeffrey Emanuel)
DeepSeek R1's bold bet on reinforcement learning: How it outpaced OpenAI at 3% of the cost (VentureBeat)
DeepSeek resets the board (Axios)
17 Thoughts About the Big DeepSeek Selloff (Bloomberg)
Transcript
Welcome to the Techmeme Ride Home for Monday, January 27th, 2025. I'm Brian McCullough. Today, it's one of those days where there's only one story. Maybe you saw that tech stocks got obliterated today. I'm here to tell you why. It's solely because of DeepSeek and Chinese AI tech generally. How this tech is making people think twice about the AI boom, what DeepSeek did that is different, and how this could affect all of Silicon Valley. Here's what you missed today in the world of tech.
Here's why the stock market is having a bit of a crash this morning. Nvidia
down more than 8%. Meta and Microsoft both down, ASML down almost 10%, Japanese chip companies
falling, crypto also falling. It's all because of DeepSeek. We spoke about DeepSeek last month with
Simon Willison. To sum it up most succinctly, DeepSeek was apparently able to train an AI model at
3% of the cost of cutting-edge models from the likes of OpenAI. So why do you need to buy
100,000 H100s from Nvidia when maybe you only need 3,000? See, a lot of the eye-popping capex spending
from the likes of every tech player in the world was predicated on the idea of scale.
The only way to get smarter AI was to throw more and more compute at it,
which meant more and more GPUs and more data centers.
I mean, this was the whole premise behind the Stargate announcement.
But what this is making people think is, what if that is no longer true?
Then all of this spending could be pulled back all at once and thus crash.
But not only that, DeepSeek has jumped to the top of the App Store charts.
It's suddenly seeing rapid adoption in the AI community.
DeepSeek is Chinese tech, and not only that, it's open source tech, not proprietary.
If this cheaper tech, which is open source, is just as good, then that would mean that
the ginormous valuations for the likes of OpenAI and Anthropic and the rest might not be
warranted, suggesting a bubble would pop in VC funding.
While it remains to be seen if DeepSeek will prove to be a viable, cheaper alternative
in the long term, initial worries are centered on whether U.S. tech giants' pricing power
is being threatened, and if their massive AI spending needs reevaluation, said Jun Rong Yeap of
IG Asia. That a small and efficient AI model emerged from China, which has been subject to
escalating U.S. trade sanctions on advanced Nvidia chips, is also challenging the effectiveness of
such measures, end quote. Certainly U.S. tech players seem to be taking this seriously. Marc
Andreessen called DeepSeek, quote, one of the most amazing and impressive breakthroughs, and Meta
has reportedly set up four war rooms to analyze DeepSeek's tech, two focusing on how High-Flyer
cut training costs and one on what data High-Flyer might have used. But back to the stock market
fallout, quoting the Financial Times.
It's DeepSeek for sure, said one Tokyo-based fund manager of the selling on Monday, adding
that investors were rapidly assessing whether hardware spending on AI could ultimately be
a lot lower than current estimates.
AI investment by large-cap U.S. tech companies hit $224 billion last year, according to UBS,
which expects the total to reach $280 billion this year.
OpenAI and SoftBank announced last week a plan to invest $500 billion over the next four years
in AI infrastructure, end quote.
That's a ton of very stimulative spending in the economy that could, again, potentially dry up
if the status quo is upended.
Again, who or what is DeepSeek, the single AI model that is crashing the stock market
and roiling Silicon Valley?
DeepSeek is a Chinese AI lab that started as a deep learning research branch of Chinese
quant hedge fund High-Flyer.
They've released several different models, all of which seem to be just as capable
as the highest-end AI models produced by the recent flurry of Western AI startups.
Again, crucially, while all their models seem to be cutting edge, their costs in terms of money and compute needed to train their models are believed to be a fraction of what Western models cost. One model reportedly cost $6 million to train, as opposed to the hundreds of millions of dollars that have become table stakes for other AI tech.
Now, this has not been without controversy.
The assumption is that these Chinese models, along with others from the likes of ByteDance,
which have shown similar cost versus performance improvements,
were able to make this breakthrough because U.S.-led export controls over GPUs and other technology
may have spurred DeepSeek to innovate and release its models without the latest chips.
In other words, they engineered their way around the roadblocks put up to slow them down,
necessity being the mother of invention, or at least innovation around efficiency in this case.
Though some have also suggested they might have copied the
works of others. For example, DeepSeek V3 sometimes identifies itself as ChatGPT when asked which
model it is, leading some to speculate that its training datasets may contain text generated by
ChatGPT. There are also censorship concerns. DeepSeek's latest AI model R1 seems to stick to
Chinese government restrictions on sensitive topics like Tiananmen Square, Taiwan, and the treatment
of Uyghurs in China. But with DeepSeek apps topping the app stores, the suggestion is that
none of this may matter. The AI community could naturally gravitate toward using models that are
far cheaper to operate. Quoting VentureBeat. The implications for enterprise AI strategies are profound.
With reduced costs and open access, enterprises now have an alternative to costly proprietary
models like OpenAI's. DeepSeek's release could democratize access to cutting-edge AI capabilities,
enabling smaller organizations to compete effectively in the AI arms race, end quote.
Why is this having such an impact on people's assumptions? Let's use Nvidia as the prime
example of the potential implications here, quoting Jeffrey Emanuel. Perhaps most devastating is
DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately
1/45th the compute cost. This suggests the entire industry has been massively over-provisioning
compute resources. Combined with the emergence of more efficient inference architectures through
chain-of-thought models, the aggregate demand for compute could be significantly lower than
the current projections assume. The economics are compelling. When DeepSeek can match GPT-4-level
performance while charging 95% less for API calls, it suggests either Nvidia's customers are burning cash
unnecessarily or margins must come down dramatically. The fact that TSMC will manufacture competitive
chips for any well-funded customer puts a natural ceiling on Nvidia's architectural advantages.
But more fundamentally, history shows that markets eventually find a way around artificial
bottlenecks that generate supernormal profits. But how exactly did DeepSeek outpace OpenAI and others
at a fraction of the cost? First, open source, as we've been saying, but there are other details, too.
Quoting VentureBeat.
With the full release of R1 and the accompanying technical paper, the company revealed a surprising innovation.
A deliberate departure from the conventional supervised fine-tuning or SFT process widely used
in training large language models. SFT, a standard step in AI development, involves training
models on curated datasets to teach step-by-step reasoning, often referred to as chain of thought or
CoT. It is considered essential for improving reasoning capabilities. However, DeepSeek challenged this
assumption by skipping SFT entirely, opting instead to rely on reinforcement learning, RL, to train the model.
This bold move forced DeepSeek R1 to develop independent reasoning abilities, avoiding the
brittleness often introduced by prescriptive datasets. While some flaws emerged, leading the team
to reintroduce a limited amount of SFT during the final stages of building the model, the results
confirmed the fundamental breakthrough. Reinforcement learning alone could drive substantial performance
gains. Little is known about the company's exact approach, but it quickly open-sourced its models,
and it's extremely likely that the company built upon the open projects produced by Meta,
for example, the Llama model and ML library PyTorch. To train its models, High-Flyer Quant
secured over 10,000 Nvidia GPUs before U.S. export restrictions and reportedly expanded to 50,000
GPUs through alternative supply routes despite trade barriers. This pales compared to leading AI labs like
OpenAI, Google, and Anthropic, which operate with more than 500,000 GPUs each.
The journey to DeepSeek R1's final iteration began with an intermediate model, DeepSeek-R1-Zero,
which was trained using pure reinforcement learning. By relying solely on RL,
DeepSeek incentivized this model to think independently,
rewarding both correct answers and the logical processes used to arrive at them.
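To make the idea of rewarding both the answer and the visible reasoning a little more concrete, here is a minimal sketch in Python. It is an illustration only, not DeepSeek's reward code: the `<think>` and `<answer>` tag names, the 0.5 and 1.0 reward values, and the `score_response` helper are all assumptions, and the "process" part here only checks that reasoning is shown in the expected format, not that the reasoning is sound.

```python
import re

# Assumed response layout: reasoning in <think>...</think>, final answer in <answer>...</answer>.
RESPONSE_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

FORMAT_REWARD = 0.5    # hypothetical bonus for showing reasoning in the expected layout
ACCURACY_REWARD = 1.0  # hypothetical bonus for a correct final answer


def score_response(response: str, expected_answer: str) -> float:
    """Return a scalar reward for one model response (illustrative rule-based scoring)."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # no recognizable reasoning-plus-answer structure, so no reward
    reward = FORMAT_REWARD
    if match.group(2).strip() == expected_answer.strip():
        reward += ACCURACY_REWARD
    return reward
```

With these assumed values, `score_response("<think>2 + 2 is 4</think> <answer>4</answer>", "4")` returns 1.5, a well-formatted but wrong answer returns 0.5, and a bare "4" with no tags returns 0.0. A reinforcement learning loop would then nudge the model toward responses that score higher.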
This approach led to an unexpected phenomenon.
The model began allocating additional processing time to more complex problems,
demonstrating an ability to prioritize tasks based on their difficulty. DeepSeek's researchers
described this as an aha moment, where the model itself identified and articulated novel solutions
to challenging problems. This milestone underscored the power of reinforcement learning to
unlock advanced reasoning capabilities without relying on traditional training methods like
SFT, end quote. And more from Jeffrey Emanuel, quote,
DeepSeek has made profound advancements not just in model quality, but more importantly in model
training and inference efficiency. By being extremely close to the hardware and by layering together
a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible
models using GPUs in a dramatically more efficient way. How in the world could this be possible?
How could this little Chinese company completely upstage all the smartest minds at our leading
AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, etc.?
Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are
fairly technical, but we can at least describe them at a high level. It might have just turned out
that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them
more creative and clever, necessity being the mother of invention and all. A major innovation is their
sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers,
FP8, throughout the entire training process. Most Western labs train using full-precision 32-bit
numbers. This basically specifies the number of gradations possible in describing the output of an
artificial neuron. 8 bits in FP8 lets you store a much wider range of numbers than you might expect.
It's not just limited to 256 different equal-sized magnitudes like you'd get with regular integers,
but instead uses clever math to store both very small and very large numbers, though naturally
with less precision than you'd get with 32 bits. DeepSeek cracked this problem by developing a
clever system that breaks numbers into small tiles for activations and blocks for weights and strategically
uses high-precision calculations at key points in the network. Unlike other labs that train in
high-precision and then compress later, losing some quality in the process, DeepSeek's native
FP8 approach means they get the massive memory savings without compromising performance.
When you're training across thousands of GPUs, this dramatic reduction in memory requirements
per GPU translates into needing far fewer GPUs overall.
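The point about 8 bits covering a wide range, and about scaling small tiles separately, can be sketched in plain Python. This is a simulation under stated assumptions, not DeepSeek's training code: it assumes the common E4M3 layout for FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7), and the `quantize_tile` helper and its one-scale-per-tile scheme are a simplified stand-in for the tile and block scaling described above.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one byte as an FP8 E4M3 value (assumed layout: 1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # E4M3 uses this pattern for NaN and has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal values near zero
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Every finite non-negative E4M3 value, sorted (bytes 0x00 through 0x7E).
GRID = sorted({decode_e4m3(b) for b in range(0x7F)})


def quantize_tile(tile):
    """Round a small tile of floats to the FP8 grid using one shared scale, then scale back."""
    max_abs = max(abs(x) for x in tile)
    if max_abs == 0.0:
        return list(tile)
    scale = max_abs / GRID[-1]  # map the tile's largest magnitude onto the largest FP8 value
    restored = []
    for x in tile:
        target = abs(x) / scale
        nearest = min(GRID, key=lambda g: abs(g - target))
        restored.append((nearest if x >= 0 else -nearest) * scale)
    return restored
```

Here `GRID[1]` is 2**-9 (about 0.002) and `GRID[-1]` is 448.0, so the representable magnitudes span more than five orders of magnitude with only 127 distinct non-negative values. The spacing between neighbors grows with magnitude, which is why a per-tile scale helps: each tile gets to use the dense part of the grid for its own range of values.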
Another major breakthrough is their multi-token prediction system.
Most transformer-based LLM models do inference by predicting the next token, one token at a time.
DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single token prediction.
Their approach achieves about 85 to 90 percent accuracy on these additional token predictions,
which effectively doubles inference speed without sacrificing much quality.
The clever part is they maintain the complete causal chain of predictions,
so the model isn't just guessing; it's making structured contextual predictions.
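The step from "85 to 90 percent accuracy" to "effectively doubles inference speed" is simple arithmetic, sketched below. The `expected_tokens_per_step` helper is hypothetical, and the model behind it is an assumption: each extra predicted token is kept only if all earlier extra tokens were kept, every extra token has the same acceptance rate, and the cost of making and checking the extra predictions is ignored.

```python
def expected_tokens_per_step(acceptance: float, extra_tokens: int = 1) -> float:
    """Expected tokens produced per forward pass under the simplified acceptance model."""
    total = 1.0  # the ordinary next token is always produced
    chain = 1.0  # probability that every extra token so far was accepted
    for _ in range(extra_tokens):
        chain *= acceptance
        total += chain
    return total


for rate in (0.85, 0.90):
    print(f"acceptance {rate:.2f}: about {expected_tokens_per_step(rate):.2f} tokens per step")
```

With one extra token and an acceptance rate of 0.85 to 0.90, each step yields roughly 1.85 to 1.9 tokens instead of 1, which is where "about double" comes from.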
The brilliant part is this compression is built directly into how the model learns.
It's not some separate step they need to do.
It's built directly into the end-to-end training pipeline.
This means that the entire mechanism is differentiable and able to be trained directly using the standard optimizers.
All this stuff works because these models are ultimately finding much lower dimensional representations of the underlying data
than the so-called ambient dimensions.
So it's wasteful to store the full KV indices, even though that is basically what everyone else does.
Not only do you end up wasting tons of space by storing way more numbers than you need,
which gives a massive boost to the training memory footprint and efficiency,
again slashing the number of GPUs you need to train a world-class model,
but it can actually end up improving model quality because it can act like a regularizer,
forcing the model to pay attention to the truly important stuff,
instead of using the wasted capacity to fit to noise in the training data.
So not only do you save a ton of memory, but the model might even perform better.
At the very least, you don't get a massive hit to performance in exchange for the huge memory savings,
which is generally the kind of tradeoff you are faced with in AI training.
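To give a feel for the memory at stake, here is a back-of-the-envelope sketch in Python comparing a full key-value cache with a compressed one. Every number in it is hypothetical (32 layers, 32 attention heads of size 128, an 8,192-token context, 2 bytes per value, and a 512-value compressed vector per token per layer); none of these are DeepSeek's actual dimensions, and the sketch shows only the storage arithmetic, not how the compression is learned.

```python
def kv_cache_bytes(tokens: int, layers: int, values_per_token: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache `values_per_token` numbers for every token in every layer."""
    return tokens * layers * values_per_token * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
LAYERS, HEADS, HEAD_DIM, TOKENS = 32, 32, 128, 8192
COMPRESSED_DIM = 512  # assumed size of the compressed per-token vector

# Full cache: one key vector and one value vector per head.
full_bytes = kv_cache_bytes(TOKENS, LAYERS, 2 * HEADS * HEAD_DIM)
# Compressed cache: a single low-dimensional vector per token per layer.
compressed_bytes = kv_cache_bytes(TOKENS, LAYERS, COMPRESSED_DIM)

print(f"full cache:       {full_bytes / 1024**3:.2f} GiB")
print(f"compressed cache: {compressed_bytes / 1024**3:.2f} GiB")
print(f"reduction:        {full_bytes // compressed_bytes}x")
```

Under these made-up numbers the full cache is 4 GiB for a single sequence and the compressed one is 0.25 GiB, a 16x reduction. The real savings depend entirely on the actual model dimensions.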
Another very smart thing they did is to use what is known as a mixture of experts, or MoE, transformer architecture,
but with key innovations around load balancing.
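For readers who have not seen a mixture of experts before, here is a minimal sketch of the generic idea in Python: a router scores every expert for each token, only the top few experts run, and the per-expert token counts show whether the load is balanced. It does not implement DeepSeek's specific load-balancing innovations, which are not described here; the `route` and `expert_load` helpers and the top-2 choice are assumptions for illustration.

```python
import math


def route(token_scores, top_k=2):
    """Pick the top_k experts for one token and return {expert_index: mixing_weight}."""
    peak = max(token_scores)
    exps = [math.exp(s - peak) for s in token_scores]  # softmax, shifted for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}  # weights renormalized over the chosen experts


def expert_load(batch_scores, num_experts, top_k=2):
    """Count how many tokens each expert receives; balanced routing keeps these counts close."""
    load = [0] * num_experts
    for scores in batch_scores:
        for expert in route(scores, top_k):
            load[expert] += 1
    return load
```

For a token with router scores `[0.0, 2.0, 1.0, -1.0]`, `route` selects experts 1 and 2 and gives expert 1 the larger weight. Only those two experts' parameters are used for that token, which is why an MoE model can have a very large total parameter count while doing far less work per token.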
As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains.
A parameter is just a number that stores some attribute of the model, either the weight or importance a particular artificial neuron has relative to another one or the importance of a particular token depending on its context in the attention mechanism, etc.
Meta's latest Llama 3 model comes in a few sizes, for example, a 1 billion parameter version, the smallest, a 70 billion parameter model, and even a massive 405 billion parameter model.
This largest model is of limited utility for most users because you would need to have tens of thousands of dollars worth of GPUs in your computer just to run at tolerable speeds for inference, at least if you deployed it in the native full precision version.
Therefore, most of the real-world usage and excitement surrounding these open-source models is at the 8 billion parameter or highly quantized 70 billion parameter levels, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000.
So why does any of this matter? Well, in a sense, the parameter count and precision tells you something about how much raw information or data the model has stored internally. Note that I'm not talking about reasoning ability or the model's IQ, if you will. It turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, etc., end quote.
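The link between parameter count, precision, and what fits on a GPU is just multiplication, sketched below in Python. The `weight_memory_gb` helper is hypothetical and counts only the model weights, ignoring activations, the KV cache, and framework overhead, so real requirements are higher. The 24 GB figure is the memory on an Nvidia RTX 4090; the bit widths chosen for each model are assumptions for illustration.

```python
def weight_memory_gb(parameters: int, bits_per_parameter: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_parameter / 8 / 1e9


GPU_MEMORY_GB = 24  # memory on an Nvidia RTX 4090

examples = [
    ("405B at 32-bit", 405_000_000_000, 32),
    ("70B at 16-bit", 70_000_000_000, 16),
    ("70B at 2-bit (heavily quantized)", 70_000_000_000, 2),
    ("8B at 16-bit", 8_000_000_000, 16),
]

for label, params, bits in examples:
    size = weight_memory_gb(params, bits)
    verdict = "fits" if size <= GPU_MEMORY_GB else "does not fit"
    print(f"{label}: {size:.1f} GB, {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions the 405 billion parameter model needs about 1,620 GB for weights at 32-bit precision, while an 8 billion parameter model at 16-bit needs about 16 GB, which is why the smaller and heavily quantized models are the ones people run on a single consumer card.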
Okay, look, as I said, this whole day is about DeepSeek, and here's more of why, quoting Axios.
This could be an extinction-level event for venture capital firms that went all in on foundational model companies,
particularly if those companies haven't yet productized with wide distribution.
The quantums of capital are just so much more than anything VC has ever before disbursed
based on what might be a suddenly stale thesis.
If nanotech and Web3 were venture industry grenades, this could be a nuclear bomb.
Investors I spoke to over the weekend aren't panicking, but they're clearly concerned, particularly
that they could be taken so off guard. Don't be surprised if some deals in process get paused.
There's still a ton we don't know about DeepSeek, including if it really spent as little
money as it claims, and obviously there could be national security impediments for U.S.
companies or consumers, given what we've seen with TikTok. But bottom line, the game has changed,
end quote. And finally, let's end with Joe Weisenthal taking the contrarian view just a bit,
i.e., maybe if AI is a race down to becoming a commodity, that could be a good thing.
Quote, suddenly everyone is talking about Jevons paradox. This is usually discussed with
respect to energy markets. Basically, when you get more energy efficient, you don't use less
of the energy source. You just use your efficiency gains to do new things, and demand keeps booming.
This is certainly the hope if you're an NVIDIA or any company that builds underlying AI infrastructure,
that everyone will use the DeepSeek breakthroughs and just race even faster with no effect on total demand for compute.
We'll see. As I'm typing this, NVIDIA has opened down about 13%.
Certainly, investors aren't taking much comfort in Jevons paradox right now.
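Whether cheaper compute shrinks or grows total spending depends on how strongly demand responds to price, which a toy calculation in Python can show. This is a sketch under a strong assumption, a constant-elasticity demand curve, and the `total_spend` helper and the elasticity values are illustrative only; nobody knows the real elasticity of demand for AI compute.

```python
def total_spend(price: float, elasticity: float, base_price: float = 1.0, base_quantity: float = 1.0) -> float:
    """Total spending (price times quantity) under an assumed constant-elasticity demand curve."""
    quantity = base_quantity * (price / base_price) ** (-elasticity)
    return price * quantity


# Suppose the effective price of a unit of AI capability is cut in half.
for elasticity in (0.5, 1.0, 2.0):
    print(f"elasticity {elasticity}: spend goes from 1.00 to {total_spend(0.5, elasticity):.2f}")
```

With elasticity below 1, halving the price lowers total spending (the bearish case for chip demand). With elasticity above 1, spending rises even though each unit is cheaper, which is the Jevons outcome that infrastructure companies are hoping for.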
One of my favorite Tracy Alloway lines is that it's only a crisis when you can't throw money at the problem.
COVID was a crisis because money alone wasn't enough to address it.
The supply chain shocks were a crisis because money alone couldn't fix the problem.
There's no guarantee here that just throwing more money at U.S. tech companies will be enough to keep them competitive in AI, let alone chips, if it's perceived that they're falling behind.
Human capital, talent, takes years and years to develop.
Getting the incentives right is not something where you can snap your fingers overnight and make things happen.
These are big, slow-moving things, end quote.
Nothing more for you today. Talk to you tomorrow.
