Tech Brew Ride Home - Mon. 01/27 – Why DeepSeek Has Stunned Silicon Valley (And Wall Street)

Starting point is 00:00:00 On April 4th, 2023, around 2 in the morning, a man was found stabbed multiple times on a sidewalk in downtown San Francisco. Hey, who did this to you? What happened next turned the story into a political firestorm. Reports have identified the victim as Bob Lee, the founder of Cash App. From Bloomberg Podcasts, this is Foundering, the Killing of Bob Lee, beginning April 16. Welcome to the Techmeme Ride Home for Monday, January 27th, 2025. I'm Brian McCullough. Today, it's one of those days where there's only one story. Maybe you saw that tech stocks got obliterated today. I'm here to tell you why. It's solely because of DeepSeek and Chinese AI tech generally. How this tech is making people think twice about the AI boom, what DeepSeek did that is different, and how this could affect all of Silicon Valley. Here's what you missed today in the world of tech. Here's why the stock market is having a bit of a crash this morning. Nvidia down more than 8%. Meta and Microsoft both down, ASML down almost 10%, Japanese chip companies falling, crypto also falling. It's all because of DeepSeek. We spoke about DeepSeek last month with

Starting point is 00:01:18 Simon Willison. To sum it up most succinctly, DeepSeek was apparently able to train an AI model at 3% of the cost of cutting-edge models from the likes of OpenAI. So why do you need to buy 100,000 H100s from Nvidia when maybe you only need 3,000? See, a lot of the eye-popping capex spending from the likes of every tech player in the world was predicated on the idea of scale. The only way to get smarter AI was to throw more and more compute at it, which meant more and more GPUs and more data centers. I mean, this was the whole premise behind the Stargate announcement. But what this is making people think is: what if that is no longer true?

Starting point is 00:01:52 Then all of this spending could be pulled back all at once and thus crash. But not only that, DeepSeek has jumped to the top of the App Store charts. It's suddenly seeing rapid adoption in the AI community. DeepSeek is Chinese tech, and not only that, it's open source tech, not proprietary. If this cheaper tech, which is open source, is just as good, then that would mean that the ginormous valuations for the likes of OpenAI and Anthropic and the rest might not be warranted, suggesting a bubble would pop in VC funding. Quote, while it remains to be seen if DeepSeek will prove to be a viable, cheaper alternative

Starting point is 00:02:24 in the long term, initial worries are centered on whether U.S. tech giants' pricing power is being threatened, and if their massive AI spending needs reevaluation, said Jun Rong Yeap of IG Asia. That a small and efficient AI model emerged from China, which has been subject to escalating U.S. trade sanctions on advanced Nvidia chips, is also challenging the effectiveness of such measures, end quote. Certainly U.S. tech players seem to be taking this seriously. Marc Andreessen called DeepSeek, quote, one of the most amazing and impressive breakthroughs, and Meta has reportedly set up four war rooms to analyze DeepSeek's tech, two focusing on how High-Flyer cut training costs and one on what data High-Flyer might have used. But back to the stock market

Starting point is 00:03:01 fallout, quoting the Financial Times. It's DeepSeek for sure, said one Tokyo-based fund manager of the selling on Monday, adding that investors were rapidly assessing whether hardware spending on AI could ultimately be a lot lower than current estimates. AI investment by large-cap U.S. tech companies hit $224 billion last year, according to UBS, which expects the total to reach $280 billion this year. OpenAI and SoftBank announced last week a plan to invest $500 billion over the next four years in AI infrastructure, end quote.

Starting point is 00:03:30 That's a ton of very stimulative spending in the economy that could, again, potentially dry up if the status quo is upended. Again, who or what is DeepSeek, the single AI model that is crashing the stock market and roiling Silicon Valley? DeepSeek is a Chinese AI lab that started as a deep learning research branch of Chinese quant hedge fund High-Flyer. They've released several different models, all of which seem to be just as capable as the highest-end AI models produced by the recent flurry of Western AI startups.

Starting point is 00:04:06 Again, crucially, while all their models seem to be cutting edge, their costs in terms of money and compute needed to train their models are believed to be a fraction of what Western models cost. One model reportedly cost $6 million to train, as opposed to the hundreds of millions of dollars that have become table stakes for other AI tech. Now, this has not been without controversy. The assumption is that these Chinese models, along with others from the likes of ByteDance, which have shown similar cost-versus-performance improvements, were able to make this breakthrough because U.S.-led export controls over GPUs and other technology may have spurred DeepSeek to innovate and release its models without the latest chips. In other words, they engineered their way around the roadblocks put up to slow them down, necessity being the mother of invention, or at least innovation around efficiency in this case.

Starting point is 00:04:50 Though some have also suggested they might have copied the works of others. For example, DeepSeek V3 sometimes identifies itself as ChatGPT when asked which model it is, leading some to speculate that its training datasets may contain text generated by ChatGPT. There are also censorship concerns. DeepSeek's latest AI model, R1, seems to stick to Chinese government restrictions on sensitive topics like Tiananmen Square, Taiwan, and the treatment of Uyghurs in China. But with DeepSeek apps topping the app stores, the suggestion is that none of this may matter. The AI community could naturally gravitate toward using models that are far cheaper to operate. Quoting VentureBeat. The implications for enterprise AI strategies are profound.

Starting point is 00:05:28 With reduced costs and open access, enterprises now have an alternative to costly proprietary models like OpenAI's. DeepSeek's release could democratize access to cutting-edge AI capabilities, enabling smaller organizations to compete effectively in the AI arms race, end quote. Why is this having such an impact on people's assumptions? Let's use Nvidia as the prime example of the potential implications here, quoting Jeffrey Emanuel. Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through

Starting point is 00:06:06 chain of thought models, the aggregate demand for compute could be significantly lower than the current projections assume. The economics are compelling. When DeepSeek can match GPT-4-level performance while charging 95% less for API calls, it suggests either Nvidia's customers are burning cash unnecessarily or margins must come down dramatically. The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on NVIDIA's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate supernormal profits. But how exactly did DeepSeek outpace OpenAI and others at a fraction of the cost? First, open source, as we've been saying, but there are other details, too.

Starting point is 00:06:52 Quoting VentureBeat. With the full release of R1 and the accompanying technical paper, the company revealed a surprising innovation: a deliberate departure from the conventional supervised fine-tuning, or SFT, process widely used in training large language models. SFT, a standard step in AI development, involves training models on curated datasets to teach step-by-step reasoning, often referred to as chain of thought, or CoT. It is considered essential for improving reasoning capabilities. However, DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning, or RL, to train the model. This bold move forced DeepSeek R1 to develop independent reasoning abilities, avoiding the

Starting point is 00:07:36 brittleness often introduced by prescriptive datasets. While some flaws emerged, leading the team to reintroduce a limited amount of SFT during the final stages of building the model, the results confirmed the fundamental breakthrough: reinforcement learning alone could drive substantial performance gains. Little is known about the company's exact approach, but it quickly open-sourced its models, and it's extremely likely that the company built upon the open projects produced by Meta, for example, the Llama model and the ML library PyTorch. To train its models, High-Flyer Quant secured over 10,000 Nvidia GPUs before U.S. export restrictions and reportedly expanded to 50,000 GPUs through alternative supply routes despite trade barriers. This pales compared to leading AI labs like

Starting point is 00:08:19 OpenAI, Google, and Anthropic, which operate with more than 500,000 GPUs each. The journey to DeepSeek R1's final iteration began with an intermediate model, DeepSeek-R1-Zero, which was trained using pure reinforcement learning. By relying solely on RL, DeepSeek incentivized this model to think independently, rewarding both correct answers and the logical processes used to arrive at them. This approach led to an unexpected phenomenon: the model began allocating additional processing time to more complex problems, demonstrating an ability to prioritize tasks based on their difficulty. DeepSeek's researchers

Starting point is 00:08:52 described this as an aha moment, where the model itself identified and articulated novel solutions to challenging problems. This milestone underscored the power of reinforcement learning to unlock advanced reasoning capabilities without relying on traditional training methods like SFT, end quote. And more from Jeffrey Emanuel, quote, DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. How in the world could this be possible?

Starting point is 00:09:29 How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, etc.? Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It might have just turned out that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them more creative and clever, necessity being the mother of invention and all. A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers, or FP8, throughout the entire training process. Most Western labs train using full-precision 32-bit

Starting point is 00:10:10 numbers. This basically specifies the number of gradations possible in describing the output of an artificial neuron. 8 bits in FP8 lets you store a much wider range of numbers than you might expect. It's not just limited to 256 different equal-sized magnitudes like you'd get with regular integers, but instead uses clever math to store both very small and very large numbers, though naturally with less precision than you'd get with 32 bits. DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later, losing some quality in the process, DeepSeek's native

Starting point is 00:10:53 FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall. Another major breakthrough is their multi-token prediction system. Most transformer-based LLM models do inference by predicting the next token, one token at a time. DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single token prediction. Their approach achieves about 85 to 90 percent accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality.
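Stepping back to the FP8 point a few sentences earlier: the claim that 8 bits cover a much wider range than 256 equal-sized steps can be checked by decoding every bit pattern by hand. The sketch below assumes the common E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities); the episode does not say which FP8 variant DeepSeek used, and `decode_e4m3` is a hypothetical helper written for this illustration, not a library function.

```python
# Decode all 256 bit patterns of an 8-bit float in the E4M3 layout.
# Assumption: E4M3 with exponent bias 7 and no infinities, where only
# exponent=1111 with mantissa=111 encodes NaN.

def decode_e4m3(byte: int) -> float:
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")
    if exponent == 0:
        # Subnormal numbers: no implicit leading 1, fixed exponent of -6.
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal numbers: implicit leading 1, biased exponent.
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

values = [decode_e4m3(b) for b in range(256)]
# v == v filters out NaN, which is never equal to itself.
finite_positive = sorted(v for v in values if v == v and v > 0)

smallest = finite_positive[0]
largest = finite_positive[-1]
gap_near_zero = finite_positive[1] - finite_positive[0]
gap_near_max = finite_positive[-1] - finite_positive[-2]

print("distinct positive values:", len(finite_positive))
print("smallest positive:", smallest)
print("largest finite:", largest)
print("largest / smallest:", largest / smallest)
print("step size near zero:", gap_near_zero)
print("step size near max:", gap_near_max)
```

The step between neighbouring values grows with magnitude, which is the trade the passage describes: wide range, coarse precision. That coarseness is why per-tile and per-block scaling factors matter in a scheme like the one described.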

Starting point is 00:11:30 The clever part is they maintain the complete causal chain of predictions, so the model isn't just guessing; it's making structured, contextual predictions. The brilliant part is this compression is built directly into how the model learns. It's not some separate step they need to do. It's built directly into the end-to-end training pipeline. This means that the entire mechanism is differentiable and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower dimensional representations of the underlying data than the so-called ambient dimensions.

Starting point is 00:12:01 So it's wasteful to store the full KV indices, even though that is basically what everyone else does. Not only do you end up wasting tons of space by storing way more numbers than you need, which gives a massive boost to the training memory footprint and efficiency (again, slashing the number of GPUs you need to train a world-class model), but the compression can actually end up improving model quality because it can act like a regularizer, forcing the model to pay attention to the truly important stuff, instead of using the wasted capacity to fit to noise in the training data. So not only do you save a ton of memory, but the model might even perform better.
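To put rough numbers on the memory argument above, here is a minimal sketch comparing a conventional key/value cache with one that stores a single low-dimensional latent vector per token per layer. Every dimension below is hypothetical and chosen only to make the arithmetic easy to follow; the episode gives no actual sizes and does not detail the compression scheme, so treat the shape of the calculation, not the figures, as the point.

```python
# Compare per-sequence cache size: full keys and values vs. a compressed latent.
# All sizes are hypothetical illustration values, not DeepSeek's real configuration.

def cache_bytes(tokens: int, layers: int, values_per_token: int,
                bytes_per_value: int = 2) -> int:
    """Bytes needed to cache `values_per_token` numbers per token in each layer."""
    return tokens * layers * values_per_token * bytes_per_value

layers = 60          # hypothetical number of transformer layers
heads = 128          # hypothetical number of attention heads
head_dim = 128       # hypothetical dimension per head
latent_dim = 512     # hypothetical size of the compressed per-token vector
tokens = 32_000      # hypothetical context length

# Conventional cache: one key and one value vector per head, per token, per layer.
full = cache_bytes(tokens, layers, 2 * heads * head_dim)
# Compressed cache: one shared latent vector per token, per layer.
compressed = cache_bytes(tokens, layers, latent_dim)

print(f"full cache:       {full / 1e9:.1f} GB")
print(f"compressed cache: {compressed / 1e9:.2f} GB")
print(f"reduction factor: {full // compressed}x")
```

Under these made-up sizes the compressed cache is 64 times smaller. The real ratio depends entirely on the actual head count, head dimension, and latent size.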

Starting point is 00:12:34 At the very least, you don't get a massive hit to performance in exchange for the huge memory savings, which is generally the kind of tradeoff you are faced with in AI training. Another very smart thing they did is to use what is known as a mixture of experts, or MoE, transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model, either the weight or importance a particular artificial neuron has relative to another one or the importance of a particular token depending on its context in the attention mechanism, etc. Meta's latest Llama 3 model comes in a few sizes, for example, a 1 billion parameter version, the smallest, a 70 billion parameter model, and even a massive 405 billion parameter model. This largest model is of limited utility for most users because you would need to have tens of thousands of dollars worth of GPUs in your computer just to run at tolerable speeds for inference, at least if you deployed it in the native full-precision version.
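The link between parameter count, precision, and hardware cost in the passage above comes down to simple arithmetic: weights need roughly parameters times bits per parameter, divided by 8, bytes of memory. The sketch below applies that to the three model sizes mentioned. It counts weights only (activations and caches add more), uses decimal gigabytes, and `weight_memory_gb` is a hypothetical helper name.

```python
# Rough memory needed just to hold a model's weights at different precisions.
# Counts weights only; activations and caches are not included.

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

sizes = {
    "1B": 1_000_000_000,
    "70B": 70_000_000_000,
    "405B": 405_000_000_000,
}

for name, params in sizes.items():
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params, bits):,.1f} GB"
        for bits in (32, 16, 8, 4)
    )
    print(f"{name:>4}  {row}")
```

Halving the bits per parameter halves the memory, which is why quantized versions of a large model can run on far less hardware than the full-precision release.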

Starting point is 00:13:35 Therefore, most of the real-world usage and excitement surrounding these open-source models is at the 8 billion parameter or highly quantized 70 billion parameter levels, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000. So why does any of this matter? Well, in a sense, the parameter count and precision tells you something about how much raw information or data the model has stored internally. Note that I'm not talking about reasoning ability or the model's IQ, if you will. It turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, etc., end quote. Okay, look, as I said, this whole day is about DeepSeek, and here's more of why, quoting Axios. This could be an extinction-level event for venture capital firms that went all in on foundational model companies, particularly if those companies haven't yet productized with wide distribution. The quantums of capital are just so much more than anything VC has ever before disbursed based on what might be a suddenly stale thesis. If nanotech and Web3 were venture industry grenades, this could be a nuclear bomb.

Starting point is 00:14:51 Investors I spoke to over the weekend aren't panicking, but they're clearly concerned, particularly that they could be taken so off guard. Don't be surprised if some deals in process get paused. There's still a ton we don't know about DeepSeek, including if it really spent as little money as it claims, and obviously there could be national security impediments for U.S. companies or consumers, given what we've seen with TikTok. But bottom line, the game has changed, end quote. And finally, let's end with Joe Weisenthal taking the contrarian view just a bit, i.e., maybe if AI is a race down to becoming a commodity, that could be a good thing. Quote, suddenly everyone is talking about the Jevons paradox. This is usually discussed with

Starting point is 00:15:35 respect to energy markets. Basically, when you get more energy efficient, you don't use less of the energy source. You just use your efficiency gains to do new things, and demand keeps booming. This is certainly the hope if you're an NVIDIA or any company that builds underlying AI infrastructure, that everyone will use the DeepSeek breakthroughs and just race even faster with no effect on total demand for compute. We'll see. As I'm typing this, NVIDIA has opened down about 13%. Certainly, investors aren't taking much comfort in the Jevons paradox right now. One of my favorite Tracy Alloway lines is that it's only a crisis when you can't throw money at the problem. COVID was a crisis because money alone wasn't enough to address it.

Starting point is 00:16:14 The supply chain shocks were a crisis because money alone couldn't fix the problem. There's no guarantee here that just throwing more money at U.S. tech companies will be enough to keep them competitive in AI, let alone chips, if it's perceived that they're falling behind. Human capital, talent, takes years and years to develop. Getting the incentives right is not something where you can snap your fingers overnight and make things happen. These are big, slow-moving things, end quote. Nothing more for you today. Talk to you tomorrow.
