The a16z Show - Chasing Silicon: The Race for GPUs

Starting point is 00:00:00 We currently don't have as many AI chips or servers as we'd like to have. How do I get access to the compute that I need? Who decides this? You're looking at some very large investment projects that take some time to adjust. We're rebuilding a stack. You can't look at AI just as a new application, but honestly, I think it's probably a better way to look at it as a different type of compute. With software becoming more important than ever,

Starting point is 00:00:27 hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. That is exactly why we've created this mini-series on AI hardware. In part one, we took you through the emerging architecture powering LLMs, from GPU to TPU, including how they work, who's creating them, and also whether we can expect Moore's Law to continue. But part two is for

Starting point is 00:00:58 the founders trying to build AI companies. And here we dive into the delta between supply and demand, why we can't just print our way out of a shortage, how founders can get access to inventory, whether they should think about renting or owning, where moats can be found, and even where open source comes into play. You should also look out for part three coming very soon,

Starting point is 00:01:19 where we break down exactly how much all of this costs, from training to inference. And today we're joined again by a16z Special Advisor Guido Appenzeller, someone who is truly uniquely suited for this deep dive as a storied infrastructure expert with experience like: CTO for Intel's Data Center Group, dealing a lot with hardware and the low-level components. So it's given myself, I think, a good insight into how large data centers work, what the basic components are that make all of this AI boom possible today.

Starting point is 00:01:50 Despite working with infrastructure for quite some time, here's Guido commenting on how the momentum of the recent AI wave is shifting supply and demand dynamics. The biggest thing that is driving that is just the crazy exponential growth of AI at the moment. AI has been booming since mid last year. I think nobody expected how quickly it would move. And that has just created a demand which at the moment the market can't fulfill. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates

Starting point is 00:02:29 may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures. In a recent article, Guido even stated that some reputable sources indicate that demand for AI hardware outstripped supply by a factor of 10. Here's Guido commenting on how that dynamic is impacting competition. We currently don't have as many AI chips or servers as we'd like to have. So for some of our portfolio companies, finding the compute capacity that they need to run their applications is actually a real challenge, right?

Starting point is 00:03:11 There's a whole value chain behind that. It's a combination of many things. We have some bottlenecks on the chip manufacturing side. We have some bottlenecks on building the actual cards. These development cycles take some time. So it's a combination of factors. But probably the biggest thing that is driving that is just the crazy exponential growth of AI at the moment. Maybe this is a silly question.

Starting point is 00:03:31 But what really is stopping companies like Intel, like Nvidia, from going and 10xing their production? Like, is that on the roadmap where we're just going to see a lot more chips and we won't see this discrepancy between supply and demand, or is there something more complex at play? It's a bit more complex because if you want to make a chip, right, the way you do it is you make it in a foundry, right, which are extremely large, extremely complex.

Starting point is 00:03:54 Intel makes chips in their own foundries, but most companies manufacture with Taiwan Semiconductor, TSMC, right? And they are capacity constrained, right? You often have to reserve capacity long in advance. There are different processes. So, you know, it might be that for a certain process, which you don't want to use, there's capacity. But for another one that you do want to use, they don't have the capacity.

Starting point is 00:04:11 And you could just say, like, well, in that case, let's just build more fabs. But building a fab takes you a couple of years and probably a couple of billion or 10 billion of investment. So you're looking at some very large investment projects that take some time to adjust. And that's sort of what prevents us from reacting more quickly at the moment. While some countries are making major multi-billion-dollar investments

Starting point is 00:04:32 in new semiconductor fabrication plants, aka fabs, these will take time to scale. And there are also no promises, given that expertise is concentrated in a few companies. So with demand not subsiding, what does this mean for who gets access to the supply available? It doesn't sound like the demand is going to subside, especially because we see what seems like an intrinsic relationship between the power of these models and the compute that's thrown at them. And so if we do expect demand to continue, I guess the question that arises is how is this demand allocated? So how does a company, let's say if I'm a founder today, how do I get access to the compute that I need?

Starting point is 00:05:16 Who decides this? Is it just who's willing to pay the most? Or how is that supply being distributed? Yeah, there's some of that. At the moment, capacity is expensive wherever you go. You know, I tried just to run some personal experiments, tried to reserve an instance from one of the cloud service providers a few days ago. And they just didn't have any. It was like, no, not available. And what we're seeing is that often in order to get access to the newer cards and newer chips, right, if you want it at scale, you have to pre-reserve capacity. So often these are negotiations between a company and a large cloud where you say, okay, I need this many chips for this amount

Starting point is 00:05:50 of time. What they'll often ask for is a certain time commitment. So they'll be like, okay, we can give you this many chips, but we want you to sign basically that you'll get them exclusively for two years and you pay for that amount. And I think OpenAI was in the news with that, right? Where you have investment deals where, for example, a cloud provider comes in and invests in a company. And as a result, the company gets capacity. So we're seeing all kinds of deals being struck.

Starting point is 00:06:11 As with any scarce resource, right, there's a lot of deal making going on. It's not just a matter of getting access to compute. It's about ensuring you get access to the kind of compute tailored to your needs. And cost is not the only factor here. What would you say in terms of the considerations that they should be keeping in mind, really how much should founders know about hardware and again, selecting which hardware to use? I think the first question, honestly, I would ask is, do you really need to consume the hardware directly or do you really just want to consume something that runs on top of the hardware, right?

Starting point is 00:06:47 Let's take an example. If I want to generate images with Stable Diffusion, for example, for my mobile phone app or something like that, it might be easier to go to a SaaS company, like Replicate, for example, that essentially will host the model for you, where you just pay for access to the model, and they send you back the generated images, and they will manage all the provisioning of compute infrastructure and will find the GPUs for you. If you do want to run your own model,

Starting point is 00:07:12 I think my number one advice would be to shop around, right? There's a fair number of providers. The large clouds, in my experience, are not always the best option, right, if you price it out. We've seen that the startups typically are more likely to go with specialized clouds, like CoreWeave or Lambda, right? That are specialized in providing AI infrastructure to startups.

Starting point is 00:07:31 Shop around, look at the different offers, compare prices. And when you're shopping around, in addition to price, which I feel like is a major motivating factor, what other factors are there in terms of these other companies who maybe aren't the big clouds? How are they differentiating relative to one another? How are they standing out in that market? There's a whole sort of decision tree there.

Starting point is 00:07:52 But the first thing is, one thing that often drives the decision is how much memory do I need in my cards, right? If I have a small image model, right, I might be able to work with a more consumer-grade card, which is much cheaper, right, per hour, if I reserve it in a cloud, versus if I, for example, train a large language model, I not only need a card with the most memory I can find,

Starting point is 00:08:11 but I probably want to have as many cards as possible in one server because communication between them matters, and I may even care about the networking fabric behind it, right? For some very large models, you're actually network constrained in terms of how quickly you can train them. So it really becomes a question of what's your objective: is it inference, is it training, and if it's training, how big is your model, right?
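As a rough illustration of the "how much memory do I need in my cards" question, here is a minimal sketch that estimates the memory needed just to hold a model's weights. It assumes 2 bytes per parameter (16-bit weights); the function names and the 24 GB and 80 GB card sizes are hypothetical, chosen only for illustration. Training needs considerably more memory than this for gradients, optimizer state, and activations, so treat the result as a lower bound.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Gigabytes (10^9 bytes) needed to store the weights alone.

    Assumption: 2 bytes per parameter, i.e. 16-bit weights.
    """
    return num_params * bytes_per_param / 1e9


def cards_needed(num_params: float, card_memory_gb: float,
                 bytes_per_param: int = 2) -> int:
    """Smallest number of cards whose combined memory holds the weights."""
    total_gb = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(total_gb / card_memory_gb)


if __name__ == "__main__":
    # A 7-billion-parameter model on a hypothetical 24 GB consumer-grade card.
    print(weight_memory_gb(7e9), cards_needed(7e9, 24))
    # A 175-billion-parameter model on hypothetical 80 GB data-center cards.
    print(weight_memory_gb(175e9), cards_needed(175e9, 80))
```

Once the weights no longer fit on one card, the model has to be split across several, which is where the communication between cards and the networking fabric start to matter.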

Starting point is 00:08:29 And based on that, you figure out what the card is, what kind of server you need, what kind of fabric you need between those servers, and then you sort of can decide what the right fit is for your application. Even prior to this AI wave, compute was a major line item for many software companies, and the calculus of leaning on the easily accessible cloud versus bringing infrastructure in-house

Starting point is 00:08:49 was becoming an increasingly important consideration. Here is Guido touching further on that very calculus in today's era, and where scale comes into play. Compute is expensive. It's a major line item for many companies, and this is even before the AI revolution, you could say. Even more true today, yeah? So how do you think about, again,

Starting point is 00:09:10 how that impacts different companies' bottom lines and whether they really factor that in to having their own allocated GPUs versus using something more like Replicate? You really have to figure out what is the right fit for you, and it probably depends a lot on the scale at which you need them, right? If you need a lot, frankly, you have to pre-reserve them.

Starting point is 00:09:28 You have to have your own. There's just no way around that. If you need a smaller quantity, you may be able to reserve them on a more short-term basis, or you have various models where you can consume only while your application runs, but at a higher price, right? And so this really comes down to what kind of load do you have? What we're typically seeing is if somebody is training, they're more likely to do a long-term reservation for a GPU because you want to make sure you have access to it. If somebody has more continuous workloads where availability is important, like if

Starting point is 00:09:56 I just do inference, but I want to make 100% sure that if the request comes in, I can service it, I can never be down, right? They probably need to reserve capacity as well. On the other hand, if I have more batch jobs where it's like, ah, this job runs an hour later, that's not the end of the world, then you probably can go with variable capacity and just reserve it ad hoc.
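The trade-off between reserving capacity and consuming it ad hoc at a higher price can be sketched as a simple break-even calculation. This is only an illustration: the function names are hypothetical and the prices are made-up placeholders, not quotes from any provider. It also ignores availability, which the discussion above treats as a separate reason to reserve.

```python
def break_even_hours(on_demand_rate: float, reserved_monthly_cost: float) -> float:
    """Hours of use per month above which a reservation becomes cheaper."""
    return reserved_monthly_cost / on_demand_rate


def cheaper_option(hours_used_per_month: float, on_demand_rate: float,
                   reserved_monthly_cost: float) -> str:
    """Compare a flat monthly reservation with pay-per-hour usage."""
    on_demand_cost = hours_used_per_month * on_demand_rate
    return "reserved" if reserved_monthly_cost < on_demand_cost else "on_demand"


if __name__ == "__main__":
    # Placeholder prices: 4.0 per GPU-hour on demand, 1440.0 per month reserved.
    print(break_even_hours(4.0, 1440.0))
    # An occasional batch job versus a nearly always-on service.
    print(cheaper_option(100, 4.0, 1440.0))
    print(cheaper_option(700, 4.0, 1440.0))
```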

Starting point is 00:10:12 But it's really a conversation of what is your usage pattern? What is your demand pattern? And from that comes the best pick for the parties that you work with. We've seen that companies, even prior to AI, have benefited from building their own infrastructure by basically bringing that in-house because before that they were renting and they were paying a lot to rent that compute. Do you think that will be a differentiator for companies moving forward or how should founders be thinking about that relationship between owning the infrastructure and renting it?

Starting point is 00:10:45 Owning the infrastructure comes with cost as well, right? Because you need to now hire people that run it, right? You need to get money for the CapEx and so on. So my guess is that most early stage founders and probably even most mid-stage and late-stage founders are better off by renting capacity, renting from a cloud or using a consumer SaaS service. There's a couple of exceptions. If you have really, really specialized needs, right, you may just not find anybody who has exactly the kind of hardware that you need, right? There might be some cases where you have geopolitical concerns, or your data is just so sensitive that you need to run your own data center. And there's probably a certain scale where it makes sense for you to run your own data center. But it's a pretty large scale.

Starting point is 00:11:24 If you're spending $10 million a year, you're probably still undercritical. If you're spending $100 million a year on infrastructure, that may be a reason to look into options for your own data center. But if everyone is competing for the same compute, are there other ways to stand out? Where's the moat here? You could say a moat is getting access to different training data, but that actually doesn't necessarily have to do with

Starting point is 00:11:47 compute or money being thrown at the problem. It's getting access to differentiated data. If you have access to differentiated data, that could be a moat. I mean, it's a bit more subtle. Because, look, if you had an area where there's just not much public training data, that's probably right. There might be areas like in finance or so where that's the case.

Starting point is 00:12:06 But for a large language model, it turns out that just making a larger model and training it on more data has more benefits than just absorbing more knowledge. It also means that it's better at reasoning and understanding abstract context and answering really complex, multistage questions and so on. So probably, if I have to guess, I think the future will be that we'll still train on all the data we can find. And then maybe you fine-tune, meaning you sort of do some additional training on a particular problem domain with your private data. That makes sense. Right.

Starting point is 00:12:35 So you first go to elementary school to learn reading and writing, and then you go to your vocational training for the specialized job that you have to do in the future. Another important question worth addressing is who can realistically compete. If compute is expensive, will the largest, most heavily capitalized companies win, since they can build the largest models with the most data? Or what role does open source play? As one of many emerging examples, Vicuna was created by fine-tuning Meta's LLaMA 1 model for chat. The cost of fine-tuning added only an additional $300, but the result is competitive with

Starting point is 00:13:13 much larger models like ChatGPT or Bard. So what might this example and a growing number of open source projects tell us about the future of open LLMs? So first of all, in general, larger models, everything else being equal, perform better. So the really small open source models that we're seeing out there today, they're not yet at the level of a GPT-3.5 or GPT-4. And there's actually a website that runs sort of regular bakeoffs

Starting point is 00:13:44 where they basically ask users to compare answers. And it seems to be pretty clear that the large ones are still a little bit ahead. That said, we're making big advances there. We're figuring out a couple of things. One thing we've learned is there's something called the Chinchilla scaling laws that basically give us an idea of how the amount of data corresponds to a model size.
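To make the relationship between data and model size concrete, here is a minimal sketch using two widely quoted rules of thumb, both of which are assumptions rather than something stated in this episode: roughly 20 training tokens per parameter as a reading of the Chinchilla result, and roughly 6 floating-point operations per parameter per token as an estimate of training compute. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20        # assumption: rule-of-thumb reading of Chinchilla
FLOPS_PER_PARAM_TOKEN = 6    # assumption: common training-compute approximation


def compute_optimal_tokens(num_params: int) -> int:
    """Approximate number of training tokens for a compute-optimal run."""
    return TOKENS_PER_PARAM * num_params


def training_flops(num_params: int, num_tokens: int) -> int:
    """Approximate total training compute in floating-point operations."""
    return FLOPS_PER_PARAM_TOKEN * num_params * num_tokens


if __name__ == "__main__":
    params = 70_000_000_000  # a 70-billion-parameter model
    tokens = compute_optimal_tokens(params)
    print(tokens, training_flops(params, tokens))
```

Training a model on more tokens than this estimate spends extra training compute to get more quality out of a given model size, which in turn makes the model cheaper to serve.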

Starting point is 00:14:02 If we over-train, so don't train as efficiently as we could, we can actually get potentially a smaller and better model. So you can match the performance of a large model with a smaller model if you train it more. So that's interesting. That reduces model sizes. And the trend at the moment is to make slightly smaller models and train them

Starting point is 00:14:18 more to get equal performance. The second thing is that when we talk about models, there are models for slightly different purposes, right? You have the base large language models. All they're trained on, practically speaking, is completing text, right? Literally, how you train them is you give them text and say, guess the next letter.

Starting point is 00:14:32 And then you tell them, nope, that was wrong, or yes, that was right, and basically backpropagate how they predict. And they're really good at that, completing text, right? That's not quite the same as what you want from a chatbot or from a model that you can tell to do something. So there's usually another step afterwards, which is called fine-tuning for instruction following

Starting point is 00:14:51 or for chat specifically, where basically I tell a model, look, if somebody asks you to come up with a list of to-do items, like a list of steps for how to make pizza, right? This is roughly what I expect you to answer, right? These models are very good at learning these things. So basically, you first train them just to complete text, and then you train them how to react to human requests and instructions, right? It's called instruction fine-tuning.
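The two training stages described here differ mainly in what the training examples look like. The sketch below shows only the shape of the data, not a training loop: base training turns raw text into (context, next token) pairs, and instruction fine-tuning adds examples that pair a request with the expected answer. The function names and the "Instruction:/Response:" template are hypothetical; real projects use their own tokenizers and prompt formats.

```python
def next_token_examples(tokens):
    """Build (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def instruction_example(instruction, response):
    """Format one request/answer pair as a single training text.

    The template is a made-up placeholder, used only for illustration.
    """
    return f"Instruction: {instruction}\nResponse: {response}"


if __name__ == "__main__":
    # Stage 1: plain text completion.
    for context, target in next_token_examples(["how", "to", "make", "pizza"]):
        print(context, "->", target)

    # Stage 2: instruction following, trained with the same next-token objective
    # but on texts that demonstrate the desired behavior.
    print(instruction_example(
        "List the steps to make pizza.",
        "1. Make the dough. 2. Add toppings. 3. Bake.",
    ))
```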

Starting point is 00:15:14 And so LLaMA, for example, that was a Facebook model where they published the weights for researchers, and then some people took that and they fine-tuned it, meaning they took a bunch of instruction-following examples to turn it into Alpaca or Vicuna, which is a much, much nicer model in terms of interacting with it, right? For humans, it's much, much more useful. And so the biggest challenge at the moment we have on the open source side is that there's currently no large open source LLM out there.

Starting point is 00:15:36 GPT-3 has 175 billion parameters. There's currently nothing in that weight class that's open source and that people could use to fine-tune or to play with or to modify. It is worth noting that since this recording, several more open models have been released, including Llama 2 with 70 billion parameters and an open license, unlike its predecessor, LLaMA 1. Another 40 billion parameter open source model, Falcon, was released as well. Both of these are still dwarfed in parameters compared to closed models like OpenAI's GPT-3 at 175 billion parameters,

Starting point is 00:16:13 or GPT-4 at an estimated 1.8 trillion parameters, although the latter is speculated to be a collection of multiple smaller models. However, parameter count is not the only driver of performance. For example, while Llama 2 has fewer parameters than GPT-3, its performance is actually much better due to being trained on more data. In fact, Llama 2 is currently comparable to GPT-3's successor, GPT-3.5, the current default of ChatGPT. And as many of these models continue to get larger, we may see some models compress, becoming more efficient and enabling inference on your device.

Starting point is 00:16:56 You already mentioned Stable Diffusion, which can run on your computer's GPU. Do we expect to see more of that? Because right now they are all hosted by these companies, right? They're trained by these companies on their dedicated servers. And then even if you interface with ChatGPT, it's running that inference for you. Do we expect to see that change at all as compute becomes cheaper, maybe more decentralized? Or how would you think about that? That's a really good question.

Starting point is 00:17:21 And we're speculating a little bit here, but my guess is we will, right? And we're seeing some of these smaller models getting pretty good. They run on your laptop or even your phone. We're starting to see Stable Diffusion implementations that run well on phones, right, which I would have never thought, right? And they take a couple of tens of seconds to create an image, which is comparatively slow, but for certain applications that's acceptable. So my guess is as both the devices get faster and the models get more optimized, right?

Starting point is 00:17:48 This will be a trend that we see more and more. And in the future, it might just be part of the operating system to have a basic large language model, a basic image generation model. Maybe I'm off base here, but we've talked about how expensive compute can be and how ultimately that can be a major line item for companies. And I guess probably the model training will remain with those companies and not necessarily on folks' devices. But in terms of the inference, I assume that's still a pretty significant cost. And in a way, if someone is able to run that locally, doesn't that disjoint the company from having to pay for that compute because it's running on, let's say, someone's MacBook GPU?

Starting point is 00:18:27 Oh, yeah, totally. I mean, look, if I can generate an image on my phone directly, all it takes is some battery power and it gets a little warm, right? And that's it, right? So that's a huge advantage. At the same time, there's probably going to be a little bit of bifurcation there on quality and parameters, right? You can run things locally, but you can probably run them a lot better in the cloud, right?

Starting point is 00:18:43 Because you have a much bigger server there. So it probably depends on what you want to do. If I just want to have a better spell checker that checks my email or maybe just some simple completion, that's perfectly fine. I can run that on my phone. On the other hand, if I want something more like writing a good speech or summarizing a complex text, it might be like, well, that's going to run in the cloud because it takes so many more operations.

Starting point is 00:19:04 Hopefully, this is getting your wheels spinning in terms of what can be built. And here is Guido speaking to how this presents a fundamentally new stack and what that means in terms of opportunity. It feels like this really is like this massive wave, this renaissance of innovation. It's full of opportunities, right? I mean, we're rebuilding a stack. You can't look at AI just as a new application,

Starting point is 00:19:27 but honestly, I think it's probably a better way to look at it as a different type of compute, right? We traditionally build software by composing algorithms in a way that we understand well, and where the end result was programmed, or sort of bottom-up constructed. Now we have a second type of compute where we just train a large neural network.

Starting point is 00:19:43 And the big advantage is we don't actually need to know how to solve a problem as long as the network can figure it out, right? If the neural network can figure it out, we're fine. And that opens up a bunch of new applications, but it also means you need a completely different stack in terms of all the different pieces, right? You probably want vector databases to retrieve context. You want different types of hosting providers that are good at hosting these models

Starting point is 00:20:04 and providing them to you as a service. It's a whole, like, Cambrian explosion of creativity, with a whole new ecosystem forming. And I think there's a ton of opportunities to build great companies. I think that paints a pretty incredible picture of opportunity across the stack. And as many of these trends continue to progress, like supply and demand, the calculus of renting versus owning compute, closed versus open source models, we look to part three of the series to answer a very important question. How much does all of this cost? We'll explore all this in depth, including how much startups are really spending on AI compute and whether that's sustainable, how much it really costs to train a model like GPT-3, the difference in cost between training and inference, and how all of this will change with time. We'll see you there.

Starting point is 00:20:57 Thank you so much for listening to Part 2 of our AI Hardware series. We spent a lot of time trying to get these episodes right, so if you are enjoying them, go ahead and leave a review or tell a friend. We'll also have an animated video version up on our YouTube channel soon, but for now you can find some of our recent videos, like my conversation with Waymo's chief product officer in a Waymo, or a conversation I recently had at the Aspen Ideas Festival where we discussed the classroom of 2050.

Starting point is 00:21:29 As always, thank you so much for listening.
