a16z Podcast - Chasing Silicon: The Race for GPUs

Episode Date: August 2, 2023

With the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. In this episode – the second in our three-part series... – we explore the challenges for founders trying to build AI companies. We dive into the delta between supply and demand, whether to own or rent, where moats can be found, and even where open source comes into play. Look out for the rest of our series, where we dive into the terminology and technology that is the backbone of AI, and how much compute truly costs!

Topics Covered:
00:00 – Supply and demand
02:44 – Competition for AI hardware
04:32 – Who gets access to the supply available
06:16 – How to select which hardware to use
08:39 – Cloud versus bringing infrastructure in house
12:43 – What role does open source play?
15:47 – Cheaper and decentralized compute
19:04 – Rebuilding the stack
20:29 – Upcoming episodes on cost of compute

Resources:
Find Guido on LinkedIn: https://www.linkedin.com/in/appenz/
Find Guido on Twitter: https://twitter.com/appenz

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 We currently don't have as many AI chips or servers as we'd like to have. How do I get access to the compute that I need? Who decides this? You're looking at some very large investment projects that take some time to adjust. We're rebuilding a stack. You can look at AI just as a new application, but honestly, I think a better way is to look at it as a different type of compute. With software becoming more important than ever,
Starting point is 00:00:27 hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. That is exactly why we've created this mini-series on AI hardware. In part one, we took you through the emerging architecture powering LLMs, from GPU to TPU, including how they work, who's creating them, and also whether we can expect Moore's Law to continue.
Starting point is 00:00:57 But part two is, in fact, for founders trying to build AI companies. And here we dive into the delta between supply and demand, why we can't just print our way out of a shortage, how founders can get access to inventory, whether they should think about renting or owning, where moats can be found, and even where open source comes into play.
Starting point is 00:01:16 You should also look out for part three coming very soon, where we break down exactly how much all of this costs, from training to inference. And today we're joined again by a16z special advisor Guido Appenzeller, someone who is truly uniquely suited for this deep dive as a storied infrastructure expert with experience like CTO for Intel's Data Center Group, dealing a lot with hardware and the low-level components. It's given me, I think, a good insight into how large data centers work,
Starting point is 00:01:44 what the basic components are that make all of this AI boom possible today. Despite working with infrastructure for quite some time, here's Guido commenting on how the momentum of the recent AI wave is shifting supply and demand dynamics. The biggest thing that is triggering that is just the crazy exponential growth of AI at the moment. AI has been booming since mid-last year. I think nobody expected how quickly it would move.
Starting point is 00:02:09 And that has just created demand which, at the moment, the market can't fulfill. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates
Starting point is 00:02:30 may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com slash disclosures. In a recent article, Guido even stated that some reputable sources indicate that demand for AI hardware outstripped supply by a factor of 10.
Starting point is 00:02:54 Here he is commenting on how that dynamic is impacting competition. We currently don't have as many AI chips or servers as we'd like to have. So for some of our portfolio companies, finding the compute capacity that they need to run their applications is actually a real challenge, right? There's a whole value chain behind that. It's a combination of many things. We have some bottlenecks on the chip manufacturing side. We have some bottlenecks on building the actual cards.
Starting point is 00:03:20 These development cycles take some time. So it's a combination of factors. But probably the biggest thing that is triggering that is just the crazy exponential growth of AI at the moment. Maybe this is a silly question, but what really is stopping companies like Intel, like Nvidia, from going in and 10x-ing their production? Like, is that on the roadmap, where we're just going to see a lot more chips and we won't see this discrepancy between supply and demand, or is there something more complex at play? It's a bit more complex because if you want to make a chip, right?
Starting point is 00:03:49 The way you do it is you make it in a foundry, right? And these are extremely large, extremely complex. Intel makes chips in their own foundries, but most companies manufacture with Taiwan Semiconductor, TSMC, right? And they are capacity constrained, right? You often have to reserve capacity long in advance. There's different processes, so, you know, it might be that for a certain process, which you don't want to use, there is capacity, but for another one that you do want to use, they don't have the capacity.
Starting point is 00:04:08 there is capacity, but for another one that you do want to use, they don't have the capacity. And you could just say, like, well, in that case, let's just build more fabs. But building a fab takes you a couple of years and probably a couple of billion or 10 billion of investment. So you're looking at some very long, large investment projects that take some time to adjust.
Starting point is 00:04:24 And that's sort of what prevents us from adjusting more quickly at the moment. While some countries are making major multi-billion dollar investments in new semiconductor production plants, aka fabs, these will take time to scale. And there are also no promises, given that expertise is concentrated in a few companies. So with demand not subsiding, what does this mean for who gets access to the supply available? It doesn't sound like the demand is going to subside, especially because we see what really seems like an intrinsic relationship between the power of these models and then the compute that's thrown at them. And so if we do expect demand to continue, I guess the question
Starting point is 00:05:06 that arises is how is this demand allocated? So how does a company, let's say if I'm a founder today, how do I get access to the compute that I need? Who decides this? Is it just who's willing to pay the most, or how is that supply being distributed? Yeah, there's some of that, right? At the moment, capacity is expensive wherever you go. You know, I tried to just run some personal experiments, tried to reserve an instance with one of the cloud service providers a few days ago, and they just didn't have any.
Starting point is 00:05:33 It was like, nope, not available. And what we're seeing is that often, in order to get access to the newer cards, the newer chips, right, if you want it at scale, you have to pre-reserve capacity. So often these are negotiations between a company and a large cloud, where you say, okay, I need this many chips for this amount of time. What they'll often ask for is a certain time commitment. So they'll be like, okay, we can give you this many chips,
Starting point is 00:05:56 but we want you to sign basically that you'll get them exclusively for two years and you pay for that amount. I think OpenAI was in the news with that, right, where you have investment deals where, for example, a cloud provider comes in and invests in a company. And as a result, the company gets capacity. So we're seeing all kinds of deals being struck. As with any scarce resource, right, there's a lot of deal making going on.
Starting point is 00:06:16 It's not just a matter of getting access to compute. It's about ensuring you get access to the kind of compute tailored to your needs. And cost is not the only factor here. What would you say in terms of the considerations that they should be keeping in mind? Really, how much should founders know about hardware and, again, about selecting which hardware to use? I think the first question, honestly, I would ask is, do you really need to consume the hardware
Starting point is 00:06:43 directly, or do you really just want to consume something that runs on top of the hardware, right? To take an example, if I want to generate images with Stable Diffusion, for example, for my mobile phone app or something like that, it might be easier to go to a SaaS company, like Replicate, for example, that will essentially host the model for you, where you just pay for access to the model and they send you back the generated images, and they will manage all the provisioning of compute infrastructure and will find the GPUs for you.
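As an editorial aside, here is roughly what that hosted option looks like in code: a minimal Python sketch of sending a prompt to a hosted image model and getting an image back. The endpoint URL, request fields, and response shape are hypothetical placeholders, not any specific provider's actual API.

# Minimal sketch of consuming a hosted model instead of provisioning GPUs
# yourself. The URL, payload fields, and response shape below are hypothetical
# placeholders, not any specific provider's real API.
import requests

HOSTED_MODEL_URL = "https://api.example-model-host.com/v1/generate"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # assumes the provider uses bearer-token auth

def generate_image(prompt: str) -> bytes:
    """Send a prompt to the hosted model and return the generated image bytes."""
    resp = requests.post(
        HOSTED_MODEL_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "stable-diffusion", "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    image_url = resp.json()["output_url"]  # hypothetical response field
    return requests.get(image_url, timeout=60).content

with open("image.png", "wb") as f:
    f.write(generate_image("a watercolor painting of a data center at sunset"))

The point of the sketch is simply that the provider handles the GPUs; you trade hardware decisions for a per-call price.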
Starting point is 00:07:21 If you do want to run your own model, I think my number one advice would be to shop around, right? There's a fair number of providers. The large clouds, in my experience, are not always the best option, right, if you price it out. We've seen that the startups typically are more likely to go with specialized clouds, like CoreWeave or Lambda, right, that are specialized in providing AI infrastructure to startups. Shop around, look at the different offers, compare prices. And when you're shopping around, in addition to price, which I feel like is a major motivating factor, what other factors are there in terms of these other companies who maybe aren't the big clouds? How are they differentiating
Starting point is 00:07:45 relative to one another, how are they standing out in that market? There's a whole sort of decision tree there. But the first thing is, one thing that often drives the decision is how much memory do I need in my cards, right? If I have a small image model, right, I might be able to work with a more consumer-grade card, which is much cheaper, right, per hour if I reserve it in a cloud, versus if I, for example, train a large language model, I not only need a card
Starting point is 00:08:09 with the most memory I can find, but I probably want to have as many cards as possible in one server because communication between them matters. I may even care about the networking fabric behind it. For some of the very large models, you're actually network constrained in terms of how quickly you can train them. So it really becomes a question of, what's your objective? Is it inference or is it training? If it's training, how big is your model? And based on that, you figure out what the card is, what kind of server you need, what kind of fabric you need between those servers, and then you sort of can decide what the right fit is for your application.
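To make the memory question concrete, here is a rough back-of-the-envelope sketch for estimating whether a model fits on a given card. The multipliers are common rules of thumb, roughly 2 bytes per parameter for fp16 weights and several times that during training for gradients and optimizer state, not exact figures for any particular hardware or framework.

# Rough back-of-the-envelope sketch: estimating GPU memory needs to decide
# between a consumer-grade card and data-center cards or a multi-GPU server.
# The multipliers are rules of thumb (about 2 bytes per parameter for fp16
# weights; several times that during training), not exact vendor figures.

def estimate_memory_gb(params_billion: float, training: bool) -> float:
    bytes_per_param = 2.0                  # fp16/bf16 weights
    overhead = 8.0 if training else 1.2    # optimizer state + activations vs. small inference overhead
    return params_billion * bytes_per_param * overhead  # billions of params * bytes ~ GB

for params, training in [(1, False), (7, False), (7, True), (70, True)]:
    need = estimate_memory_gb(params, training)
    fit = "a single consumer-grade card (e.g. 24 GB)" if need <= 24 else "data-center cards or multiple GPUs"
    mode = "training" if training else "inference"
    print(f"{params}B parameters, {mode}: ~{need:.0f} GB -> {fit}")

Numbers like these are only a starting point; real requirements also depend on batch size, sequence length, and how the model is split across devices.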
Starting point is 00:08:29 And based on that, you figure out what the card is, what kind of server you need, what kind of fabric you need between those servers, and then you sort of can decide what the right fit is for your application. Even prior to this AI wave, compute was a major line item for many software companies and the calculus of leaning on the easily accessible. cloud versus bringing infrastructure in-house was becoming an increasingly important consideration. Here is Gito touching further on that very calculus in today's era and where scale comes at to play. Compute is expensive. It's a major line item for many companies,
Starting point is 00:09:03 Even more so today, yeah? So how do you think about, again, how that impacts different companies' bottom lines and whether they really factor that into having their own allocated GPUs versus using something more like Replicate? You really have to figure out what is the right fit for you. And it probably depends a lot on the scale at which you need them, right? If you need a lot, frankly, you have to pre-reserve them, right?
Starting point is 00:09:28 You have to have your own. There's just no way around that. If you need a smaller quantity, you may be able to reserve them on a more short-term basis, or there are various models where you can consume only while your application runs, but at a higher price, right? And so this really comes down to: what kind of load do you have? What we're typically seeing is, if somebody is training, they're more likely to do a long-term reservation for a GPU
Starting point is 00:09:50 because you want to make sure you have access to it. If somebody has more continuous workloads where availability is important, if I just do inference but I want to make 100% sure that if a request comes in, I can service it, I can never be down, right, they probably need to reserve capacity as well. On the other hand, maybe I have more batch jobs where it's like, well, this job runs an hour later.
Starting point is 00:10:07 That's not the end of the world. Then you probably can go with variable capacity and just reserve it ad hoc. But it's really a conversation of: what does the usage look like? What's your demand pattern? And from that comes the best pick for the parties that you work with.
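As an aside, the rent-versus-reserve calculus Guido describes can be sketched with a toy comparison. The hourly prices below are made-up placeholders; the point is only that utilization drives the breakeven.

# Toy illustration of the rent-versus-reserve calculus discussed above.
# The hourly prices are made-up placeholders; the takeaway is that the
# breakeven depends on how many hours per month the GPU is actually busy.

ON_DEMAND_PER_HOUR = 4.00   # hypothetical on-demand GPU price ($/hour)
RESERVED_PER_HOUR = 2.50    # hypothetical committed price ($/hour, paid for every hour)
HOURS_PER_MONTH = 730

def monthly_costs(busy_hours: float) -> tuple[float, float]:
    on_demand = busy_hours * ON_DEMAND_PER_HOUR
    reserved = HOURS_PER_MONTH * RESERVED_PER_HOUR  # paid whether or not the GPU is busy
    return on_demand, reserved

for utilization in (0.1, 0.5, 0.9):
    od, res = monthly_costs(utilization * HOURS_PER_MONTH)
    cheaper = "on-demand" if od < res else "reserved"
    print(f"{utilization:.0%} utilization: on-demand ${od:,.0f} vs reserved ${res:,.0f} -> {cheaper} is cheaper")

Under these placeholder prices the breakeven sits a bit above 60% utilization, which is why steady training workloads tend toward reservations while bursty batch jobs tend toward on-demand capacity.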
Starting point is 00:10:44 We've seen that companies, even prior to AI, have benefited from building their own infrastructure by basically bringing that in-house, because before that they were renting and they were paying a lot to rent that compute. Do you think that will be a differentiator for companies moving forward, or how should founders be thinking about that relationship between owning the infrastructure and renting it? Owning the infrastructure comes with cost as well, right? Because you need to now hire people that run it, right? You need to get money for the CapEx and so on. So my guess is that most early stage founders, and probably even most mid-stage and late-stage founders, are better off by renting capacity, renting in a cloud, or using a SaaS service. There's a couple of exceptions. If you have really, really specialized needs, right, you may just not find anybody who has exactly the kind of hardware that you need, right?
Starting point is 00:11:13 And there might be some cases where you have geopolitical concerns, or your data is just too sensitive, such that you need to run your own data center. And there's probably a certain scale where it makes sense for you to run your own data center. But it's a pretty large scale. If you're spending $10 million a year, you're probably still below critical scale. If you're spending $100 million a year on infrastructure, that may be a reason to look into options for your own data center. But if everyone is competing for the same compute, are there other ways to stand out? Where's the moat here? You could say a moat is getting access to differentiated
Starting point is 00:11:40 You could say a moat is getting access to different. training data, but that actually doesn't necessarily have to do with compute or money being thrown out of the problem. It's getting access to differentiated data. If you have access to differentiated data, that could be a mode. I mean, it's a bit more subtle because, look, if you had an area where there's just not much public training data, that's probably right. There might be areas like in finance or so, where that's the case. But for a large language model, it turns out that just making a larger model in training on more data has more benefits than just absorbing more knowledge. It also means that it's better in reasoning and understanding abstract context than answering
Starting point is 00:12:20 really complex, multi-stage questions and so on. So probably, if I have to guess, I think the future will be that we'll still train on all the data we can find. And then maybe you fine-tune, meaning you yourself do some additional training on a particular problem domain with your private data. That makes sense. Right. So you first go to elementary school to learn reading and writing, and then you go to your vocational training for the specialized job that you're going to do in the future. Another important question worth addressing is who can realistically compete. If compute is expensive, will the largest, most heavily capitalized companies win,
Starting point is 00:12:53 since they can build the largest models with the most data? Or what role does open source play? As one of many emerging examples, Vicuna was created by fine-tuning Meta's LLaMA 1 model for chat. The cost of fine-tuning added only an additional $300, but the result is competitive with much larger models like ChatGPT or Bard. So what might this example and a growing number of open source projects tell us about the future of open LLMs? So first of all, in general, larger models, everything else being equal, perform better. So the really small open source models that we're seeing out there today,
Starting point is 00:13:36 they're not yet at the level of a GPT-3.5 or GPT-4. And there's actually a website that runs sort of regular bakeoffs where they basically ask users to compare answers. And it seems to be pretty clear that the large ones are still a little bit ahead. That said, we're making big advances there. And we're figuring out a couple of things. So one thing we've learned is there's something called the Chinchilla scaling laws that basically give us an idea of how data corresponds to model size.
Starting point is 00:14:02 And if we over-train, so we don't train as efficiently as we could, we can actually get potentially a smaller and better model, right? So you can match the performance of a large model with a smaller model if you train it more, right? So that's interesting. That reduces model sizes. And the trend at the moment is to make slightly smaller models and train them more to get equal performance.
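For context, the Chinchilla result referenced here is often summarized with a rule of thumb of roughly 20 training tokens per parameter for a compute-optimal model. The sketch below simply applies that ratio to a few illustrative model sizes; the numbers are approximations, not official training figures for any specific model.

# Rough sketch of the Chinchilla rule of thumb mentioned above: a
# compute-optimal model is trained on roughly ~20 tokens per parameter.
# Real runs often deliberately "over-train" smaller models well past this
# ratio to get a smaller model with comparable quality.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio from the Chinchilla paper

def compute_optimal_tokens(num_params: float) -> float:
    return num_params * TOKENS_PER_PARAM

for params_b in (7, 70, 175):
    tokens_b = compute_optimal_tokens(params_b * 1e9) / 1e9
    print(f"{params_b}B parameters -> ~{tokens_b:,.0f}B training tokens at ~20 tokens/parameter")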
Starting point is 00:14:26 The second thing is that when we talk about models, there are models for slightly different purposes, right? You have the base large language models. All they're trained on, practically speaking, is completing text, right? Literally, how you train them is you give them text and say, guess the next letter. And then you tell them, nope, that was wrong, or yes, that was right, and basically backpropagate how they predict. And they're really good at that, completing text, right? That's not quite the same as what you want from a chatbot or from a model that you can tell to do something.
Starting point is 00:14:46 So there's usually another step afterwards, which is called fine-tuning for instruction following, or for chat specifically, where basically I tell a model, look, if somebody asks you to come up with, like, a list of steps for how to make pizza, right? This is roughly what I expect you to answer, right? These models are very good at learning these
Starting point is 00:15:03 things. So basically, you first train them to just complete text, and then you train them how to react to human requests and instructions, right? It's called instruction fine-tuning. And so LLaMA, for example, that was a Facebook model where they published the weights for researchers. And then some people took that and they fine-tuned it, meaning they took a bunch of instruction fine-tuning data, to turn it into Alpaca or Vicuna, which is a much, much nicer model in terms of interacting with it, right? For humans, it's much, much more useful. And so the biggest challenge at the moment we have on the open source side is there's currently no large open source LLM out there, right? GPT-3 has 175 billion parameters. There's currently nothing in that weight class that's open source and that people could use to fine-tune, to play with, or to modify.
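To ground the "guess the next token and backpropagate" description above, here is a minimal PyTorch-style sketch of the next-token prediction loss used in pre-training. It assumes a generic causal language model that maps token ids to next-token logits, and it is a simplification of a real training loop rather than how any particular model was actually trained.

# Minimal sketch of the "guess the next token, then backpropagate" step
# described above, using PyTorch. `model` is assumed to be any causal
# language model that maps token ids to next-token logits; real pre-training
# loops are considerably more involved.
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """token_ids: (batch, seq_len) tensor of integer text tokens."""
    inputs = token_ids[:, :-1]    # everything except the last token
    targets = token_ids[:, 1:]    # the "next token" at every position
    logits = model(inputs)        # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# One training step would then look roughly like:
#   optimizer.zero_grad()
#   loss = next_token_loss(model, batch_of_token_ids)
#   loss.backward()   # backpropagate how the model should adjust its predictions
#   optimizer.step()
# Instruction fine-tuning reuses the same idea, just on curated prompt/response pairs.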
Starting point is 00:15:43 could use to fine tune or to play without to modify. It is worth noting that since this recording, several more open models have been released, including Lama 2 with 70 billion parameters and an open license, unlike its predecessor, Lama 1. Another 40, billion-parameter open-source model Falcon was released as well. Both of these are still dwarfed in parameters compared to closed models like OpenAI's GPT3 at 175 billion parameters or GPD4 at an estimated 1.8 trillion parameters, although the latter is speculated to be a collection of multiple smaller models. However, parameter count is not the only driver of performance. For example, while Lama 2 has fewer models than GPD
Starting point is 00:16:30 its performance is actually much better due to being trained on more data. In fact, Llama 2 is currently comparable to GPT-3's successor, GPT-3.5, the current default of ChatGPT. And as many of these models continue to get larger, we may see some models compress, becoming more efficient and enabling inference on your device. You already mentioned Stable Diffusion can run on your computer's GPU. Do we expect to see more of that? Because right now they are all hosted by these companies, right? They're trained by these companies on their dedicated servers.
Starting point is 00:17:08 And then even if you interface with ChatGPT, it's running that inference for you. Do we expect to see that change at all as compute becomes cheaper, maybe more decentralized? Or how would you think about that? That's a really good question. And we're speculating a little bit here. But my guess is we will, right? And we're seeing some of these smaller models getting pretty good. They run on your laptop or even your phone.
Starting point is 00:17:30 we're starting to see Stable Diffusion implementations that run well on phones, which I would have never thought, right? And they take a couple of tens of seconds to create an image, which is comparatively slow, but there's certain applications where that's acceptable. So my guess is, as both the devices get faster and the models get more optimized, right, this will be a trend that we see more and more. In the future, it might just be part of the operating system to have a basic large language model, a basic image generation model. Maybe I'm off base here, but we've talked about how expensive compute can be and how ultimately that can be a major line item for companies. And I guess probably the model training will remain with those companies and not necessarily on folks' devices. But in terms
Starting point is 00:18:12 of the inference, I assume that's still a pretty significant cost. And in a way, if someone is able to run that locally, doesn't that save the company from having to pay for that compute, because it's running on, let's say, someone's MacBook GPU? Oh, yeah, totally. I mean, look, if I can generate an image on my phone directly, all it takes is some battery power and it gets a little warm, right? And that's it, right? So that's a huge advantage. At the same time, there's probably going to be a little bit of bifurcation there on
Starting point is 00:18:37 quality and parameters, right? You can run things locally, but you can probably run them a lot better in the cloud, right, because you have a much bigger server there. So it probably depends on what you want to do. If I just want to have a better spell checker that checks my email or maybe just some simple completion, that's perfectly fine. I can run that on my phone. On the other hand, if I want something that can write a good speech or
Starting point is 00:18:58 a complex text, that might be like, well, that's going to run in the cloud because it takes so many more operations. Hopefully, this is getting your wheels spinning in terms of what can be built. And here is Guido speaking to how this presents a fundamentally new stack
Starting point is 00:19:13 and what that means in terms of opportunity. It feels like this really is this massive wave, this renaissance of innovation. It's full of opportunities, right? I mean, we're rebuilding a stack. You can look at AI just as a new application, but honestly, I think a better way is to look at it as a different type of compute, right?
Starting point is 00:19:31 We traditionally build software by composing algorithms in a way that we understand well and where the end result was programmed, so bottom-up constructed. Now we have a second type of compute where we just train a large neural network. And the big advantage is we don't actually need to know how to solve a problem. As long as the neural network can figure it out, we're fine. And that opens up a bunch of new applications, but it also means you need a completely different
Starting point is 00:19:55 stack in terms of all the different pieces, right? You probably want vector DBs to retrieve context. You want different types of hosting providers that are good at hosting these models and providing them to you as a service. It's a whole, like, Cambrian explosion of creativity, with a whole new ecosystem forming, and I think there's a ton of opportunities to build companies, right?
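To make the "vector DBs to retrieve context" piece of that new stack concrete, here is a toy sketch of retrieval-augmented generation. The embed and generate functions stand in for whatever embedding model and LLM you choose, and the in-memory store is a stand-in for a real vector database.

# Toy sketch of the "vector DB to retrieve context" pattern mentioned above.
# `embed` and `generate` stand in for whatever embedding model and LLM you
# use; the in-memory list below is a stand-in for a real vector database.
import math
from typing import Callable

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norms if norms else 0.0

class ToyVectorStore:
    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.items: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.items.append((self.embed(text), text))

    def search(self, query: str, k: int = 3) -> list[str]:
        query_vec = self.embed(query)
        ranked = sorted(self.items, key=lambda item: cosine_similarity(query_vec, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def answer(question: str, store: ToyVectorStore, generate: Callable[[str], str]) -> str:
    # Retrieve the most relevant snippets, then hand them to the model as context.
    context = "\n".join(store.search(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)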
Starting point is 00:20:29 I think that paints a pretty incredible picture of opportunity across the stack. And as many of these trends continue to progress, like supply and demand, the calculus of renting versus owning compute, closed versus open source models, we look to part three of the series to answer a very important question: how much does all of this cost? We'll explore all this in depth, including how much startups are really spending on AI compute and whether that's sustainable, how much it really costs to train a model like GPT-3, the difference in cost between training and inference,
Starting point is 00:20:49 and how all of this will change with time. We'll see you there. Thank you so much for listening to part two of our AI hardware series. We spent a lot of time trying to get these episodes right. So if you are enjoying them, go ahead and leave a review or tell a friend. We'll also have an animated video version up on our YouTube channel soon. But for now, you can find some of our recent videos, like my conversation with Waymo's chief product officer in a Waymo, or a conversation I recently had at the Aspen Ideas Festival
Starting point is 00:21:25 where we discussed the classroom of 2050. As always, thank you so much for listening.
