a16z Podcast - The True Cost of Compute

Episode Date: August 7, 2023

With software becoming more important than ever, hardware is following suit. As the world generates more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. But how much does this all really cost? In this final segment of our AI hardware series, we tackle that question head on. Be sure to check parts 1 and 2, where we explore the emerging architectures and the momentous competition for AI hardware.

Topics Covered:
00:00 – The cost of compute
02:20 – Is this sustainable?
03:23 – The cost to train a model
05:39 – Computation requirements
09:05 – The relationship between compute, capital, and technology
11:15 – GPT-4 commenting on the technology with help from ElevenLabs

Resources:
Find Guido on LinkedIn: https://www.linkedin.com/in/appenz/
Find Guido on Twitter: https://twitter.com/appenz

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 There's very few computational problems that complex that mankind has actually undertaken. How do you think about the relationship between compute, capital, and then the technology that we have today? Yeah, that's a million-dollar question or maybe a trillion-dollar question. The expectation at the moment is that the cost of training these models may actually sort of top out or even go down a little bit as the chips get faster, but we don't discover new training material as quickly. With software becoming more important than ever, hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. But how much does all of this really cost? In this final segment of our AI hardware series, we tackle that question head on.
Starting point is 00:00:56 But if you're just catching up, be sure to check out part one and part two, where we explored the emerging architectures and the momentous competition for AI hardware. And today, we're joined again by a16z Special Advisor, Guido Appenzeller, someone who is uniquely suited for this deep dive as a storied infrastructure expert, with experience at places like Intel's Data Center Group, dealing a lot with hardware and the low-level components. So it's given me, I think, a good insight into how large data centers work, what the basic components are that make all of this AI boom possible today. Here is Guido touching on the reality of these models and how much they cost today.
Starting point is 00:01:36 Training one of these large language models today, it's not a $100,000 thing. It's probably a millions-of-dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens-of-millions-of-dollars thing. As a reminder, the content here is for informational purposes only. It should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.
Starting point is 00:02:10 For more details, including a link to our investments, please see A16c.com slash disclosures. In Gito's recent article, navigating the high cost of AI compute, Gito even noticed that access to compute resources has become a determining factor for the success of AI companies. And this is not just true for the largest companies building the largest models. In fact, many companies are spending more than 80% of their total capital raised on compute resources. So naturally, this begs the question. Is this really sustainable? The core technology that you're building in very early days towards more a complete product offering, right?
Starting point is 00:02:55 There's just a lot more boxes to check and features to implement and all the administrative parts of your application if you're getting to the enterprise. So probably you'll have more normal software development that's not AI, right? A classic software development happening. You'll probably also have a larger headcount of people that they have to pay. So at the end of the day, I would expect as a percentage that they'll go down over time, right? As an absolute amount, I think it'll be going up for some time just because this AI boom is still just in its infancy. The AI boom has just begun. and in part two, we discussed how it's unlikely for compute demand to subside anytime soon.
Starting point is 00:03:29 There, we also discussed how the decision to own or rent infrastructure can make a non-trivial difference to a company's bottom line. But there are other considerations when it comes to cost. Batch size, learning rate, and the duration of the training process all contribute to the final price tag. How much does it cost to train a model depends on a mirror? factors, right? Now, the good news is we can simplify this a little bit because the vast majority of models that are being used today are transformer models, right? That was a transformer architecture, huge breakthrough in AI. They've proven to be incredibly versatile.
Starting point is 00:04:05 They're easier to train because they paralyze a little bit better than previous models. And so in a transformer, you can sort of approximate the inference time as twice the number of parameters, and the training time is about six times the number of parameters. So if you take something like GP3, right, which is open AI's big models, model, they have 175 billion parameters, so you need twice as much. So 350 billion floating point operations, two to one inference. And so based on that, you can sort of figure out how much compute capacity you need, how this is going to scale, how you should price it, you know, how much it will cost you at the end
Starting point is 00:04:41 of the day. This also gives you for model training and idea how long the training is going to take, right? You know how much your AI accelerator can do in terms of floating point operations per second, right? You can sort of theoretically calculate how many operations it is to train your model. In practice, the math is more complicated because there are certain ways to accelerate that, so maybe you can train with a reduced precision. But it's also very hard to achieve 100% utilization on these cards. If you naively implement, you probably can be below 10% utilization, but you know, you can probably get into the tens of percent with a little bit of four. This
Starting point is 00:05:15 gives you a rough swag, how much capacity you need for training and for inference, but at the end, you probably do want to test it before you you make any final decisions on these things, make sure that your assumptions hold on how much you need. Now, if all those numbers confused you, that's okay. We'll walk through a very specific example. GPD3. GPD3 has about 175 billion parameters. And here's Gito on the computation requirements for training the model and ultimately inference. That's when you're prompting the already trained model to elicit a response. So if you just do very naively the math, right? Let's start with training, right? We know how many tokens it was trained on. We know how many parameters the model
Starting point is 00:05:55 has, so we can do a soft napkin math, and you end up with something like three times 10 to the 23 floating point operations. That's a completely crazy number, right? It's like a number with 23 digits, right? It's like hard to write down. There's very few computational problems that complex that mankind has actually undertaken, right? It's a huge effort. Then you can be like, okay, so let's take, say, an A100, right? The most commonly used card. We know how many floating point operations it can do per second. we can divide that, right? Let's give us an order of magnitude estimation, like how much time it would take, right?
Starting point is 00:06:27 And then we know how much one of these cards costs, right? Like renting an A100 costs you between, I want to say between one and four dollars probably, right, depending on who you rent it from. And you end up with something in the order of half a million dollars, right, with this very naive analysis. Now, there's a couple of things there, right? We didn't take to account optimization.
Starting point is 00:06:44 We also didn't take into account that you probably cannot run this at full capacity because of memory bandwidth limitations and network limitations. And last but not least, you probably need more than one run to get this right. You probably need a bunch of test runs. They're probably not going to be full runs and so on. But this gives you an idea that's of training one of these large language models today. It's not a hundred thousand dollar thing. It's probably millions of dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens of millions of dollars thing. And that's because you need the reserved capacity.
Starting point is 00:07:14 So if I could get all my cards for the next two months would only cost me a million dollars, but the problem is they want a two-year reservation. So really, the cost is 12 times as high. And so that basically adds a zero to much money cost. Right. And how does that compare to inference? So inference is much, much, much cheaper. Basically, my training set, for a modern text model, for example, the training set is about a trillion tokens, right? And if I run inference, each word that comes out is one token. Right. So a factor of a trillion or so faster on the inference part. If you run the numbers, like a large language model, you actually at a fraction of a cent, like a tenth of a cent or hundreds of a cent, somewhere in that
Starting point is 00:07:54 ballpark for the inference. Again, if we just naively look at this, right? For inference, your problem is usually that you have to provision for peak capacity, right? So if everybody wants to use your model at 9 a.m. on a Monday, right? You still have to pay for Saturday night at midnight when nobody is using it. That increases your cost substantially there. For some of them on specifically image models, what you can do for inference is that you use much, much cheaper cards, because the model is small enough that you can run it on, essentially
Starting point is 00:08:19 the server version of a consumer graphics card, and that can save a lot of cost. And unfortunately, as we discussed in part one, you can't just make up for these inefficiencies by piecing together a bunch of less performance chips, at least for model training. You need some very sophisticated software for that, right? Because the overhead of distributing the data between these cards would probably outweigh any saving you get from cheaper cards. Inference, on the other hand. For inference, you can often do the inference on a single card. So that's not really a problem. If you take something like Stable Diffusion, right,
Starting point is 00:08:51 a very popular model for image generation, that runs on a MacBook, for example, that has enough memory and enough compute power so you can generate an image locally. So that'll run on a relatively cheap consumer card and you don't have to use an A100 for it to do inference. So when we're talking about the training of the models, clearly the amount of compute is just drastically more than the inference.
Starting point is 00:09:11 And something else that we've already talked about is the more compute, often, not always, but often the better model. And so does this ultimately, these factors all ladder up to the idea that heavily capitalize incumbents win this race or this competition? Or how do you think about the relationship between compute, capital, and then the technology that we have today? Yeah, that's a million dollar question or maybe a trillion dollar question. I don't know. So first of all, training these models is expensive, right? For example, we haven't seen yet a really good open source large language model. And I'm sure part of the reason is that training.
Starting point is 00:09:48 these models is just really, really expensive, right? I mean, there's a bunch of enthusiasts out that would love to do this, but you need to find a couple of million or $10 million of compute capacity to do it, and that makes it so much harder, right? It means you sort of need to create a substantial effort before something like that can happen. All that said, the cost for training these models overall seems to be coming down. And in part, I think it is because it seems to me like we're becoming data limited, right? So it turns out there is a correspondence between how big your model is and what the optimal amount of training data is. for the model.
Starting point is 00:10:19 It's having a super large model with very few data doesn't get you anything or having a ton of data with a small model also doesn't get you anything, right? The size of your brain needs to roughly correspond to the length of your university education here, right? Otherwise, it doesn't work. And what this means is that because some of the large models today already leverage a good percentage of all human knowledge in a particular area. I mean, if you look at GPT, there was probably trained on something like 10% of the internet,
Starting point is 00:10:46 right, in all of Wikipedia and many books. like a good chunk of all books, right? So going up by a factor of 10, yeah, that's probably possible. Going up by a factor of 100, that's not clear if that's possible. I mean, we as mankind just haven't produced enough knowledge yet that you could absorb all of that into one of these large models. And so I think the expectation at the moment is that the cost of training these models may actually sort of top out or even go down a little bit as the chips get faster,
Starting point is 00:11:13 but we don't discover new training material as quickly. I mean, unless somebody comes up, with a new idea of to generate training material. And so if that assumption is true, I think this means that the moat that's created by these large capital investments is actually not particularly deep, right? It's more of a speed bump than something
Starting point is 00:11:30 that prevents new entrants. I mean, today, training a large language model is something that is definitely within reach for a well-funded startup, right? So for that reason, we expect to see more innovation in that area in the future. All right, that is a wrap for our AI hardware series. we genuinely hope you came away with a little more knowledge about this increasingly important space.
Starting point is 00:11:53 Because if software is indeed eating the world, well, hardware is coming along for that ride. And as a reminder, if you haven't yet listened to part one where we explore the emerging architectures and who's creating them, or part two, where we dive into the future AI stack and how founders can participate, well, those are already live and ready for consumption. And as always, thank you so much for listening. We'd actually like to leave you with a fun fact from GPT4 itself, commenting on the technology that created it. And yes, we did fact-check this,
Starting point is 00:12:28 and this is also AI-generated audio from 11 labs. We'll see you next time. Chat GPT and its sibling models are trained on diverse internet text. However, the exact amount of data used can be hard to comprehend. If we were to print all of the data used, to train these models, it could fill a large library. Consider that one single book may contain around 1 million characters. If we estimate that the training data is hundreds of gigabytes of text data,
Starting point is 00:12:58 let's take a conservative estimate and say it's 100 gigabytes. Considering that one character is approximately one byte, this would mean the model was trained on approximately 100 billion characters. If each book has 1 million characters, then the data used to train chat GPT is. equivalent to the text in approximately 100 million books. If we take the size of a large library, such as the New York Public Library, which has around 53 million items, not just books, the training data is equivalent to the text in almost twice
Starting point is 00:13:30 the number of items in that library. Thanks, chat, GBT. A quick note to close out that many models today are even bigger, with Lama 2, for example, being trained on 2 trillion tokens or about 8 trillion characters. Now, that is a lot of libraries. Thank you so much for listening to our full AI hardware series. We spent a ton of time trying to get these episodes right. So if you did enjoy them, go ahead and leave a review or tell a friend.
Starting point is 00:13:59 We would so appreciate that. And you can also look forward to a video animated version of these up on our YouTube channel soon. But for now, you can find some of our recent videos there, like my conversation with Waymo's chief product officer in a Waymo, or a conversation I had at the Aspen Ideas Festival where we discussed Classroom 2050. As always, thank you so much for listening.
