a16z Podcast - AI Hardware, Explained
Episode Date: July 27, 2023

In 2011, Marc Andreessen said, “software is eating the world.” And in the last year, we’ve seen a new wave of generative AI, with some apps becoming some of the most swiftly adopted software products of all time. So if software is becoming more important than ever, hardware is following suit. In this episode – the first in our three-part series – we explore the terminology and technology that is now the backbone of the AI models taking the world by storm. We’ll explore what GPUs are, how they work, the key players like Nvidia competing for chip dominance, and also… whether Moore’s Law is dead? Look out for the rest of our series, where we dive even deeper; covering supply and demand mechanics, including why we can’t just “print” our way out of a shortage, how founders get access to inventory, whether they should own or rent, where open source plays a role, and of course… how much all of this truly costs!

Topics Covered:
00:00 – AI terminology and technology
03:44 – Chips, semiconductors, servers, and compute
04:48 – CPUs and GPUs
06:07 – Future architecture and performance
07:01 – The hardware ecosystem
09:05 – Software optimizations
12:23 – What do we expect for the future?
14:35 – Upcoming episodes on market and cost

Resources:
Find Guido on LinkedIn: https://www.linkedin.com/in/appenz/
Find Guido on Twitter: https://twitter.com/appenz

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
If you look at the pure hardware statistics,
so how many floating point operations per second can these chips do?
There's others that are very competitive with what Nvidia has.
Are we now at the limits of lithography?
I think it's very surprising.
Who would have thought that my gaming PC or my Bitcoin miner would eventually become a good AI engine?
Power is becoming an issue, heat is becoming an issue,
and we need to rely more and more on parallel processing.
In 2011, Marc Andreessen said,
"Software is eating the world."
And the decade that followed just solidified this notion, with software infiltrating nearly every aspect of our lives.
The last year in particular introduced a new wave of generative AI, with some apps becoming some of the most swiftly adopted software products of all time.
And just like all the other software that came before it, AI software is fundamentally underpinned by the hardware that runs the underlying computation.
So, if software is becoming more important than ever, then hardware is following suit.
Plus, the world is constantly generating more data.
And unlocking the full potential of these technologies, from longer context windows to multi-modality,
means a constant need for faster and more resilient hardware.
And it's equally important for us to understand who builds and controls the supply of this resource,
especially since many of even the most established AI companies are now hardware constrained,
with some reputable sources indicating that demand for AI hardware outstripped supply by a factor of 10.
That is exactly why we've created this mini-series on AI hardware.
We'll take you on a journey through understanding the hardware that has long powered our computers,
but is now the backbone of these AI models absolutely taking the world by storm.
And in this first segment, we dive into the terminology and technology, from CPU to GPU, including what they are, how they work, the key players like Nvidia competing for chip dominance, and also we address the question, is Moore's Law dead?
But make sure to look out for the rest of our series where we dive even deeper, covering supply and demand mechanics, including why we can't just print our way out of a shortage, how founders can get access to inventory, whether
they should think about owning or renting, where open source plays a role. And of course,
how much all of this truly costs. And across all three videos, we explore with the help of
a16z Special Advisor, Guido Appenzeller, someone who is truly uniquely suited for this deep dive
as a storied infrastructure expert. I spent my last couple of years mostly in software,
but most recently, before joining Andreessen Horowitz, I actually was CTO for Intel's data center group,
where we're dealing a lot with hardware and the low-level components.
So it's given me, I think, a good insight into how large data centers work,
what the basic components are that make all of this AI boom possible today
and that really underpin this great technological ecosystem.
Guido has also spent time at Yubico, VMware, Big Switch Networks, and more.
But let's get into it.
As a reminder, the content here is for informational purposes only.
Should not be taken as legal, business, tax,
or investment advice, or be used to evaluate any investment or security, and is not directed
at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates
may also maintain investments in the companies discussed in this podcast. For more details,
including a link to our investments, please see a16z.com/disclosures.
We are increasingly hearing terms like chips, semiconductors, servers, and compute.
But are all of these the same thing, and what role do they play in our AI future?
If you're running any kind of AI algorithm, right, this AI algorithm runs on a chip.
And the most commonly used chips today are AI accelerators, which, in terms of how they're
built, are very close to graphics chips.
So the cards that these chips are on, that sit in these servers, are often referred to as GPUs,
which stands for graphics processing unit, which is kind of funny, right?
They're not doing graphics, obviously, but it's a very similar type of technology.
If you look inside of them, they basically are very good at processing a very large number of math operations per cycle, in a very short period of time.
So, very classically, an old-fashioned CPU would run one instruction every cycle, and then they had multiple cores,
so maybe a modern CPU can now do a few tens of instructions.
But these sort of modern AI cards, they can do more than 100,000 instructions per cycle.
So they're extremely performant.
So this is a GPU, and these GPUs run inside of servers.
I think of them as big boxes.
They have a power plug on the outside and a networking plug.
And then these servers sit in data centers where you have racks and racks of them
that do the actual compute.
Let's quickly recap.
CPU is central processing unit and GPU is graphics processing unit.
And while both CPUs and GPUs today can perform parallel processing,
the degree of parallelization is what sets GPUs apart for certain workloads.
So, for example, CPUs can actually do tens or even thousands
of floating point operations per cycle,
but a GPU can now do over 100,000.
The basic idea of a GPU is that instead of just working with individual values,
it works with vectors or even matrices, right, or tensors more generally.
TPU, for example, is Google's name for these kind of chips, right?
And they call them tensor processing units,
which is actually a pretty good name for them, right?
The cores in these modern GPUs are often called tensor cores,
like that's what Nvidia calls them, because they operate on tensors.
And basically, the core of their
value proposition is that they can do matrix multiplication.
So remember, matrices, like the rows and columns of numbers,
they can, for example, multiply two matrices in a single cycle.
So in a very, very fast operation.
And that's really what gives us the speed that's necessary
to run these incredibly large language and image models
that make up generative AI today.
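To put a rough number on that idea, here is a small sketch in NumPy, our own illustration rather than anything from the episode: a single matrix multiplication packs billions of multiply-adds into one operation that parallel hardware can execute at once.

# A rough NumPy sketch: one matrix multiplication expresses a huge number of
# multiply-adds as a single operation that parallel hardware can run at once.
import numpy as np

n = 1024
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# One call, but under the hood roughly n * n * n multiply-add pairs,
# i.e. about 2 * 1024**3, or around 2.1 billion, floating point operations.
c = a @ b

print(c.shape)                                       # (1024, 1024)
print(f"{2.0 * n**3:.2e} FLOPs in a single matmul")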
Today's GPUs are far more powerful than their ancestors,
whether we're comparing to the earliest graphics cards
in arcade gaming days 50 years ago,
or the GeForce 256,
the first personal computer GPU unveiled by Nvidia in 1999.
But is it surprising that we're seeing this chip design
applied so readily to the emerging space of AI?
Or should we expect a new architecture to evolve
and become more performant in the future?
In one way, I think it's very surprising.
Who would have thought that my gaming PC or my Bitcoin miner
would eventually become a good AI engine?
At the same time, what all of these problems have in common
is that you want to execute many
operations in parallel, right?
And so you can think of a GPU as something
that was built for graphics,
but you can think of them also just as something
that's very good at performing the same operation
on a very large number of parallel inputs,
right, a very large vector or a very large matrix.
All right, so perhaps it's not so surprising
that NVIDIA's prized GPUs are aligned to this AI wave,
but they're also not the only company participating.
Here is Guido breaking down the hardware ecosystem.
The ecosystem comes in many layers, right?
So let's start with the chips at the bottom.
Nvidia is king of the hill at the moment, right?
Their A100 is the workhorse that powers the current AI revolution.
They're coming up with a new one called the H100, which is the next generation.
There's a couple of other vendors in this space.
Intel has something called Gaudi, Gaudi 2, as well as their graphics card, Arc.
They're seeing some usage.
AMD has a chip in this space.
And then we have the large clouds that are starting to build, or in some cases have been building for some time, their own chips.
Google with the TPU, which you mentioned before,
that is quite popular.
And Amazon has a chip called Trainium for training and Inferentia for inference.
And we'll probably see more of those in the future from some of these vendors.
But at the moment, NVIDIA still has a very, very strong position as the vast majority of training is going on on their chips.
And when we think about the different chips, so you mentioned like the A100s are the strongest
and maybe there's the most demand for those, but how do they compare to some of these chips created by other companies?
Is it like double the performance, or is there some other metric or factor
that makes them much more performant?
It's a great question.
If you look at the pure hardware statistics,
so how many floating point operations per second
can these chips do?
There's others that are very competitive
with what Nvidia has.
Nvidia's big advantage is that they have
a very mature software ecosystem.
So imagine you are an artificial intelligence developer
or engineer or researcher.
You're often using a model that's open source
that somebody else developed.
And how fast that model runs
in many cases depends on how well it is optimized
for a particular chip.
And so the big advantage that
Nvidia has today is that their software ecosystem
is so much more mature, right?
I can grab a model,
it has all the necessary optimizations for
Nvidia to run out of the box, right?
I don't have to do anything.
But with some of these other chips,
I may have to do a lot more of these optimizations myself,
right?
And that's what gives them the strategic advantage at the moment.
So as we've touched on,
AI software is heavily dependent on hardware.
But what Guido was pointing towards here
is the performance of hardware being heavily
integrated with software.
So, NVIDIA's CUDA system makes it easier for engineers to plug in and make optimizations,
like running with lower precision numbers.
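To make that concrete, here is a minimal sketch of what running in lower precision can look like in practice, assuming PyTorch and an Nvidia GPU with the CUDA build installed; the toy model and sizes are invented purely for illustration, not anything discussed in the episode.

import torch
import torch.nn as nn

device = "cuda"  # assumes an Nvidia GPU and the CUDA build of PyTorch

# Stand-in for a real network; in practice you would load a pretrained model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).to(device)

x = torch.randn(64, 1024, device=device)

# Mixed precision: matrix multiplies run in float16 on the GPU's tensor cores,
# while numerically sensitive operations are kept in float32 automatically.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # torch.float16 for the autocast-eligible output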
Here is Guido speaking to the kind of optimizations that do exist.
And what does that actually look like in terms of those software optimizations?
Like what kind of developers are working on that?
Because that also seems to be maybe an emerging space where different companies are having to hire developers to actually facilitate that integration.
Yeah, and it happens at all layers of the stack.
Some of it is coming from academia.
Some of it is done by the large companies that operate in the space, right?
Some of them are frankly done by enthusiasts that just want to see their models run faster.
But to give an idea of how this works, like, for example, typically a floating point number is represented in 32 bits, right?
And some people figured out how to reduce that to 16 bits.
And then somebody was like, well, actually, we can do it in 8 bits.
And you have to be really careful how you do it.
You have to normalize to make sure it doesn't overrun or underrun, right?
But if you normalize everything properly, you can use much, much shorter floats or integers for these calculations.
There's many tricks like that that, you know, like the really good AI developers use
to squeeze more performance out of the chips that they have.
So to reiterate Guido's point, floating point numbers are typically represented in 32 bits.
That's 32 zeros and ones, with the first bit being for sign, the next eight for the exponent,
and the next 23 for the fraction.
This gives a fairly large range between the smallest possible value and the largest possible value,
while also allowing many steps in between.
Now, developers can choose to encode numbers in other systems with fewer bits,
but the trade-off comes with precision.
So depending on the numbers that you're working with,
this may or may not have much consequence.
But this does require some checking and normalizing,
plus an eye for overrunning or underrunning.
That's when you get a number so small or so large
that it can't be properly encoded in a system.
And just to give a sense for size,
the range of 32-bit floats lies between 10 to the power of 38 and 10 to the power of negative 38.
That's a pretty big range, while 16-bit floats operate in a much narrower range, between 10 to the power of 4 and 10 to the power of negative 5.
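For anyone who wants to see those limits directly, here is a small sketch using NumPy, our own illustration rather than something from the episode, that prints each format's range and shows what happens when a value doesn't fit.

import numpy as np

# Print the largest and smallest (normal) values each format can represent.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "smallest normal:", info.tiny)
# float32 tops out around 3.4e38; float16 tops out at 65504.

# A value outside float16's range overflows to infinity...
print(np.float16(1e5))     # inf (NumPy may also emit an overflow warning)
# ...and nearby values collapse together because the fraction has fewer bits.
print(np.float16(1.0001))  # 1.0, since the difference is below float16 precision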
Now, when many people think of semiconductors, they naturally think of Moore's Law.
That's the term that describes the phenomenon, observed by Gordon Moore back in 1965, by the way, where the number of
transistors in an integrated circuit doubles every two years. But despite our collective success
for decades to continue to push more computation onto smaller chips, are we now at the limits of
lithography? For example, an Apple M1 Ultra chip from 2022 has 114 billion, that's billion with a B,
transistors, and if we compare that to the ARM1 processor from 1985, that had 25,000. And by the way,
the Apple M1 Ultra is not even the highest transistor count.
Today, I believe that belongs to the Wafer Scale Engine 2 by Cerebras
with 2.6 trillion transistors.
So looking ahead, are we at the point where we really don't see
the same kind of advancement in at least the physical architecture of chips?
And if so, where do we see advancements moving forward?
Is it in the software?
Is it in the specialization of these chips?
How do you see this industry moving forward?
Yeah, great question.
There's two things to tease apart there.
Like Moore's Law is actually still, as of today, alive and kicking, right?
Moore's Law talks about the density of transistors on a chip,
and we're still increasing that, right?
The scale of transistors keeps going down, right?
Is it at exactly the same speed? I don't know.
But as of today, right, if you plot the curve, it seems to be intact.
There's a second thing called Dennard scaling,
which basically said that just as the number of transistors
I can squeeze onto a chip, right, doubles every 18 months or so,
the power per transistor would at the same time decrease by the same factor.
Strictly speaking, it says something about frequency, but the net outcome is power.
And that, for the last 10, 15 years or so, no longer is true.
If you look at the frequency of a CPU, it hasn't moved much over the past 10, 12, 15 years.
And the net result of this is we're getting chips that have more transistors,
but each individual core doesn't actually run faster.
And what this means is we have to have lots and lots more parallel cores,
and this is why these tensor operations are so attractive, right?
Like on a single core, I can't add numbers more quickly,
but if I can do a matrix operation instead,
it basically allows me to do many of them in parallel at the same time, right?
The second big consequence of that is that our chips are getting more and more power hungry.
If you look at even a graphics card for a gaming PC today, right,
you have these graphics cards that are like hundreds of watts of power,
a 500-watt card, right?
That is much, much more than they used to be.
And that trend is going to continue.
And if we look at what's happening in data centers,
we're seeing more and more things like liquid cooling,
at least being experimented with,
or in some cases getting deployed,
where basically the energy densities for these AI chips
are getting so high that we need novel cooling solutions to make them happen.
So Moore's Law, yes, but power is becoming an issue.
heat is becoming an issue,
and we need to rely more and more on parallel processing.
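To illustrate that point in miniature, here is a tiny sketch in NumPy, our own example rather than anything from the episode: the same one million additions run far faster as one vectorized operation than as an element-by-element loop, which is exactly the kind of parallelism modern chips are built around.

import time
import numpy as np

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# One addition at a time: a single scalar stream of instructions.
start = time.perf_counter()
out = np.empty_like(a)
for i in range(n):
    out[i] = a[i] + b[i]
loop_time = time.perf_counter() - start

# The same million additions expressed as one vector operation that the
# hardware can run in parallel (SIMD lanes on a CPU, thousands of cores on a GPU).
start = time.perf_counter()
out_vec = a + b
vector_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vector_time:.5f}s")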
So it sounds like Moore's Law is indeed not quite
dead, but perhaps a little more complex than it once was. Performance increases continue as we
integrate parallel cores, but we're also seeing chips become a lot more power hungry. All of this
will continue being dynamic as demand continues to outpace supply for high performance chips. So as we
look ahead, what does all this mean for competition and cost? You'll learn a lot more about that
in the rest of our AI hardware series, tackling the questions that everybody is asking, including
We currently don't have as many AI chips or servers as we'd like to have.
How do you think about the relationship between compute, capital, and then the technology that we have today?
Yeah, that's a million dollar question or maybe a trillion dollar question.
We'll see you there.
Thanks for listening to the a16z Podcast.
If you like this episode, don't forget to subscribe, leave a review, or tell a friend.
We also recently launched on YouTube at youtube.com slash a16z underscore video,
where you'll find exclusive video content. We'll see you next time.