a16z Podcast - AI Hardware, Explained
Episode Date: July 27, 2023

In 2011, Marc Andreessen said, “software is eating the world.” And in the last year, we’ve seen a new wave of generative AI, with some apps becoming some of the most swiftly adopted software products of all time. So if software is becoming more important than ever, hardware is following suit. In this episode – the first in our three-part series – we explore the terminology and technology that is now the backbone of the AI models taking the world by storm. We’ll explore what GPUs are, how they work, the key players like Nvidia competing for chip dominance, and also… whether Moore’s Law is dead? Look out for the rest of our series, where we dive even deeper; covering supply and demand mechanics, including why we can’t just “print” our way out of a shortage, how founders get access to inventory, whether they should own or rent, where open source plays a role, and of course… how much all of this truly costs!

Topics Covered:
00:00 – AI terminology and technology
03:44 – Chips, semiconductors, servers, and compute
04:48 – CPUs and GPUs
06:07 – Future architecture and performance
07:01 – The hardware ecosystem
09:05 – Software optimizations
12:23 – What do we expect for the future?
14:35 – Upcoming episodes on market and cost

Resources:
Find Guido on LinkedIn: https://www.linkedin.com/in/appenz/
Find Guido on Twitter: https://twitter.com/appenz

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
If you look at the pure hardware statistics,
so how many floating point operations per second can these chips do?
There's others that are very competitive with what Nvidia has.
Are we now at the limits of lithography?
I think it's very surprising.
Who would have thought that my gaming PC or my Bitcoin miner would eventually become a good AI engine?
Power is becoming an issue, heat is becoming an issue,
and we need to rely more and more on parallel processing.
In 2011, Marc Andreessen said,
"Software is eating the world."
And the decade that followed just solidified this notion, with software infiltrating nearly every aspect of our lives.
The last year in particular introduced a new wave of generative AI, with some apps becoming some of the most swiftly adopted software products of all time.
And just like all the other software that came before it, AI software is fundamentally underpinned by the hardware that runs the underlying computation.
So, if software is becoming more important than ever, then hardware is following suit.
Plus, the world is constantly generating more data.
And unlocking the full potential of these technologies, from longer context windows to multi-modality,
means a constant need for faster and more resilient hardware.
And it's equally important for us to understand who builds and controls the supply of this resource,
especially since many of even the most established AI companies are now hardware constrained,
with some reputable sources indicating that demand for AI hardware outstripped supply by a factor of 10.
That is exactly why we've created this mini-series on AI hardware.
We'll take you on a journey through understanding the hardware that has long powered our computers,
but is now the backbone of these AI models absolutely taking the world by storm.
And in this first segment, we dive into the terminology and technology, from CPU to GPU, including what they are, how they work, the key players like Nvidia competing for chip dominance, and also we address the question, is Moore's Law dead?
But make sure to look out for the rest of our series where we dive even deeper, covering supply and demand mechanics, including why we can't just print our way out of a shortage, how founders can get access to inventory, whether
they should think about owning or renting, where open source plays a role. And of course,
how much all of this truly costs. And across all three videos, we explore with the help of
a16z Special Advisor, Guido Appenzeller, someone who is truly uniquely suited for this deep dive
as a storied infrastructure expert. I spent my last couple of years mostly in software,
but most recently, before joining Andreessen Horowitz, I actually was CTO for Intel's data center group,
where we're dealing a lot with hardware and the low-level components.
So it's given me, I think, a good insight into how large data centers work,
what the basic components are that make all of this AI boom possible today
and that really underpin this great technological ecosystem.
Guido has also spent time at Yubico, VMware, Big Switch Networks, and more.
But let's get into it.
As a reminder, the content here is for informational purposes only.
Should not be taken as legal, business, tax,
or investment advice, or be used to evaluate any investment or security, and is not directed
at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates
may also maintain investments in the companies discussed in this podcast. For more details,
including a link to our investments, please see a16z.com/disclosures.
We are increasingly hearing terms like chips, semiconductors, servers, and compute.
But are all of these the same thing, and what role do they play in our AI future?
If you're running any kind of AI algorithm, right, this AI algorithm runs on a chip.
And the most commonly used chips today are AI accelerators, which, in terms of how they're
built, are very close to graphics chips.
So the cards that these chips are on, that sit in these servers, are often referred to as GPUs,
which stands for graphics processing unit, which is kind of funny, right?
They're not doing graphics, obviously, but it's a very similar type of technology.
If you look inside of them, they basically are very good at processing a very large number of math operations per cycle, in a very short period of time.
So, very classically, an old-fashioned CPU would run one instruction every cycle, and then they had multiple cores,
so maybe a modern CPU can now do a few tens of instructions.
But these sort of modern AI cards, they can do more than 100,000 instructions per cycle.
So they're extremely performant.
So this is a GPU, and these GPUs run inside of servers.
I think of them as big boxes.
They have a power plug on the outside and a networking plug.
And then these servers sit in data centers where you have racks and racks of them
that do the actual compute.
Let's quickly recap.
CPU is central processing unit and GPU is graphics processing unit.
And while both CPUs and GPUs today can perform parallel processing,
the degree of parallelization is what sets GPUs apart for certain workloads.
So, for example, CPUs can actually do tens or even thousands
of floating point operations per cycle,
but a GPU can now do over 100,000.
The basic idea of a GPU is that instead of just working with individual values,
it works with vectors or even matrices, right, or tensors more generally.
TPU, for example, is Google's name for these kind of chips, right?
And they call them tensor processing units,
which is actually a pretty good name for them, right?
The cores in these modern GPUs are often called tensor cores,
like that's what Nvidia calls them, because they operate on tensors.
And basically, the core of their
value proposition is that they can do matrix multiplication.
So remember, matrices, like the rows and columns of numbers,
they can, for example, multiply two matrices in a single cycle.
So in a very, very fast operation.
And that's really what gives us the speed that's necessary
to run these incredibly large language and image models
that make up generative AI today.
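To put a rough number on that idea, here is a small sketch in NumPy, our own illustration rather than anything from the episode: a single matrix multiplication packs billions of multiply-adds into one operation that parallel hardware can execute at once.

# A rough NumPy sketch: one matrix multiplication expresses a huge number of
# multiply-adds as a single operation that parallel hardware can run at once.
import numpy as np

n = 1024
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# One call, but under the hood roughly n * n * n multiply-add pairs,
# i.e. about 2 * 1024**3, or around 2.1 billion, floating point operations.
c = a @ b

print(c.shape)                                       # (1024, 1024)
print(f"{2.0 * n**3:.2e} FLOPs in a single matmul")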
Today's GPUs are far more powerful than their ancestors,
whether we're comparing to the earliest graphics cards
in arcade gaming days 50 years ago,
or the GeForce 256,
the first personal computer GPU unveiled by Nvidia in 1999.
But is it surprising that we're seeing this chip design
applied so readily to the emerging space of AI?
Or should we expect a new architecture to evolve
and become more performant in the future?
In one way, I think it's very surprising.
Who would have thought that my gaming PC or my Bitcoin miner
would eventually become a good AI engine?
At the same time, what all of these problems have in common
is that you want to execute many
operations in parallel, right?
And so you can think of a GPU as something
that was built for graphics,
but you can think of them also just as something
that's very good at performing the same operation
on a very large number of parallel inputs,
right, a very large vector or a very large matrix.
All right, so perhaps it's not so surprising
that NVIDIA's prized GPUs are aligned to this AI wave,
but they're also not the only company participating.
Here is Guido breaking down the hardware ecosystem.
The ecosystem comes in many layers, right?
So let's start with the chips at the bottom.
Nvidia is king of the hill at the moment, right?
Their A100 is the workhorse that powers the current AI revolution.
They're coming up with a new one called the H100, which is the next generation.
There's a couple of other vendors in this space.
Intel has something called Gaudi, Gaudi 2, as well as their graphics card, Arc.
They're seeing some usage.
AMD has a chip in this space.
And then we have the large clouds that are starting to build, or in some cases have been building for some time, their own chips.
Google with the TPU, which you mentioned before,
that is quite popular.
And Amazon has a chip called Trainium for training and Inferentia for inference.
And we'll probably see more of those in the future from some of these vendors.
But at the moment, NVIDIA still has a very, very strong position as the vast majority of training is going on on their chips.
And when we think about the different chips, so you mentioned like the A100s are the strongest
and maybe there's the most demand for those, but how do they compare to some of these chips created by other companies?
Is it like double the performance, or is there some other metric or factor
that makes them much more performant?
It's a great question.
If you look at the pure hardware statistics,
so how many floating point operations per second
can these chips do?
There's others that are very competitive
with what Nvidia has.
Nvidia's big advantage is that they have
a very mature software ecosystem.
So imagine you are an artificial intelligence developer
or engineer or researcher.
You're often using a model that's open source
that somebody else developed.
And how fast that model runs
in many cases depends on how well it is optimized
for a particular chip.
And so the big advantage that
Nvidia has today is that their software ecosystem
is so much more mature, right?
I can grab a model,
it has all the necessary optimizations for
Nvidia to run out of the box, right?
I don't have to do anything.
But with some of these other chips,
I may have to do a lot more of these optimizations myself,
right?
And that's what gives them the strategic advantage at the moment.
So as we've touched on,
AI software is heavily dependent on hardware.
But what Guido was pointing towards here
is the performance of hardware being heavily
integrated with software.
So, NVIDIA's CUDA system makes it easier for engineers to plug in and make optimizations,
like running with lower precision numbers.
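To make that concrete, here is a minimal sketch of what running in lower precision can look like in practice, assuming PyTorch and an Nvidia GPU with the CUDA build installed; the toy model and sizes are invented purely for illustration, not anything discussed in the episode.

import torch
import torch.nn as nn

device = "cuda"  # assumes an Nvidia GPU and the CUDA build of PyTorch

# Stand-in for a real network; in practice you would load a pretrained model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).to(device)

x = torch.randn(64, 1024, device=device)

# Mixed precision: matrix multiplies run in float16 on the GPU's tensor cores,
# while numerically sensitive operations are kept in float32 automatically.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

print(y.dtype)  # torch.float16 for the autocast-eligible output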
Here is Guido speaking to the kind of optimizations that do exist.
And what does that actually look like in terms of those software optimizations?
Like what kind of developers are working on that?
Because that also seems to be maybe an emerging space where different companies are having to hire developers to actually facilitate that integration.
Yeah, and it happens at all layers of the stack.
Some of it is coming from academia.
Some of it is done by the large companies that operate in the space, right?
Some of them are frankly done by enthusiasts that just want to see their models run faster.
But to give an idea of how this works, like, for example, typically a floating point number is represented in 32 bits, right?
And some people figured out how to reduce that to 16 bits.
And then somebody was like, well, actually, we can do it in 8 bits.
And you have to be really careful how you do it.
You have to normalize to make sure it doesn't overrun or underrun, right?
But if you normalize everything properly, you can use much, much shorter floats or integers for these calculations.
There's many tricks like that that, you know, like the really good AI developers use
to squeeze more performance out of the chips that they have.
So to reiterate Guido's point, floating point numbers are typically represented in 32 bits.
That's 32 zeros and ones, with the first bit being for sign, the next eight for the exponent,
and the next 23 for the fraction.
This gives a fairly large range between the smallest possible value and the largest possible value,
while also allowing many steps in between.
Now, developers can choose to encode numbers in other systems with fewer bits,
but the trade-off comes with precision.
So depending on the numbers that you're working with,
this may or may not have much consequence.
But this does require some checking and normalizing,
plus an eye for overrunning or underrunning.
That's when you get a number so small or so large
that it can't be properly encoded in a system.
And just to give a sense for size,
the range of 32-bit floats lies between 10 to the power of 38 and 10 to the power of negative 38.
That's a pretty big range, while 16-bit floats operate in a much narrower range, between 10 to the power of 4 and 10 to the power of negative 5.
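For anyone who wants to see those limits directly, here is a small sketch using NumPy, our own illustration rather than something from the episode, that prints each format's range and shows what happens when a value doesn't fit.

import numpy as np

# Print the largest and smallest (normal) values each format can represent.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "smallest normal:", info.tiny)
# float32 tops out around 3.4e38; float16 tops out at 65504.

# A value outside float16's range overflows to infinity...
print(np.float16(1e5))     # inf (NumPy may also emit an overflow warning)
# ...and nearby values collapse together because the fraction has fewer bits.
print(np.float16(1.0001))  # 1.0, since the difference is below float16 precision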
Now, when many people think of semiconductors, they naturally think of Moore's Law.
That's the term that describes the phenomenon, observed by Gordon Moore back in 1965, by the way, where the number of
transistors in an integrated circuit doubles every two years. But despite our collective success
for decades to continue to push more computation onto smaller chips, are we now at the limits of
lithography? For example, an Apple M1 Ultra chip from 2022 has 114 billion, that's billion with a B,
transistors, and if we compare that to the ARM1 processor from 1985, that had 25,000. And by the way,
the Apple M1 Ultra is not even the highest transistor count.
Today, I believe that belongs to the Wafer Scale Engine 2 by Cerebras
with 2.6 trillion transistors.
So looking ahead, are we at the point where we really don't see
the same kind of advancement in at least the physical architecture of chips?
And if so, where do we see advancements moving forward?
Is it in the software?
Is it in the specialization of these chips?
How do you see this industry moving forward?
Yeah, great question.
There's two things to tease apart there.
Like Moore's Law is actually still, as of today, alive and kicking, right?
Moore's Law talks about the density of transistors on a chip,
and we're still increasing that, right?
The scale of transistors keeps going down, right?
Is it at exactly the same speed? I don't know.
But as of today, right, if you plot the curve, it seems to be intact.
There's a second thing called Dennard scaling,
which basically said that just as the number of transistors
I can squeeze onto a chip, right, doubles every 18 months or so,
the power per transistor would at the same time decrease by the same factor.
Strictly speaking, it says something about frequency, but the net outcome is power.
And that, for the last 10, 15 years or so, no longer is true.
If you look at the frequency of a CPU, it hasn't moved much over the past 10, 12, 15 years.
And the net result of this is we're getting chips that have more transistors,
but each individual core doesn't actually run faster.
And what this means is we have to have lots and lots more parallel cores,
and this is why these tensor operations are so attractive, right?
Like on a single core, I can't add numbers more quickly,
but if I can do a matrix operation instead,
it basically allows me to do many of them in parallel at the same time, right?
The second big consequence of that is that our chips are getting more and more power hungry.
If you look at even a graphics card for a gaming PC today, right,
you have these graphics cards that are like hundreds of watts of power,
a 500-watt card, right?
That is much, much more than they used to be.
And that trend is going to continue.
And if we look at what's happening in data centers,
we're seeing more and more things like liquid cooling,
at least being experimented with,
or in some cases getting deployed,
where basically the energy densities for these AI chips
are getting so high that we need novel cooling solutions to make them happen.
So Moore's Law, yes, but power is becoming an issue.
heat is becoming an issue,
and we need to rely more and more on parallel processing.
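To illustrate that point in miniature, here is a tiny sketch in NumPy, our own example rather than anything from the episode: the same one million additions run far faster as one vectorized operation than as an element-by-element loop, which is exactly the kind of parallelism modern chips are built around.

import time
import numpy as np

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# One addition at a time: a single scalar stream of instructions.
start = time.perf_counter()
out = np.empty_like(a)
for i in range(n):
    out[i] = a[i] + b[i]
loop_time = time.perf_counter() - start

# The same million additions expressed as one vector operation that the
# hardware can run in parallel (SIMD lanes on a CPU, thousands of cores on a GPU).
start = time.perf_counter()
out_vec = a + b
vector_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vector_time:.5f}s")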
So it sounds like Moore's Law is indeed not quite
dead, but perhaps a little more complex than it once was. Performance increases continue as we
integrate parallel cores, but we're also seeing chips become a lot more power hungry. All of this
will continue being dynamic as demand continues to outpace supply for high performance chips. So as we
look ahead, what does all this mean for competition and cost? You'll learn a lot more about that
in the rest of our AI hardware series, tackling the questions that everybody is asking, including
We currently don't have as many AI chips or servers as we'd like to have.
How do you think about the relationship between compute, capital, and then the technology that we have today?
Yeah, that's a million dollar question or maybe a trillion dollar question.
We'll see you there.
Thanks for listening to the a16z Podcast.
If you like this episode, don't forget to subscribe, leave a review, or tell a friend.
We also recently launched on YouTube at youtube.com slash a16z underscore video,
where you'll find exclusive video content. We'll see you next time.