a16z Podcast - How 'Hyperscalers' are Innovating — and Competing — in the Data Center
Episode Date: December 10, 2021

Innovation in the data center has been constrained by the traditional model of suppliers providing fixed-function chips that limit how much the biggest data center operators can differentiate. But programmable chips have emerged that allow these companies to not only increase performance, but innovate throughout the pipeline, from operating system to networking interface to user application.

This is a major trend among "hyperscalers," which are some of the world's most well-known companies running massive data centers with tens of thousands of servers. We're talking about companies like Amazon, Facebook, Microsoft, Google, Apple, Alibaba, Tencent.

To talk about the trends in data centers and how software may be "eating the world of the data center," we talked this summer to two experts. Martin Casado is an a16z general partner focused on enterprise investing. Before that he was a pioneer in the software-defined networking movement and the cofounder of Nicira, which was acquired by VMware. (Martin has written frequently on infrastructure and data-center issues and has appeared on many a16z podcasts on these topics.)

He's joined by Nick McKeown, a Stanford professor of computer science who has founded multiple companies (and was Martin's cofounder at Nicira) and has worked with hyperscalers to innovate within their data centers. After this podcast was recorded, Nick was appointed Senior Vice President and General Manager of a new Intel organization, the Network and Edge Group. We begin with Nick, talking about the sheer scale of data-center traffic.
Transcript
Welcome to the a16z Podcast. I'm Zoran. Today we're talking about trends in data centers, focusing on hyperscalers. These are some of the world's most well-known companies running massive data centers with hundreds of thousands of servers. We're talking about companies like Amazon, Facebook, Microsoft, Google, Apple, and Tencent. Innovation in the data center traditionally has been constrained by the model in which suppliers provided fixed-function networking chips that limit how much the biggest data center operators can differentiate. In recent years, though,
programmable chips have allowed these companies to customize and increase performance,
latency, and reliability for the incredible amounts of data center traffic they manage.
This trend also allows the hypers to innovate throughout the pipeline, from operating system
to networking interface to user application.
To talk about the trends in data centers, we talked this summer to two experts.
Martin Casado is an a16z general partner focused on enterprise investing.
Before that, he was a pioneer in the software-defined networking movement and the co-founder
of Nicira, which was acquired by VMware.
He's joined by Nick McKeown, a Stanford professor of computer science, who has founded multiple
companies and was Martin's co-founder at Nicira and has worked with hyperscalers to innovate within
their data centers.
After this podcast was recorded, Nick was appointed senior vice president and general
manager of a new Intel organization, the Network and Edge Group.
We start with Nick explaining the sheer scale of traffic inside data centers and how that is
changing the industry.
If you were to cut a line vertically through the United States and then look at all of the
public internet traffic that was passing from left to right and right to left. We call this
the bisection bandwidth of the internet. It's essentially how much capacity the internet has
crossing the United States. That is less than the amount of traffic going between a couple of
hundred servers inside one data center. It's just being inverted. The whole model of how
communication takes place has really become dominated by the insides of data centers, by all the
communication that's taking place for compute and, nowadays, machine learning training and
inference. It's totally dominating how we communicate. So it means that the
industry itself has focused on how to serve that market. And so the majority of the silicon
that is being produced, whether it's CPUs, whether it's storage, whether it's for networking,
for the network interfaces and the switches,
it's mostly targeted on that data center business
just because of the sheer scale of it.
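To make that comparison concrete, here is a rough back-of-envelope sketch in Python. Every number in it is an illustrative assumption (server count, NIC speed, and the transcontinental capacity figure), not a measurement from the episode; the point is only the order of magnitude.

```python
# Rough back-of-envelope comparison; every figure here is an illustrative
# assumption, not a number quoted in the episode.

servers = 200                 # "a couple of hundred servers"
nic_speed_gbps = 100          # assume each server has a 100 Gb/s network interface

# Aggregate bandwidth those servers could exchange inside one data center
intra_dc_tbps = servers * nic_speed_gbps / 1000
print(f"~{intra_dc_tbps:.0f} Tb/s between {servers} servers inside one data center")

# Assumed figure for the public internet's bisection bandwidth across the US
us_bisection_tbps = 15
print(f"vs. an assumed ~{us_bisection_tbps} Tb/s of public internet capacity crossing the country")
```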
And so we're going to get into how data centers got here
and how they're evolving.
But first, to situate ourselves in what's happening right now,
who are the companies that are running these massive data centers,
these hyperscalers, as they're called?
And what's the current model?
Like who supplies them and who builds the data centers?
So hyperscalers are those companies that we know very well,
like Google and Amazon and Facebook and Microsoft,
the companies that are building,
warehouse scale computing. They've been doing this for 10 or 15 years now, and they have big
warehouses full of hundreds of thousands of servers that are interconnected by rich networks
with huge amounts of storage. And this is where they serve up things like Google Search and the
Facebook apps that we all use, or nowadays office products like Microsoft Office. So there's a
rich network of suppliers, technology suppliers, who serve them. And nowadays, actually for most of those
suppliers, these big hyperscalers represent a very big part of the market share.
The hyperscalers are well known for building their own data centers, everything from, famously,
Google building their own servers, to the way they do cooling, to the way they do real estate, to the
entire software stack. And the reason is because they're actually pushing these technologies
at scales and efficiencies that they just weren't designed for. And that's required a pretty large
redesign. And so this is in the context of organizations that are incredibly sophisticated,
incredibly savvy, and have really remade every other part of the data center down to like,
you know, the ducting and the electricity. Okay, so let's look at how we got here. How have
data centers changed over the years? The early data centers, some of their first attempts,
they were made out of off-the-shelf computers that they would just order from, you know,
companies like Dell, and Compaq was probably around in those days. And then they would arrange them
into racks of servers, perhaps 30 or 40 in a rack. And then they would hang as much disk
storage as they could off it, put as much memory into them as they could. And then they would
string them all together with networking equipment that they would just buy from regular equipment vendors.
And, you know, there was an ecosystem of vendors like Cisco and HP and folks like this
who were selling the networking equipment for them in the early days. So that's what the early
data centers look like. And as they grew, there was fear in their eyes.
because they realized that they were building systems at such phenomenal scale that no one had ever built before that, you know, these systems just weren't going to be able to scale up to the size that they needed. Now, in the data center itself, hundreds of thousands of servers are connected together by thousands of switches. And each switch decides where it's going to send these packets of information next so that they eventually reach the correct server. And today these switches are made
from a single silicon chip called an ASIC.
And a single switch ASIC can process over 10 billion packets per second today.
Data rates well over 10 terabits per second, all within a single piece of silicon.
So these are immensely powerful devices.
To put this into perspective, a single switch chip processes a new packet in the time it takes a photon to travel just one inch.
So that's the time that it takes for it to process one of these.
And a 10 terabits per second switch can stream the entire Netflix catalog at high definition
in less than a minute.
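Both comparisons can be sanity-checked with quick arithmetic. In the sketch below, the packet rate and line rate come from the figures Nick quotes, while the size of an HD Netflix catalog is an assumed round number for illustration.

```python
# Quick arithmetic behind the two comparisons above.

packets_per_second = 10e9                  # "over 10 billion packets per second"
time_per_packet_s = 1 / packets_per_second
speed_of_light_m_per_s = 3e8
inches = speed_of_light_m_per_s * time_per_packet_s / 0.0254
print(f"In {time_per_packet_s * 1e12:.0f} ps a photon travels about {inches:.1f} inches")

switch_tbps = 10                           # "well over 10 terabits per second"
catalog_terabytes = 50                     # assumed size of an HD catalog (illustrative)
catalog_terabits = catalog_terabytes * 8
print(f"Streaming ~{catalog_terabytes} TB at {switch_tbps} Tb/s takes "
      f"~{catalog_terabits / switch_tbps:.0f} seconds")
```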
All right.
So we have these incredibly powerful data centers, hundreds of thousands of servers,
incredibly powerful single switch ASICs that have emerged.
Martin, how have hypers reacted to these trends?
Like, what does this processing power that Nick just outlined for these chips mean for how
hypers think about their business and their strategies?
Right.
these are very, very specialized chips that have, you know, very demanding requirements,
but they're very integral to a core function of the data center, which is communication.
I mean, all the data center workloads, or most of them, are distributed.
In order for them to work, they have got to communicate a lot,
which means, like, latency is very important.
Prioritization of traffic is very important, and that's what these switches do.
And so the question has been, like, you know, if you're, you know,
a Google or an Amazon or a Facebook,
part of your differentiation is how fast your data center is, right?
And once you've changed so much more of your stack,
then the question is,
what can you do in the network in order to differentiate at that level?
And that is kind of the basis for this discussion,
and it's been around for 15 years, right?
It continues to evolve.
That's right.
Yeah, and as these data center companies get bigger and bigger,
they're naturally going to consider whether the silicon that they can buy
will best support their customers' workloads
or if they would do better to modify and design them for themselves.
There's a natural question as they get to a scale
where they're operating at enormous economies of scale,
but dealing with very ferocious competitors, the other hyperscalers.
And in order to be able to compete with each other,
they've got to be looking for not only cost reduction,
but also ways to differentiate in terms of the performance they get.
Martin, give us the sense of the range of options
that hypers have before them as they consider, you know,
how to differentiate up to and including building their own chip.
Like one of them is, okay, so you can buy a CPU from somebody else,
an ASIC from somebody else,
like whether it's a standard CPU or a NIC,
and then you could write, you know, your own software for it,
even really low-level software, right?
I'm not talking about application code.
You're writing a low-level code, right?
And so pretty much all the hypers will do this.
This is so they can most efficiently drive it
and kind of use the majority of its features.
Another thing that you can do is you can include an FPGA,
a field programmable gate array.
So this is a way in which a hyperscaler can build their own functionality in hardware.
Normally this is ancillary functionality.
Like it won't do full switching.
It won't be a full NIC,
but it'll add some ancillary,
but very, very high performance functionality.
And it'll do this in a programmable way,
like let's say on the NIC or on the motherboard of the switch.
And then the most aggressive by far is you build your own ASIC.
I mean, this is a phenomenal undertaking.
My opinion is, in order for a hyperscaler to build an ASIC, I think it's actually crazy.
It's just such a phenomenal effort, whatever.
And so what we do need, particularly in networking, is networking chips that can be programmed so that you have tons of flexibility, but you don't lose the cost-performance benefits.
And so, you know, that is kind of the tradeoff between, you know, do you build your own ASIC versus, you know, do you get the same type of functionality
through programming. Nick, if you don't mind, I think it'd be interesting to just talk through
like, what does it mean to build an ASIC just from like a resource in time perspective?
Yeah, absolutely. So today, the general rule of thumb is that it's $150 to $200 million to
contemplate building an ASIC and you need a team typically of a couple of hundred people.
And this is not something that you want to do lightly unless you're doing something in extremely
high volume or you're going to derive immense value from the exercise.
So in most cases, it sort of benefits the industry to have technology vendors who are building
these chips, taking on the great investment and the great cost to do this, and selling to
multiple customers.
And herein lies the problem, right?
If the costs favor having a technology vendor who is selling to multiple customers, they
have to figure out what is the product they're going to build.
And so if that product has a fixed functionality for all of those customers, it leaves the
customers with no opportunity to differentiate with each other. And so where we've seen this change
first is in some of the accelerators that hyperscalers have developed for themselves, or in the
customer-specific network interfaces, network interface chips, and switches that have been developed by
some. However, there is an alternative way of doing this that Martin was just alluding to,
and that is, instead of making it fixed function, make all of those devices programmable as well. And so
the hyperscaler can determine how they're going to introduce new ideas to differentiate from
their competitors by programming the device themselves. So lift these functions up and out of the
hardware, just make them programs. And this has been the trend that we've seen over the last
few years. It's essentially software eating the world of the data center, for the
infrastructure as a whole. And at this point this seems inevitable. You know, it's the natural next
step beyond disaggregation of the equipment by breaking it down into chips that are programmed
rather than buying sort of vertically integrated, complete solutions. And so the natural next
choice is take the chips that were themselves fixed, make them programmable, and allow for that
differentiation. I've been involved in developing some networking ASICs exactly for this purpose.
and I've observed over the last decade that as the hypers want to have more control
over how individual packets are processed, they will do interesting, crazy, sometimes wild
things that I would never have thought of, that their competitors don't think of,
and so they're actually diverging and going off in slightly different directions.
That's the nature of competition, and it's the first time you're seeing that real competition
taking place in the infrastructure of these hyperscalers.
And so the job of the technology providers, whether that's, you know,
companies like Intel, Broadcom, Nvidia, the task is kind of simple.
Design and build the best programmable devices you can and execute like hell.
Yeah.
I still think that we should revisit the question.
Why is Amazon, or at least rumors of Amazon, building their own ASIC, right?
I mean, you could believe it's because the technology hasn't ever existed before.
Or you could believe there's actually something endemic to the industry.
And I actually think it's a bit of a spicy question, but it's very interesting, which is, you know, the CPU grew up in an era where it was really about programmability.
And the networking industry did not.
And they've been exceptionally resistant to it.
And now we're actually seeing the dam break.
And I do think it's two things. One is we actually now have the right abstractions so people can program.
So there has been a technical breakthrough.
I also do think that there's now finally an awareness in the industry that this is what the customers want.
And there's enough competition to get it that
we're actually seeing a restructuring of the industry to allow for this.
It's actually leading to something, which I think is quite transformational.
So we're at the point now where every piece of the pipeline that goes from a user application,
like a Facebook application in a server, all the way through the Linux operating system,
out through the network interface, all the way through the switches,
all the way into the application at the other end, that entire pipeline can now be programmed.
And that's something that's really just happened in the last few years.
And what we have developed as an industry are extremely efficient programmable pipelines,
which you can program in a high-level language.
It's called P4.
It doesn't particularly matter the specifics of this language, but it lends itself to writing programs
that run very, very fast.
And so you can write programs where previously fixed function was required.
And so it means that the network becomes programmable from end to end.
You can change its behavior to be whatever you want.
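To give a flavor of what programming a packet-processing pipeline means, here is a minimal Python sketch of the match-action abstraction that languages like P4 expose. It is an illustration of the idea only; the table layout and field names are hypothetical, and real P4 programs are compiled onto switch silicon rather than run as Python.

```python
# Minimal sketch of a match-action pipeline, the abstraction exposed by
# languages like P4. Table layout and field names are hypothetical; a real
# P4 program is compiled onto switch hardware, not run as Python.

def ipv4_lpm_lookup(table, dst_ip):
    """Longest-prefix match over (prefix, length) -> egress-port entries."""
    best = None
    for (prefix, length), port in table.items():
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        if dst_ip & mask == prefix and (best is None or length > best[0]):
            best = (length, port)
    return best[1] if best else None

def pipeline(packet, ipv4_table):
    """Parse -> match -> action for one packet; returns an egress port or None (drop)."""
    if packet.get("ethertype") != 0x0800:       # this sketch parses IPv4 only
        return None
    port = ipv4_lpm_lookup(ipv4_table, packet["dst_ip"])
    if port is None:
        return None                             # no matching route: drop
    packet["ttl"] -= 1                          # the "action": rewrite a header field
    return port

# Route 10.0.1.0/24 to port 3 and the rest of 10.0.0.0/8 to port 1
table = {(0x0A000100, 24): 3, (0x0A000000, 8): 1}
pkt = {"ethertype": 0x0800, "dst_ip": 0x0A000105, "ttl": 64}  # 10.0.1.5
print(pipeline(pkt, table))                     # -> 3
```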
Now, in a generation before that, there was another transformation that took place that came out of Martin's Ph.D. work. And this was to change and transform how networks are controlled. And so this revolution took place in which the software that is used to define the behavior of how networks are controlled, not how individual packets are processed, but the sort of overall management and control of the network. That was opened up and taken essentially away from the
equipment vendors and handed to those who own and operate the biggest networks in the world,
the internet service providers, the hypers. And so today, every hyperscaler in the world
builds its own networking equipment, writes the software to control it. And they take it for
granted that they can download it, they can write it, they can commission someone else to
write it, they can modify somebody else's. But this is software that is under their control.
And that was the big deal. So not only are the networks programmable sort of from end
to end, they're programmable from the top down in any way that we would manage and control them.
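As a rough sketch of that control split: the kind of thing an operator's own control software does is compute paths over a global view of the topology and install forwarding rules on each switch. The topology and rule format below are hypothetical, in the spirit of the OpenFlow-style separation that grew out of that earlier work rather than any particular hyperscaler's system.

```python
# Sketch of the control split: the operator's own software computes routes
# over a global view of the network and pushes rules to each switch.
# The topology and rule format are hypothetical.

import heapq

TOPOLOGY = {                      # link costs between switches s1..s4
    "s1": {"s2": 1, "s3": 4},
    "s2": {"s1": 1, "s3": 1, "s4": 5},
    "s3": {"s1": 4, "s2": 1, "s4": 1},
    "s4": {"s2": 5, "s3": 1},
}

def shortest_path(src, dst):
    """Dijkstra over the controller's global view of the topology."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        for nbr, cost in TOPOLOGY[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

def install_rules(dst_prefix, path):
    """'Install' a next-hop rule on every switch along the computed path."""
    for here, nxt in zip(path, path[1:]):
        print(f"{here}: if dst in {dst_prefix} -> forward to {nxt}")

# Steers traffic for 10.0.4.0/24 along s1 -> s2 -> s3 -> s4
install_rules("10.0.4.0/24", shortest_path("s1", "s4"))
```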
And so if you step back a little bit, the entire system, the computers, the storage, the network,
is becoming a big programmable platform as one. It's a big distributed system that you can think
of programming to do exactly what you want. And this is going to change the way that the hypers,
anybody that builds a network, internet service providers, mobile operators, they're all going to
start thinking differently about how they build not only networks, but the entire data center
that they use to run the software for their own infrastructure and for their customers.
I mean, one way to view this is the following, which is one of the reasons perhaps that networks
have never been programmable is because nobody's wanted to program them.
It's like, you know, computers were programmable because everybody had a CPU and we were writing
applications, but there wasn't any gain from programming in networking.
But the hypers, they get a very specific benefit for programming the network.
Like Nick said it perfectly, which is basically that of all of the world's internet traffic,
a large percentage is actually happening within the data center.
That is a competitive advantage.
Like latency is a competitive advantage.
I'm sure Google has some algorithm that shows how much money they waste for every millisecond of latency, right?
And a lot of that latency is going to be network related.
And if that is the case, then, you know, like this is certainly for their internal operations very important.
And so now you have, you know, a large set of companies that have a specific interest in programmability.
Then they're putting the pressure on the industry at large for programmability, and that's happening now.
And I think that there's two potential net impacts.
One of them is I do think that this allows for disruption in the industry in the sense of, like, new challengers, maybe a shifting of position of incumbents, empowering different parts of the stack, right?
And then the second thing is, like, yeah, like the actual technology innovation that's going to come out of this is also enormous because now we've just unlocked another key piece of infrastructure that can be programmable and can add differentiated value.
So what are some of the ways hyperscalers are using programmability?
What I've seen people do, once you give them a programmable device that allows them to decide how packets are processed within the network, one of the first things they do is they simplify.
They take many of the things that were embedded in the fixed function devices that they never
used that were sitting there as a vulnerability for them because they might actually
trigger a problem that they don't really understand because they're not actually using
that part of the fixed function device.
They throw them all out.
So they write programs that just boil it down to the essence.
The networking people talk in terms of protocols, these are the ways in which different
devices communicate within a network.
And a typical fixed function device might have to support
45 or 50 of these different types of protocols, yet most hypers are using three or four
internally within their data centers. So they want to get the other ones out because they're
consuming power, they're a security risk, they're a vulnerability. So this is one of the first things
they do. The second thing that they do is they want to get greater visibility into what their
system is doing, into where packets are flowing, where storage blocks are. Computers have become
very good at logging what they're doing. If you want to debug what a computer is doing, you can use
software, you can use hardware, you can go and interrogate what it was doing, and then even
reproduce the problems on the fly. In networking and storage, this is extremely difficult,
and it's very rudimentary today, the sorts of tools that you have available to you.
No one has ever operated at sufficient scale before to really figure out what they needed.
The hyperscalers did, though; they knew exactly what they wanted. And as soon as you put into
their hands the ability to write programs to tell them what's going on, they can look at and interrogate
what their infrastructure is doing.
It's one of the first things that they do.
They want to know, where did packets go?
How long did a storage block actually take to reach its destination?
If it failed to get written into the correct location, why was that?
If it actually met some congestion along the way, whose fault was it so they can go back
and rectify it?
So as soon as you put this capability into their hands, they all go off and measure the
things that they've been wanting to measure for a long time to give them greater visibility
into what the system is doing.
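This is the kind of visibility that in-band telemetry designs on programmable switches provide: each hop appends a small record (switch id, timestamp, queue depth) to the packet, and the receiving end reconstructs where the time went. The sketch below illustrates the idea with made-up hop data.

```python
# Sketch of in-band telemetry on a programmable pipeline: each hop appends
# (switch id, timestamp, queue depth) to the packet, and the receiver works
# out per-hop latency. The hop data below is made up.

def add_telemetry(packet, switch_id, timestamp_us, queue_depth):
    packet.setdefault("telemetry", []).append(
        {"switch": switch_id, "ts_us": timestamp_us, "queue": queue_depth})

def per_hop_report(packet):
    hops = packet["telemetry"]
    for prev, cur in zip(hops, hops[1:]):
        latency_us = cur["ts_us"] - prev["ts_us"]
        print(f"{prev['switch']} -> {cur['switch']}: {latency_us} us, "
              f"queue depth at {cur['switch']} = {cur['queue']}")

pkt = {"payload": b"..."}
add_telemetry(pkt, "tor-1", 1000, 2)      # top-of-rack switch, shallow queue
add_telemetry(pkt, "spine-3", 1180, 40)   # congested spine: deep queue, big delay
add_telemetry(pkt, "tor-9", 1210, 1)
per_hop_report(pkt)
```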
And then the next thing they do is they just want to smooth and control everything in a way that everything is just better behaved and easier to keep going at that large scale.
So they can sort of tame the traffic and route it in the directions that they need to give them the reliability that they need in their system and the redundancy in case a link fails or a part of the system fails.
And so these are the things that they do.
They're in a way not that surprising, but they only know how to do this because they're operating at such huge scale.
No one that ever designed a chip actually was running a big hyperscaler system.
They didn't know what to put in there.
So if you make it programmable, you can leave the customer to actually figure out what they need
and write the software they need to do it.
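Going back to the first thing Nick mentioned, stripping the device down to the handful of protocols actually in use, here is a tiny sketch of what "boiling it down to the essence" can look like. The particular protocol set is an assumption for illustration.

```python
# Sketch of stripping a device down to the protocols actually in use:
# the parser admits only an assumed minimal set and drops everything else,
# instead of carrying support for dozens of legacy protocols.

ALLOWED_ETHERTYPES = {0x0800}        # IPv4 only, as an illustrative assumption
ALLOWED_IP_PROTOCOLS = {6, 17}       # TCP and UDP

def parse(packet):
    """Return the fields later stages need, or None to drop the packet."""
    if packet["ethertype"] not in ALLOWED_ETHERTYPES:
        return None                  # legacy protocols are simply never parsed
    if packet["ip_proto"] not in ALLOWED_IP_PROTOCOLS:
        return None
    return {"dst_ip": packet["dst_ip"], "ip_proto": packet["ip_proto"]}

print(parse({"ethertype": 0x0800, "ip_proto": 6, "dst_ip": "10.0.0.7"}))   # accepted
print(parse({"ethertype": 0x8847, "ip_proto": 6, "dst_ip": "10.0.0.7"}))   # MPLS: dropped
```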
What's the broader story here, Martin?
What are some of the other implications of this for the industry?
I mean, I think the undertone in all of this, which just may not be super apparent,
is how dramatically the industry had to shift in order for this to happen.
So when Nick and I started in the mid-2000s,
the networking industry looked like the mainframe industry.
Like you would buy a box from a network vendor,
and everything in it was written by the vendor.
Every application, the operating systems, all of the ASICs,
I mean, they wrapped the sheet metal around it, you know, the whole thing.
And rather than having a programming model,
they would train people to operate it.
So there wasn't even a programming model.
Again, this is 15 years ago.
It's not that long.
And what has happened since then is not only can you control these things programmatically
rather than require human beings, this is the control thing that Nick mentioned,
but also the actual per-packet operations you can program,
which is actually a very, very tremendous technical challenge.
And as a result, I actually think that there's a potential for reshaping the entire industry.
I don't think it's just about, like, hyperscalers differentiating.
It's like, you know, I mean, do you have a shift in who provides the silicon, right?
I mean, you know, traditionally you've had companies that just focused on fixed function, and you could have a new entrant.
The big networking vendors, they're going to have to kind of rethink what it means to them.
I mean, do they kind of evolve more of an ecosystem?
Do new players arise that are kind of better at this horizontalization?
What does it mean to the OEMs?
Maybe the OEMs get empowered, like, you know, the Dells of the world or the Taiwanese ODMs.
And so I think that the broader story here is one of
horizontalization of a previously vertical market, and a large market at that. We're not talking about
some niche thing. This is like a 40 to 100 billion dollar market. And this just highlights how
important network programmability is. And the fact that it needs to be programmable and now it's
becoming programmable. Another big change that is taking place is as these systems turn into
these kind of programmable platforms, big programmable systems where the infrastructure itself is
entirely programmable. The software that they're using is increasingly made up of open source.
And open source, which, frankly, in the infrastructure industry, apart from Linux, was considered
a bit of a joke 15 years ago, is now taken as a given: the majority of the infrastructure
will be built by open source, open source that people pull together and assemble, and then
hopefully in many cases will push back out again to get that benefit of all of the sort of the
quality assurance that they get from multiple people working on it. And this is because a lot of
the pieces of the infrastructure are necessary but non-differentiating. And so it benefits them to get that
sharing of many, many people from outside their own company working on this together. And so as
the infrastructure becomes more and more created from open source, it itself becomes more open
to scrutiny. It's more transparent in the sense that it becomes easier to find bad stuff
that is being placed into that open source, or bugs, et cetera.
And so over time, it should actually become far more resilient,
far more secure than the infrastructure of the past.
So open source is going to play a big and an increasing part of how the infrastructure is built.
So, Nick, if we take these ideas of programmability and open source,
like what other areas will this touch?
How will consumers and enterprises see the broader effects of these trends?
I think that all of this, these same dynamics are now happening to the rollout of
5G and the investment in 5G. Cellular networks in the past were like these walled gardens.
The industry was very protective of the way things were done. But 5G was deliberately designed
and architected to feel very much like the rest of the internet. And the reason that people
get very excited about 5G is that it's essentially going to replace a lot of the wireless
communications that we use today: Wi-Fi, ways of communicating within factories.
It's connecting all of those new IoT devices like, you know, our dishwashers that now communicate
online, and the cameras and the things that we have in our homes, the robots inside factories.
All of that is being built out of the same ideas, programmable, disaggregated, low-cost,
open-source software that is written and owned by the mobile operators.
So 5G is becoming software defined as well.
So you're going to get this entire infrastructure inside the hypers all the way out to edge computing,
all the way out to 5G, all of this infrastructure is going to be defined by software running on
programmable hardware. That is going to change everything because it's going to open the floodgates
for massive amounts of innovation, which is exactly what we need.
Martin, Nick, thank you so much for being with us today. Thank you. That was great.
Great chatting. Bye-bye now.