In The Arena by TechArena - Efficient and Secure Data Center Computing with AMD's Robert Hormuth and Prabhu Jayanna
Episode Date: November 28, 2023
TechArena host Allyson Klein sat down with AMD's Robert Hormuth and Prabhu Jayanna at the Open Compute Summit to discuss the advancements AMD EPYC processors have delivered in performance, performance efficiency, and security capability, including AMD's Infinity Guard technology.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Tech Arena. My name is Allyson Klein, and I'm coming to you this week from the OCP Summit in San Jose, California. I'm so delighted to be joined by Robert Hormuth
and Prabhu Jayanna from AMD. Welcome to the program, guys. Thanks, Allyson. Thanks for having us.
Thank you, Allyson. So why don't we just start with introductions. Robert, do you want to introduce yourself? Sure, Robert Hormuth, I'm Corporate
Vice President of the Data Center Solutions Group Architecture and Strategy team. So we're focused
on finding the yellow brick road to the future and planning the path there. And Prabhu? I'm
Prabhu Jayanna, Senior Director, Product Security, Architecture,
and Advanced Prototyping. I focus on developing advanced concepts and getting them into products
and roadmaps over time. You know, OCP is one of my favorite events of the year. I think it's a great
place to get a sense of where the large cloud service providers are aiming with their future infrastructure and what they need specifically from the silicon arena. Forrest Norrod gave a talk earlier today
where he talked about the EPYC architecture and how it's made incredible advancements in terms of
what these large players can do. When you look at cloud native, and we'll just start with you,
Robert, when you look at cloud native workloads and you look at what you've delivered with EPYC, what do you think stands out as the real differentiating capabilities of that architecture?
And what are you hearing from your large customers about what they need as they move forward in this AI era?
Well, I think, you know, we need to start a little bit with cloud native as a catch-all phrase.
It's almost as glorious as cloud.
But when I think about cloud native, it's more about a principle of design.
It's an architecture.
It's a style.
It's a replatforming of applications that is really focused on having application scalability, flexibility, the resiliencies built in the application, not at the infrastructure.
It's about observing the applications, the speed of delivering, updating, managing, deploying.
So it's a whole bunch of things that define cloud native.
And, you know, there have been a lot of workload changes over the last, you know, 40, 50 years.
When we think back to the large monolithic physical applications of yesteryear, to the
virtualization era, well, that's more of a delivery vehicle, but it's still a very
monolithic style of application development and framework.
But then as you start to move towards containers and microservices and functions
as a service, you know, the implications for the underlying hardware are
very profound. When you think about a large monolith, multi-core threading, shared variables in
the caches, all that locality is really important. And, you know, you have maybe one per machine.
You pivot to the far contrast of functions as a service,
and there's thousands of these little functions.
They talk to each other over TCP/IP.
It's a write once, run anywhere, and run when needed kind of model.
And so it really drives architectural choices at the microarchitecture
level because of that drastic difference in terms of density of functions, the data locality.
It creates new system bottlenecks in terms of memory bandwidth and branch predictors and data
locality. And you think about these functions that maybe run for milliseconds.
It's hard to get your branch predictors to train when you have thousands of them running,
and they only run for short segments of time.
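To make that contrast concrete, here is a minimal sketch of the functions-as-a-service style Robert describes: a tiny, stateless handler that runs for milliseconds and talks to its peers over TCP/IP. This is an editorial illustration, not any particular vendor's API; the handler and payload shape are assumptions.

```python
# A minimal, self-contained sketch of a function-as-a-service style worker:
# stateless, millisecond-scale work, invoked over TCP/IP.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle(event: dict) -> dict:
    # The entire "application" is this short-lived unit of work; state lives
    # elsewhere (a queue, a cache, a database), not in the process.
    return {"total": sum(event.get("values", []))}

class FunctionEndpoint(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.dumps(handle(json.loads(body or b"{}"))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Thousands of endpoints like this can be packed onto one machine, which
    # is why core density matters more than single-thread data locality here.
    HTTPServer(("127.0.0.1", 8080), FunctionEndpoint).serve_forever()
```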
And so that's where, you know, at AMD with the Zen 4 microarchitecture,
we were able to take, you know, one base microarchitecture and then optimize for that whole spectrum. If you look at the suite of the Zen 4 launches from Genoa,
large, big core, high-frequency, big cache; Genoa-X, even bigger cache;
to the new Bergamo, high core density.
We tailored it more towards that cloud native with smaller caches and more cores and more defined, you know, guaranteed memory
bandwidth and all of that.
And so we are able to take one microarchitecture with a common ISA, which is really key, having
one ISA, and tailor the design by changing some of the knobs on the spider charts to
go optimize it, all those things.
And that's really what Forrest was talking about this morning: our ability to take our microarchitecture, our chiplets, and our packaging to
dial in more unique optimizations for the range of workloads as they change.
Large physical monolithic applications are going away, but just like the mainframe
still exists, there will still be some. And so we want to optimize
to get the best performance
across that whole spectrum.
We want to make those optimizations.
When you look at where we're going
in terms of pure performance capability for AI,
and it's been a big discussion today,
and I'm sure it will be for the entire week.
When you look at that workload and you
look at training and inference, where do you think CPUs come into play? And how does AMD
more strategically address the full continuum of the workloads across AI models with your full
portfolio? Well, I think, you know, there's the one thing that, you know, I heard this morning at least is when you saw that from Meta's description this morning of all the different spider charts they showed of different stages of pre-inference decode.
And it showed all these unique different optimizations.
And I think that's where the advanced architectures that we have that are in the works with our unique packages to get a mix and match.
How many to play to cover all those?
Because it's not going to be one size fits all.
Right.
There are plenty of inference, you know,
jobs and applications that are going to run perfectly well on a CPU with
AVX-512.
There's going to be these trillion parameter training jobs that need to run
on what we call a flagship GPU,
which is kind of the high, high end of the GPU range.
And you're going to have everything in between.
What's really going to be critical is the compatibility of the software across that spectrum.
So for AMD, we're very focused on the CPU side of delivering core scaling
with the right ingredients for AI at the CPU level.
And then on the Instinct line, we've got the flagship GPUs that we've talked about, the MI300 family that is coming out.
That will be very optimized for some of those spider charts that Meta showed around training and large-scale inference,
where the CPUs just aren't going to be able to do it.
And so we're trying to optimize across that spectrum.
Everybody's talking about Meta's spider chart today.
I know. That was the best chart I've seen, and it was great.
I'm going to have to get a picture of that for the tech arena.
We're talking about performance, but I want to make sure that we cover other vectors
of what you're delivering from a processor standpoint.
And I'm just going to start with security.
Security is something that is always ranked
as a CIO's top priority.
Beyond performance, security matters.
And hardware security is really critical.
I know that you have baked a tremendous amount of security features
into the EPYC product line and take this very seriously.
Can you just start with a review of what's in the processor today?
What does it deliver in terms of core capabilities?
And then we'll get into where you're going to go in the future.
Sure, I will take that.
At the core of our SoC architecture,
one of the key principles that we have to follow
is ensuring the integrity of the very first instruction
that executes on our CPUs.
At the heart, we have our AMD Secure Processor,
which we have built and proven across many of our portfolio products.
That anchors the root of trust, and from there the chain of trust builds all the way up
to the runtime environment to start with.
And the critical part is not to say it is just the SoC that does all the work.
It's the combination of the SoC,
the firmware that runs within and on top of the SoC,
and the application infrastructure,
to ensure the end-to-end CIA attributes, confidentiality, integrity, and availability,
are secured along the way.
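As an editorial illustration of that chain-of-trust idea (a sketch under simplifying assumptions, not AMD's actual boot flow), each stage measures and checks the next before handing over control, starting from an immutable root. Stage names and digests below are hypothetical.

```python
# Illustrative sketch of a measured boot chain of trust: each verified stage
# checks the integrity of the next before it is allowed to execute.
import hashlib
from dataclasses import dataclass

@dataclass
class BootStage:
    name: str
    image: bytes           # the firmware/boot code for this stage
    expected_sha384: str   # reference value provisioned by the prior stage

def boot(stages: list[BootStage]) -> None:
    # The hardware root of trust (e.g., a security processor) verifies the
    # very first stage before the first instruction ever runs on the CPU.
    for stage in stages:
        digest = hashlib.sha384(stage.image).hexdigest()
        if digest != stage.expected_sha384:
            raise RuntimeError(f"chain of trust broken at {stage.name}; halting")
        print(f"{stage.name}: verified, handing over control")

rom = b"initial boot code"
loader = b"second-stage loader"
boot([
    BootStage("rom", rom, hashlib.sha384(rom).hexdigest()),
    BootStage("loader", loader, hashlib.sha384(loader).hexdigest()),
])
```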
Once you have the SoC in that form,
you still need a platform around it.
Right.
So fundamentally, we need to have a secure platform
to do any of these advanced operations,
you know, load a new AI model on it
and make sure it is not tampered with, right?
And if somebody tries to take that IP away from that platform,
how do we secure that?
Platform-level attestation to ensure
the integrity and the authenticity of
the platform is a key requirement.
Then we'll often ask,
am I supposed to allow
a particular application to run on this platform?
Access control attributes come in, authorization comes into play,
and once you check all of those, then there are the advanced features that we have on our
products for the high-end confidential compute type of environments.
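A hedged sketch of what that attestation handshake can look like in principle: a verifier sends a fresh nonce, the platform returns measurements bound to that nonce and signed with its device key, and the verifier compares them to endorsed reference values. A symmetric HMAC stands in for the vendor-certified asymmetric key a real platform would use; all names and values are hypothetical.

```python
# Illustrative platform attestation handshake (not a real protocol).
import hashlib, hmac, json, secrets

DEVICE_KEY = secrets.token_bytes(32)  # stand-in for a certified attestation key

def make_quote(measurements: dict, nonce: bytes) -> tuple[bytes, bytes]:
    # Platform side: bind the current firmware measurements to the nonce.
    blob = json.dumps(measurements, sort_keys=True).encode() + nonce
    return blob, hmac.new(DEVICE_KEY, blob, hashlib.sha256).digest()

def verify_quote(blob: bytes, sig: bytes, nonce: bytes, reference: dict) -> bool:
    expected = hmac.new(DEVICE_KEY, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False  # quote not signed by the expected device key
    if blob != json.dumps(reference, sort_keys=True).encode() + nonce:
        return False  # measurements drifted from endorsed values, or a replay
    return True

nonce = secrets.token_bytes(16)
reference = {"bios": "sha384:abc...", "bmc_fw": "sha384:def..."}
blob, sig = make_quote(reference, nonce)
assert verify_quote(blob, sig, nonce, reference)  # platform trusted; admit workload
```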
The other part that we also need to comprehend in the big picture of that scheme of things is not just
adding security features, but making sure you have a high assurance bar on it. You have ample
transparency capabilities built around it. As you might have heard in this morning's keynote,
OCP announced the SAFE program. We are piloting and collaborating with a bunch of our
partners in the industry to say, as we build the firmware for our products
along the lifecycle of the development process, we do constant checks on the quality
of the code. That's your first line of defense before you get into any of the more
sophisticated attacks that arise.
We ensure that the code is audited by a third party so that our customers feel comfortable
that this is not just us building the code, but somebody else has looked at it, they have
endorsed it along the development lifecycle of the product.
And that endorsement becomes a key part in the
whole platform attestation, where the assurance has a certain score to say,
I'm going to look at this device and then look at the entire platform. Now I
can go and say, hey, this is a secure platform that hasn't been tampered with.
Now I can unleash a workload on it. So, Prabhu, when you look at security, we know that bad actors are getting smarter and
smarter every day.
They're finding new ways to try to exploit organizations and get access to applications
and data.
AI is giving them some really fancy tools to make them even more sophisticated in this.
When you look at that challenge, is there anything that the hardware industry
and AMD in particular is aiming to do in the future
to give organizations even more tools to protect themselves?
Absolutely. Back in the day,
data at rest was the prime
thing that one would go after;
I needed to encrypt the data at rest.
Then, data in motion.
Now, how do we protect data in use?
That's where all of our confidential computing capabilities come into play,
ensuring whatever is in use is protected from bad actors.
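To ground those three data states, here is a small editorial sketch. It assumes the third-party cryptography package; the in-use protection can only be noted in comments, since it lives in hardware, such as the Infinity Guard features mentioned in the episode notes.

```python
# Data at rest and in motion can be protected in software; data in use is
# the gap that hardware confidential computing closes.
import ssl
from cryptography.fernet import Fernet  # pip install cryptography

# 1. Data at rest: encrypt before it touches the disk.
key = Fernet.generate_key()               # in practice, sealed in a KMS/HSM
ciphertext = Fernet(key).encrypt(b"customer record")

# 2. Data in motion: authenticated, encrypted transport.
tls_context = ssl.create_default_context()  # TLS protects bytes on the wire

# 3. Data in use: once decrypted for processing, plaintext sits in RAM.
plaintext = Fernet(key).decrypt(ciphertext)
# Confidential computing (e.g., hardware memory encryption plus attested VMs)
# keeps memory protected even while this plaintext is being computed on.
```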
When we look at platforms, there's something about balance and sustainability that's important.
OCP has a huge initiative around compute sustainability,
and I know that AMD takes energy efficiency
and delivering sustainable platforms very seriously.
Robert, what advancements have been made on the platform
to drive increased energy efficiency?
And how do you look at that challenge
when you're working with your architects
on what are the core capabilities to think about?
Is it driving better performance per watt?
Is it really about getting to idle faster
through pure performance? How
do you ensure that you're delivering the most sustainable product to the customer?
Yeah, so for us, it really starts at the microarchitecture. If you think about our core design, we're
trying to make a highly efficient and performant core. And Forrest talked about it this morning,
about how if you look at our core size versus the competition,
there's a significant delta, right, between the two.
And that's because we're focused on providing core scaling
where everybody benefits from core scaling.
And we do believe in accelerators in compute.
We have a
whole GPU lineup, but we don't believe that you should inundate your core with
things that aren't used a lot, because that just adds transistors, burning power,
and they don't tend to have the biggest bang for the buck when they're buried
deep inside the core. They're better off as a
truly optimized
domain-specific architecture.
That's always going to be the best performance per watt.
So we start with a highly efficient core, give you a lot of them, keep them lean and
mean, give you a great balanced SoC with high memory bandwidth, high IO, and deliver platform-level
features to guarantee quality of service between all those cores
so that they don't interfere with one another.
So we're really focused on driving the performance per watt.
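One concrete form such platform-level QoS features take on Linux is the resctrl interface. The sketch below is a hedged illustration: the /sys/fs/resctrl paths are the standard resctrl layout, but the group names, core ranges, and schemata values are assumptions that vary by platform.

```python
# Illustrative use of Linux resctrl to partition L3 cache between groups of
# cores so noisy neighbors don't interfere (requires root and a mounted
# resctrl filesystem: mount -t resctrl resctrl /sys/fs/resctrl).
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

def make_qos_group(name: str, cpus: str, schemata: str) -> None:
    group = RESCTRL / name
    group.mkdir(exist_ok=True)                 # a new directory is a new class of service
    (group / "cpus_list").write_text(cpus)     # cores assigned to this group
    (group / "schemata").write_text(schemata)  # cache/bandwidth allocation mask

if __name__ == "__main__":
    # Example values only: give latency-sensitive cores the larger L3 share.
    make_qos_group("latency", cpus="0-7", schemata="L3:0=ff0\n")
    make_qos_group("batch", cpus="8-15", schemata="L3:0=00f\n")
```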
And as Forrest talked about with some of the advanced packaging technologies,
it's our ability to drive beyond the reticle limit of silicon by using chiplets.
If you saw his charts, you saw
kind of the reticle limit of a monolithic die versus how much silicon we can put in a package
using chiplets. It's an order of magnitude more silicon that we can put in. And the more silicon
that we put in that is collaborating to deliver performance per watt, the more
energy we save on transferring data back and forth between chips and so forth.
And so the best lever we have today to drive energy efficiency is through
deeper levels of integration, core scaling, and breaking the reticle limit on silicon
design, so that ultimately, yes, we are driving much
higher performance, but we're moving the performance-per-watt knob up significantly.
Versus, if all we delivered was just a monolithic die at the reticle limit, the performance-per-watt
improvement would be very, very small.
By breaking free of that reticle limit and advancing Moore's Law, we can really move
the needle on performance per watt,
which is the best way to drive data center efficiency and consolidation.
As an industry, if you turn the clock back in time and look at the big picture,
we've been consolidating from mainframe for the last 50 years, right?
From these big monolithic machines, and we've been consolidating towards these highly integrated devices with the right functions and features.
And that is the best way to drive the performance efficiency.
Thank you guys so much for being with me today.
I know it's a busy conference, and you took out some time to be on the tech arena.
One final question for you both. Where can folks engage to learn more about the EPYC product portfolio
and engage with your teams to talk about their own needs
and potentially deploying some EPYC servers in their data centers?
We have our AMD booth on the expo floor.
Come drop us a contact card.
We'll get hold of you and walk you through what we need to do to get you on board
and running on EPYC.
Yeah, and to add on to that, I think one of the big principles
that Lisa has really instilled in the company culture is that we are customer inspired.
We listen to our customers.
We engage our customers.
We try and figure out what problems they're trying to solve.
We don't have all the answers, nor do we understand all the problems. So the best thing we can do is engage the at-scale
customers and understand their at-scale problems. And I can tell you from my experience, the at-scale
problems are different than the enterprise problems. And we have to listen to that whole
range of customers and optimize our products, our security, and our solutions based on the direction that we're getting from our customers
and then come up with innovative solutions.
And that's really the heritage of AMD: a very advanced architecture and engineering company that goes and solves the world's biggest problems.
You can only do that by listening.
Well, thanks, guys, for being on today.
That was really inspirational.
Thanks for the time. Thank you.
Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena.