In The Arena by TechArena - Inside AMD’s Vision for AI and Data Center Evolution
Episode Date: October 23, 2024
In this episode of Data Insights by Solidigm, Ravi Kuppuswamy of AMD unpacks the company's innovations in data center computing and how they adapt to AI demands while supporting traditional workloads.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Allyson Klein.
Now let's step into the arena.
Welcome in the arena. My name is Allyson Klein, and we are coming to you from the
OCP Summit in San Jose, California. And it's another Data Insights Series podcast, which means
Jeniece Wnorowski is with me. Welcome back to the program, Jeniece. Oh, thank you, Allyson. It's
great to be back. So it's been a great day at OCP Summit. What have the highlights been thus far? Gosh, so many great things.
Power efficiency, power efficiency all day, liquid cooling.
But I'm really, really excited today to be able to have the opportunity to sit down with,
Ravi Kuppuswamy, who is the Senior Vice President of AMD Server Systems.
And Ravi, welcome to the show.
Thank you, Jeniece.
And thank you, Allyson, for having me on this show.
Ravi, we have spent a lot of time on the program discussing compute requirements for the AI era.
But data centers are looking at conventional workloads still.
I mean, you wouldn't know it at OCP, but we still have the need to service the full continuum
of workloads within the environment.
Can you just start the conversation with an introduction about your group, how it relates
to AMD's full portfolio, and how you envision computing evolving from this generational
disruption?
Oh, thank you, Allyson. I'd be happy to. My name is Ravi Kuppuswamy. I actually run AMD's server product and engineering divisions. Now, if you talk about servers, people usually talk about the CPUs. And my job is essentially to go ahead and initiate the new roadmap for AMD's server CPUs, and then carry that all the way to the time when we deliver it to the customer and have them ramp.
So thank you for mentioning general compute and conventional workloads, because that often gets overlooked in today's immense focus on generative AI, large models, et cetera.
Especially in the EPYC business, which is our branding for server CPUs, we have been very consistent in recognizing the new demands of AI-enabled applications. But we remain steadfast in making sure EPYC continues to offer leading performance in traditional general compute workloads, such as HPC, databases, cloud-native applications, collaboration systems, finance, and more.
This has allowed us to see that these traditional applications are indeed also adapting and adding elements of AI into their application environments. You can look at a wide array of apps from Microsoft, Oracle, SAP, etc., and see them adding AI-enhanced tools, such as recommendation engines and chatbots, into their applications. For that reason alone, while massive AI models are indeed a significant and disruptive step, the vast majority of real-world applications are more evolutionary and remain focused on general compute.
Awesome.
Gosh, you know, thank you for covering so much there, Ravi.
You know, AMD has delivered some really incredible products,
specifically at the show, even this week.
So let's talk a little bit, starting with EPYC, about how you've extended that to the MI300 solutions and beyond.
Can you tell us a little bit how your customers can apply this portfolio,
you know, given the overall evolution?
Thank you, Jeniece. I would. This question does come up a lot. Of course, we can be somewhat agnostic, because we believe we want to offer leadership products for whatever the customer wants. We'd like to let the customer's needs guide the discussion. Why do I say that? There are certain companies who only have a certain portion of the portfolio, but AMD has all the different aspects of it, whether it be CPUs, GPUs, AI NICs, and so on and so forth. We want to essentially say it's not a hammer all the time, because there isn't just a nail. We have a diverse portfolio that we can actually bring to bear.
How is the AI element interacting with traditional workloads?
As we mentioned in the earlier question, we want to make sure we are non-disruptive to existing, run-the-business applications that largely live on CPUs today. Similarly, scoping out the size of the models and how much training needs to be done will also significantly impact the choice of CPU versus CPU plus GPU. Models over 13 billion parameters, or heavy training needs, will likely need a combination like CPU plus GPU.
And then there are myriad other considerations: power, data center footprint, latency needs, budgets, et cetera. Some people may want to do X but only have a certain amount of budget, so they may have to choose something at the lower end of the portfolio. Being able to have the performance and efficiency metrics for various CPU and CPU-plus-GPU configurations is super critical. So, bottom line, we want to let our customers choose, based on their needs and their TCO requirements, from the broad portfolio we have.
Now, this week we're at the OCP Summit.
I know that you were at a major launch event last week, which was so fun to watch.
You're focusing here on the leading edge of hyperscale design.
When you think about OCP configurations, do you need to design in a different way, and how do you meet these specific customer demands?
No, thank you, Allyson, again. Yesterday Forrest Norrod gave a keynote, and he showed a picture; I don't know if people had a chance to see it. It was a picture of a hardware stack that goes all the way from the chip up through the system, the rack, and so on, and similarly a software stack that goes all the way up as well. And the main point, essentially, is that both system and data center design need to evolve and continue to support business demands and software architectures. The evolution of systems and data center design is a major reason that technology-related global energy consumption has risen so much more slowly than the amount of data that is created and distributed: the infrastructure has become more and more efficient. So you need to take that system- and rack-level design down, incorporate it into the chip, or incorporate it into the lowest levels of software, to get an optimal, energy-efficient, flexible, and open design.
So in my mind, for industry-standard servers, this includes having a breadth of offerings optimized for the market's needs, from entry-level systems to edge-optimized platforms to a scalable line powered by our Zen 4 and Zen 5c cores. And of course, we deliver all of this using an open ecosystem. AMD, at the heart of it, embraces open like no other, across all our teams and all our product portfolio.
Love that. I love that.
So if we want to take a step back
and compare hyperscalers to enterprise,
how do you really see
AI influencing infrastructure on-prem?
And then how does this differ from the other big guys?
Thank you, Jeniece. The learnings, technologies, and techniques implemented by the scale-out hyperscalers have really provided broad benefits beyond their own infrastructures.
If you look at it, very few enterprises will ever get to the scale of our big hyperscale cloud vendors. Enterprises have seen that most things in technology definitely start at the top, with the people who have the most scale, and then waterfall their way down into other markets, with everybody seeing the value as that technology becomes more widely adopted and more economical to use. So enterprises have seen the kind of energy efficiency, management and resource efficiencies, and development practices initially pioneered by hyperscalers as targets to help them drive better utilization and impact from their IT environments.
Of course, cloud has also given them another arrow in the IT quiver, such that when they are straining their on-prem options, they can leverage the cloud. You build an infrastructure for what you want on premises, but if your scale suddenly goes beyond it, you can extend the same workloads into the cloud. So I do think there is a huge dependency. And one huge difference to contrast is budget: these big hyperscalers have an immense amount of resources that they can deploy, and very few companies have that. So leveraging their technology is super important.
Now, everything that you've talked about and everything that the collective customer base is doing depends on data.
And data stores are increasingly distributed across edge and cloud environments.
How is AMD working with your customers and ecosystems to help provide availability to extract value across this continuum?
The distribution of computational power, data, and intelligence
is ongoing and is inevitable.
In business and personal life, we have come to expect what I tell my kids all the time: instant gratification. You want to see the value; you want instant and virtually free access to data and services, delivered in increasingly personal ways.
This tectonic shift has made our broad portfolio of solutions
increasingly practical and important.
And we can certainly deliver optimized compute engines
from the cloud to the edge to the endpoints
to enable efficient processing.
And in virtually any form factor, within whatever constraints we have at the time. If you look at our own portfolio, from the EPYC 9005, which is the top of our line, to the data center EPYC 8004 and EPYC Embedded, all the way down to the edge, we are able to provide a wide variety of Ryzen AI-powered endpoints, so that we can deliver data and services wherever people want them most and value them most, with the greatest efficiency.
All right, so we got to switch gears a little bit.
So Ravi, can you tell us a little bit
about the announcement that's gone out this week?
I know everyone wants to hear more about that.
I would love your perspective. Oh, great. Okay. First and foremost, let me just say OCP has played an important role in driving initiatives around form factor, management, connectivity, and so much more, reducing risk for customers; a role that, I want to say up front, AMD strongly supports. Innovation can be much easier if you're willing to drive down proprietary paths, but that is often detrimental to the long-term health of customers and markets. So we prefer to drive and support innovation with the backing of standards.
Getting to it, the big announcement of the x86 Ecosystem Advisory Group is yet another example of how important we think this is. We can join with one of our biggest competitors. I think Forrest used the phrase "when pigs fly," and I've seen other people in the press say things like "hell has frozen over." But my point is that being able to join with one of our biggest competitors to get agreement on common standards, that is, compatibility, interoperability, and feature sets for developers and customers, is vital. In my mind, the x86 ecosystem is so rich that it only helps for the two big players in it to join hands and actually show how customers can take advantage of it.
This new initiative builds on existing AMD-led
and OCP-supported initiatives.
It's not a standalone, right?
Ultra Ethernet, Ultra Accelerator Link, these are all initiatives that AMD supports, and they are specifically targeting an open ecosystem. It was an incredible thing for somebody like me; you know, I spent over 20 years at Intel, and I never thought that I would see that happen. Yeah, exactly. So, you know, I think that this is a moment where, you know, I love industry innovation. I love open-standards innovation. And you guys just leaned in in a way that I never thought you would. Congratulations on that. And you have been on a tear. You released Turin to the market. You released your new MI300 series to the market. You released the first Ultra Ethernet adapter to the market. That was incredible, not expected by me. And then you followed it up with this incredible leadership from AMD.
I want you to take a look forward
and talk about 2025
and what you're expecting from data center computing
on a macro level
and how does AMD plan to play a role in that?
Well, thank you, Alison, again.
This is an exciting time in our industry.
There's a lot of change, and change means opportunity for all of us.
As you know, between the pandemic and the economic environment of the last few years, many businesses have a lot of old and inefficient IT infrastructure that needs to be replaced.
In general, and I am sure, Allyson, you probably know this too, people have infrastructure that they change every three to five years. So if you look at a lot of the infrastructure that's out there today, it's about four years old on average. And if you replace that infrastructure with today's, with our fifth-generation EPYC, the numbers we've looked at show that if you took the top-of-stack CPUs from four years ago and you have a thousand of them, you can do the same amount of work with 131 of today's fifth-generation EPYC servers.
Yeah, and that is reduced; we talked about energy efficiency, and that's a lot less power and a lot less space. Now, if people want to fill that space and power with AI compute, because that's the hottest thing, they can fill it with AI compute. If they want to fill it with general compute, as others may need, they can fill it with general compute. And I think the rough math is that with what you've saved, you can add 1.1 million TOPS of AI compute if you want it.
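As a back-of-the-envelope check on that consolidation claim, here is a small sketch, under stated assumptions, that turns the 1,000-to-131 server figure from above into a consolidation ratio and an estimate of freed capacity. Only the server counts come from the conversation; the per-server power draw and per-node TOPS figures are purely hypothetical placeholders, not AMD data.

```python
# Back-of-the-envelope consolidation math for the figures quoted above.
# Only legacy_servers and new_servers come from the conversation; the
# power and TOPS numbers below are hypothetical placeholders.

legacy_servers = 1000   # top-of-stack servers from ~4 years ago (from the episode)
new_servers = 131       # 5th-gen EPYC servers doing the same work (from the episode)

consolidation_ratio = legacy_servers / new_servers
print(f"Consolidation ratio: {consolidation_ratio:.1f}:1")  # ~7.6:1

# Hypothetical: assume each freed rack slot drew ~500 W under load.
assumed_watts_per_server = 500
freed_kw = (legacy_servers - new_servers) * assumed_watts_per_server / 1000
print(f"Freed power (at {assumed_watts_per_server} W/server): {freed_kw:.0f} kW")

# Hypothetical: if a replacement AI node delivered ~1,300 TOPS, refilling
# the 869 freed slots would land near the 1.1M TOPS quoted above.
assumed_tops_per_node = 1300
refill_tops = (legacy_servers - new_servers) * assumed_tops_per_node
print(f"Potential added AI compute: {refill_tops / 1e6:.2f} million TOPS")
```

The takeaway matches Ravi's point: the savings from consolidation can be reinvested either in AI compute or in more general compute, depending on what the business needs.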
Wow, that's incredible.
It's pretty cool. I just wanted to maybe round it off by saying our broad portfolio, our commitment to openness and standards, and our chiplet-based architecture have helped us differentiate our products and make it very appealing for customers to use AMD.
And then customers are increasingly engaging us early because, roughly speaking, as you correctly said, about six or seven years ago AMD had virtually less than 1% market segment share in this space. As of the first half of this year, we were at 34% and moving up. Meaning, apart from the fact that it's great news for AMD, we have a great responsibility in this ecosystem to go ahead and provide energy-efficient compute. And I think AMD is taking that very seriously.
Awesome.
Well, with all the excitement and news here at OCP, right, it's such a big splash.
But can you tell us, Ravi, where can your audience go to look for more information on AMD?
Thank you, Jeniece. We are always publishing new data. You know, we are also always putting out blogs and studies, and I'm here on a podcast; we continue to do podcasts like this too, talking about customer success stories featuring AMD's EPYC products. But if you're seeking to learn more, AMD.com/EPYC is actually a great starting point. You'll also be able to follow AMD on X or LinkedIn to get notifications of new and upcoming announcements. For those interested in a deeper view of the data center than the CliffsNotes version I just provided here, the video of my session will be posted to the OCP site.
I did one yesterday, and, you know, we'd be happy to have people look at that. I'm also open to more; there are always people connecting with me on LinkedIn, and I'm open to sharing more in that space as well.
Thanks so much for being on the show, Ravi.
I know this is a hugely busy week for you,
and we really appreciate you spending time with us here on Tech Arena.
So thank you, Allyson and Jeniece.
Thank you for having me on your show.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyrighted by The Tech Arena.