In The Arena by TechArena - Breakthrough Data Center Platform Innovation with AMD
Episode Date: March 28, 2023. TechArena host Allyson Klein chats with AMD Senior Fellow and CXL Technical Task Force Co-Chair Mahesh Wagh regarding AMD's introduction of CXL platforms to the market with 4th Gen AMD EPYC processors and his organization's strategy to deliver disruptive innovation utilizing CXL capability in the years ahead.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Tech Arena. My name is Allyson Klein, and I'm delighted today to be
joined by Mahesh Wagh, Senior Fellow for Server System Architecture at AMD and the Co-Chair of
the CXL Technical Task Force. Welcome to the program. Glad to be here, Allyson. Looking forward to talking with you.
I have so many topics to ask you about today, but why don't we just get started with just a statement about data centers.
Data centers are at the center of innovation for everything from new breakthroughs in AI to new digital services, redefining industries.
Yet we've relied on a consistent definition of data center compute for decades
with architecture defined by rack-based pizza box servers. Why has the industry stayed true
to this architecture for so long? Yeah, that's a very good question. I think if you look at
where the industry is going from a data center perspective, and if you're looking at the use
cases, the industry is looking at what is the best way to
innovate on existing platforms and how you bring incremental value, right? So if you look
at all of those things in terms of, you know, what is the best return on investment that you're going
to get, it's really usually those incremental technologies that you build on.
an opportunity where you can recoup investment, build incremental technologies,
that's where it kind of takes off.
So from that perspective,
the data center is enabling a lot of businesses, right,
to kind of transform onto these data center servers.
So then you're looking within the industry at
how we bring in all of these new applications
with more of just the incremental approach,
and you're innovating
within that space as well. So don't get me wrong. But when you look at that, it's like,
what's the best return of value that you get on innovation? That drives you towards
incremental technologies. And when you think about something incremental, it's building on
top of what you already have. So that's what we tend to see within the industry.
Now, CXL is the topic for today.
And CXL has been introduced to data center platforms.
AMD introduced it with Genoa.
Why is this such a critical technology? And what does it change in terms of what you can do with the data center?
CXL builds on top of PCI Express. As we all know, PCI Express has been there for more than two
decades and is going very, very strong. And from the interconnect or IO perspective,
it's giving you a tremendous amount of bandwidth and is on a very good cadence to provide capabilities.
What CXL brings, at the first level, is new use cases, new usage models on top of what exists
today on a PCI Express sort of infrastructure.
What is it bringing to the ecosystem?
It's bringing new use cases that require the cache coherent interface and providing
opportunities to innovate on memory technology. So that is what it is bringing. Today it has
established itself as an industry-supported cache coherent interface, defined by the consortium,
that works on an existing technology, which is PCI Express. Now, why is it a game changer?
It's doing two things fundamentally.
First, from just a consortium perspective,
it's pretty much bringing all of the compute vendors,
all of the memory vendors, the data center, enterprise,
companies that are producing solutions
and the application developers sort of in one common place
to address the emerging requirements
of the market, right?
So that's great.
We have convergence there.
From capabilities perspective,
it's providing you the cache coherent interface
and a memory interface.
So all of the things that your applications could typically take advantage of at a CPU,
in terms of cache coherence from a CPU core perspective,
you're now providing those same capabilities for accelerators.
In terms of memory technology, the memory controller was always integrated
within the CPU, so anything related to memory technology would go through the CPU.
What CXL is enabling is providing innovative solutions
where now the memory controller
is outside of the CPU connected with CXL.
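A minimal sketch of what that looks like from software, assuming a Linux kernel that ships the generic CXL driver stack: the kernel registers CXL devices on a "cxl" bus, so externally attached memory controllers can be discovered under /sys/bus/cxl/devices. That path and the memN naming are assumptions about the upstream driver convention, not something specific to AMD platforms.

    /* Hedged sketch: enumerate CXL memory devices the OS has discovered,
     * i.e. memory controllers that sit outside the CPU and are reached
     * over CXL.  Assumes the Linux CXL driver's sysfs layout. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *path = "/sys/bus/cxl/devices";  /* assumed sysfs location */
        DIR *dir = opendir(path);
        if (!dir) {
            perror("opendir");      /* no CXL driver loaded, or no devices */
            return 1;
        }
        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            /* Memory expanders are conventionally named mem0, mem1, ... */
            if (strncmp(de->d_name, "mem", 3) == 0)
                printf("CXL memory device: %s/%s\n", path, de->d_name);
        }
        closedir(dir);
        return 0;
    }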
So in a nutshell, why is it changing the game?
It is providing the opportunity
to innovate on an existing infrastructure.
And that is big, right?
And an opportunity to innovate for different
reasons: either folks are looking for differentiated, value-added products,
or people are looking at building products that would provide better TCO than existing solutions.
So as a result of that, the opportunity is significant to both innovate and bring value
on a platform. And that, in my mind, is what is going
to make it a game changer. Now, I mentioned that Genoa does support CXL, which is the first platform
from AMD to support it. How did you decide to deliver CXL at this time? And what are the
specifics behind your support on that?
So when we looked at it,
you know, there are sort of these two aspects
that we talked about,
accelerator attach and memory attach.
If you look into the ecosystem between the two,
there is a significant amount of pull
towards the memory attach part of it.
So from a getting-to-market perspective
and bringing that into the product,
what AMD thought about is,
what are all the key features that you need to enable memory expansion? So from that perspective, with
CXL 1.1 and with the 4th Gen AMD EPYC processor, we wanted to first address system flexibility,
which is, can you provide the biggest configurability and flexibility to the system vendors?
In which case you can decide to put
like a high bandwidth memory expansion device
behind a single port of CXL.
Or you could decide to bifurcate the ports.
Those are the capabilities that we provided
from a system flexibility perspective.
From a media perspective, CXL by definition is agnostic.
So when we were looking at what we can provide,
we have solutions that allow the media type
to be either DDR5 or DDR4.
So that's giving a lot of TCO advantages to our end customers,
who are looking at recouping their investments.
So they're looking at, okay, I want to do memory expansion.
Can I put my N-1 DIMMs, for example, DDR4 behind this controller
and now provide a memory expansion solution that is very cost effective?
So we enabled that.
Security continues to be a really important piece.
So one of the differentiating things that we provide
with Genoa is all of AMD's Infinity Guard security solutions that are available today for direct
attached memory. They just extend seamlessly over CXL. And as we all know, security is the primary
technical pillar for any solution that you want to deploy on a server. With Infinity Guard and CXL, you can just deploy seamlessly.
So that's one of the great things.
We support tiering.
So when we're bringing CXL devices, the key part about it is that its latency characteristics
are different than what it is going to be with direct attached memory.
And there were a lot of developments that have happened in the ecosystem related to that,
which is the understanding of non-uniform memory accesses, NUMA nodes. And what CXL is really doing is bringing, for the very first time, this concept of a headless NUMA node into the ecosystem. And there are a lot of innovations in that space to first understand how tiered memory
systems are working and then optimize for those tiered memory systems.
So one of the things that we do on the AMD side is provide all the architectural and
technical hooks in our CPU so that we can improve the
performance of a tiered memory system. And finally, we have the ability to enable disaggregated memory
systems as a proof of concept so that we can build systems of the future that enable
disaggregated memory, if you will. And then finally, with the AMD EPYC processors,
we were able to pull in some of the features that were defined in CXL 2.0.
An example of that is persistent memory.
So we could enable persistent memory support
starting with Gen 1.
So when I look at what we are doing
and what we are bringing
with the 4th Gen AMD EPYC processors,
we're really bringing these six different use cases
that are really, really important for our customers.
And bringing that on the very first generation
of the processor is unprecedented
for any technology development that I've seen.
So we're really proud about it
and the way we've brought it to the market.
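To make the headless NUMA node idea above concrete: on a Linux host, a CXL memory expander is typically exposed to software as a NUMA node that has memory but no CPUs. The sketch below assumes libnuma is installed and that the expander is surfaced that way; it scans for such a node and places a buffer on it. It illustrates the concept only, not AMD's tiering implementation.

    /* Hedged sketch: find a CPU-less ("headless") NUMA node and allocate
     * a buffer from it.  Build with: gcc headless_node.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int target = -1;
        struct bitmask *cpus = numa_allocate_cpumask();
        for (int node = 0; node <= numa_max_node(); node++) {
            long long free_bytes;
            if (numa_node_size64(node, &free_bytes) < 0)
                continue;                        /* node has no memory */
            if (numa_node_to_cpus(node, cpus) != 0)
                continue;
            if (numa_bitmask_weight(cpus) == 0)  /* memory, but no CPUs */
                target = node;
        }
        numa_free_cpumask(cpus);
        if (target < 0) {
            printf("No headless node found; CXL memory may not be present\n");
            return 0;
        }
        /* Place a colder, less latency-sensitive buffer on the far tier. */
        size_t len = 64UL << 20;                 /* 64 MiB */
        void *buf = numa_alloc_onnode(len, target);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        printf("Allocated %zu bytes on headless node %d\n", len, target);
        numa_free(buf, len);
        return 0;
    }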
Mahesh, you just described an amazing value proposition
of new capabilities with CXL.
What has the customer response been?
I know that the large cloud providers are very deeply involved in the consortium, but
how has the broader market responded?
And do you feel that enterprises by and large have really understood what is about to be
available to them
with their infrastructure?
Yeah, I think we're starting to see both, right,
from a large cloud provider's perspective.
And I think one of the key things is pretty much across the board, right?
What are we doing?
From an AMD perspective, we're at the forefront of driving core scaling, right?
We're bringing more cores, more capabilities into the system.
To support those cores, to support the, you know, bandwidth requirement and the capacity requirement, there are certain constraints on what we could do based on the existing memory
technologies. So at the very first go, CXL is addressing some of the shortcomings by
providing a flexible, you know, opportunity to either meet the memory capacity
or the memory bandwidth requirement by extending to CXL.
Now, it has certain TCO advantages that you can take benefit from.
And those things aren't only limited to large cloud providers.
For example, on the enterprise side, if you're deploying a large in-memory database sort of system,
you can start to take advantage of what CXL has to offer from a TCO perspective.
If your applications are targeting high-performance computing, or are applications that require more bandwidth,
CXL is a way to provide that more bandwidth at an effective cost.
And one of the things that we're going to see as we deploy more CXL is you would be able to look at your applications and profile them in terms of their performance requirements.
And once you understand them, then, you know, some preliminary results
indicate that 25 to 30% of applications are not latency sensitive. So if you can map those applications onto CXL,
it now allows you to deploy a solution
where for your most demanding application performance needs,
you're targeting direct-attached memory.
For other applications,
you can target this other tier, right?
So it's starting to open up these sort of discussions,
both in the cloud as well as in the enterprise, where people will start to understand the value that this is bringing and then understand that and then see how we can make use of them for the applications that they're going to deploy.
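As a rough illustration of that placement decision, again assuming a Linux host with libnuma: a service that profiling has shown to be latency tolerant can simply prefer the CXL-backed node for its allocations, while latency-sensitive services keep the default direct-attached policy. The node number below is hypothetical; it would be discovered the same way as in the earlier sketch.

    /* Hedged sketch: a latency-tolerant process prefers the CXL-backed
     * NUMA node for its pages.  Build with: gcc prefer_far_tier.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available\n");
            return 1;
        }
        int cxl_node = 2;              /* hypothetical headless CXL node id */
        numa_set_preferred(cxl_node);  /* future page faults prefer this node */

        /* Ordinary allocations now land on the far tier when capacity allows,
         * falling back to local DRAM otherwise. */
        size_t len = 256UL << 20;      /* 256 MiB working set */
        char *buf = malloc(len);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 0, len);           /* first touch actually places the pages */
        printf("Working set preferred onto node %d\n", cxl_node);
        free(buf);
        return 0;
    }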
You know, we're at MemCon and memory is obviously central to CXL and what it can bring to the table.
Let's just take a step back for a second and ask the simple question.
Why is memory capacity so important to applications and what's driving that?
And where do you see the near term opportunity for CXL to really make a difference with memory?
Yeah, I'll start with two fronts on that one.
One, I kind of sort of addressed in the previous question, which was really from a core scaling perspective.
We're just looking at it as if we were to not even change the applications that you have.
And if you're looking at the number of cores that we're adding from a core scaling perspective,
we have to have a solution that can keep up with the bandwidth demand and the capacity demand to feed the cores.
And the memory technology isn't necessarily keeping up with that.
We have some constraints, either platform constraints, channel constraints, memory technology constraints that are not scaling at the ramp that we're scaling the cores.
So at the get-go, you need a solution that's providing you the flexibility.
Second one is from a memory capacity perspective,
one of the points that we've understood is
as the applications are improving sort of their capabilities,
we're seeing the capacity that is needed
for an application grow every year, right?
So there's a demand for more memory capacity
for a given application.
And then for new use cases such as AI and ML,
the embedding tables that you need, for example,
for recommendation engines and things of that sort,
they're growing exponentially.
So as we look at that growth,
it's creating a demand for more capacity, right?
And then how do you address capacity?
All of those constraints that I talked about, memory scaling, and then memory is a significant cost of a data center, right?
So the price is also increasing.
So what are the ways you can optimize for that? And CXL provides you this opportunity to innovate and bring solutions to the market
that would meet the application needs
for growing capacity,
as well as the needs for the system
to feed the cores,
both in capacity and bandwidth.
So that's what's driving that.
And MemCon is the perfect place
because you've got all of the folks
who are focused on memory technology coming together.
I do expect a lot of traction on CXL and a lot of talks related to CXL that are going to be at the center of the discussions at MemCon.
What are AMD's plans for leadership in this space moving forward?
And how do you see the evolution of the technology in terms of deployment in the next few years?
Starting with 4th Gen AMD EPYC,
we're leading the space with our processor with very, very innovative capabilities.
And we're really hitting on the six to seven different use cases that our customers are targeting.
Some of them are more mature.
Others are in the development phase, right?
But we see this very sort of a nice roadmap
for how these features are going to come out.
At the heart of that,
in terms of, you know, what's the leadership?
What I keep telling all of the teams
that I engage with is,
at the forefront, we've got to prove
that CXL is functional and performant, right?
Which means we would start with
memory expansion, direct attached memory expansion with DDR4, DDR5 memory. And we're working with the
entire ecosystem, with the controller vendors on their architecture very, very closely to make sure
that we can bring performance solutions to the market. And it'll start with the AMD EPYC and with a lot
of our partners. And we're seeing in this year, these solutions come to the market. We have a
production CPU. We're expecting production-level devices to be available in 2023. That would be what
would start sort of this adoption of CXL. What follows that is just building on top of these capabilities, right?
You bring in direct attached memory expansion, and you extend the capability
by, you know,
working with the ecosystem to enable these tiered memory solutions, and then the
optimized tiered memory solutions with, you know,
developments in the ecosystem to improve performance.
And once that's established,
we see that setting a stage
for disaggregated memory, persistent memory,
and lots of the use cases that are following.
So that's how we see this as, you know,
sort of this crawl, walk, run approach,
start with direct attached memory expansion
and then build on top of that.
And, you know, it's going pretty good.
I'm pretty happy with the sort of progress that we're seeing
in the ecosystem. And it takes a village, it takes everybody, right? It takes CPU vendors,
the ecosystem, the software development, all of it to get together to lift this technology up.
This isn't just one player. It's an ecosystem that'll need to get together to drive it, and events like MemCon
and other events are really important because they bring people together and drive the technology
forward.
Mahesh, when you look at the consortium itself, you've released a 3.0 spec. That's going to take us, you know, through a few years at least before we start
seeing 3.0 solutions at scale. What is next for the consortium in terms of making sure that this
technology is adopted well and performs as you and the technical task force intend? I think one of the things,
the question is, you know, what is happening with 3.0 and how did it come about?
There were all of these use cases and interests that the ecosystem had and requirements that were coming into the consortium.
But we had to look at it and put that out in spec versions in terms of how we do incremental
development.
So with CXL 1.0 and 1.1, you bring key features in. With CXL 2.0, you provide
some scalability, and you add some extensions for what didn't exist in 1.1, persistent memory
as an example. And CXL 3.0 then finishes it by providing you the sort of scaling factor
for these capabilities. The direction for the consortium is now that we've defined that,
give it a little bit of space for all of these technologies to mature, the products to come into
the market, and then start thinking about the next generation of the CXL spec. So we're going to see
some amount of slowness in terms of the next version of the spec. And primarily, it is for
us to be able to deploy solutions,
get some feedback from what exists
and what the experience has been,
and then drive that forward.
That doesn't mean the innovation will stop.
We will continue to look at CXL 3.0 and beyond
for key features that are really important
and can't wait for the next generation,
to be brought in as
ECNs, which are engineering change notices, things like that.
But the whole direction is, now that we've laid out what it looks like, from an ecosystem
perspective it also helps you to kind of look at it and say, what is the end goal in terms
of the overall scale-out capability?
Where can you start and then build it, right?
So it's set up for that crawl, walk, run approach.
We're still at the crawl stage from an ecosystem deployment perspective,
but then the vision is laid out,
the path is there for the ecosystem to go drive together.
That's fantastic.
One final question for you.
You've put out a lot of information,
both on CXL as a technology and AMD's plans. Where would you send folks for more information? Outside of MemCon, the CXL Consortium is a good place. If people want to know more about it, they
could reach out to the consortium.
The consortium does a fantastic job of releasing webinars, training materials, tutorials for
those who are either new to the technology or those who are well entrenched in the technology
and want to learn more.
So all of that material is available.
There are periodic training sessions
that the consortium does.
If you want to find specific information
about what AMD is doing,
find me on LinkedIn.
Or if you're coming in through a company,
you can engage with your AMD rep
and they know how to connect to the technical folks.
So that would be the way to sort of get connected
with the technology.
Fantastic.
Thank you so much for being with us today, Mahesh,
and giving us this great primer on CXL
and how it's going to impact data center infrastructure.
I can't wait to see more as we continue moving forward.
Yeah, we're really excited about it.
And we're really, really excited to bring this out
into the market.
And happy to talk to as many people as possible in bringing this technology out. So like I was saying,
it's a village that requires all of us to get together to drive this forward.
Thanks for being here.
Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net.
All content is copyright by The Tech Arena.