In The Arena by TechArena - Breakthrough Data Center Platform Innovation with AMD

Episode Date: March 28, 2023

TechArena host Allyson Klein chats with AMD Senior Fellow and CXL Technical Task Force Co-Chair Mahesh Wagh about AMD's entry of CXL platforms into the market with 4th Gen AMD EPYC processors and his organization's strategy to deliver disruptive innovation utilizing CXL capability in the years ahead.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and I'm delighted today to be joined by Mahesh Wagh, Senior Fellow for Server System Architecture at AMD and the Co-Chair of the CXL Technical Task Force. Welcome to the program. Glad to be here, Allyson. Looking forward to talking to you. I have so many topics to ask you about today, but why don't we just get started with a statement about data centers. Data centers are at the center of innovation for everything from new breakthroughs in AI to new digital services redefining industries. Yet we've relied on a consistent definition of data center compute for decades
Starting point is 00:01:05 with architecture defined by rack-based pizza box servers. Why has the industry stayed true to this architecture for so long? Yeah, that's a very good question. I think if you look at where the industry is going from a data center perspective, if you're looking at the use cases, the industry is looking at what is the best way to innovate on existing platforms and how do you bring incremental value, right? So if you look at all of those things in terms of, you know, what is the best return on investment that you're going to get, it's really usually on those incremental technologies that you build. So anywhere you find an opportunity where you can recoup investment, build incremental technologies,
Starting point is 00:01:45 that's where it kind of takes off. So from that perspective, the data center is enabling a lot of businesses, right? To kind of transform onto these data center servers. So then you're looking, within the industry, at how we bring all of these new applications with just the incremental approach, and you're innovating
Starting point is 00:02:05 within that space as well. So don't get me wrong. But when you look at that, it's like, what's the best return of value that you get on innovation? That drives you towards incremental technologies. And when you think about something incremental, it's building on top of what you already have. So that's what we tend to see within the industry. Now, CXL is the topic for today. And CXL has been introduced to data center platforms. AMD introduced it with Genoa. Why is this such a critical technology? And what does it change in terms of what you can do with the data center?
Starting point is 00:02:48 CXL builds on top of PCI Express. As we all know, PCI Express has been there for more than two decades and is going very, very strong. And from the interconnect or IO perspective, it's giving you a tremendous amount of bandwidth and is on a very great cadence to provide capabilities. What CXL brings, at the first level, is new use cases, new use models on top of what exists today on a PCI Express sort of infrastructure. What is it bringing to the ecosystem? It's bringing new use cases that require the cache coherent interface and providing opportunities to innovate on memory technology. So that is what it is bringing. So today it has
Starting point is 00:03:32 established itself as an industry-supported cache coherent interface, defined by the consortium, that works on an existing technology, which is PCI Express. Now, why is it a game changer? It's doing two things fundamentally. First, from just a consortium perspective, it's pretty much bringing all of the compute vendors, all of the memory vendors, the data center and enterprise companies that are producing solutions, and the application developers into one common place
Starting point is 00:04:06 to address the emerging requirements of the market, right? So that's great. We have convergence there. From a capabilities perspective, it's providing you the cache coherent interface and a memory interface. So all of the things that, typically at a CPU,
Starting point is 00:04:21 your applications could take advantage of, cache coherence from a CPU core perspective, you're now providing those same capabilities for accelerators. For things that existed in terms of memory technology, the memory controller was always integrated within the CPU. So anything related to memory technology would go through the CPU. What CXL is enabling is innovative solutions where now the memory controller
Starting point is 00:04:48 is outside of the CPU, connected with CXL. So in a nutshell, why is it changing the game? It is providing the opportunity to innovate on an existing infrastructure. And that is big, right? And an opportunity to innovate for different reasons. Either folks are looking for differentiated value-added products, or people are looking at building products that would provide better TCO than existing solutions.
Starting point is 00:05:17 So as a result of that, the opportunity is significant to both innovate and bring value on a platform. And that, in my mind, is what is going to make it a game changer. Now, I mentioned that Genoa does support CXL, which is the first platform from AMD to support it. How did you decide to deliver CXL at this time? And what are the specifics behind your support on that? So when we looked at it, there were sort of these two aspects that we talked about,
Starting point is 00:05:49 accelerator attach and memory attach. If you look into the ecosystem between the two, there is a significant amount of pull towards the memory attach part of it. So from getting to the market and bringing that into the product, what AMD had thought about it is, what are
Starting point is 00:06:05 all the key features that you need to enable memory expansion. So from that perspective, with CXL 1.1 and with the 4th Gen AMD EPYC processor, we wanted to first address system flexibility, which is: can you provide the biggest configurability and flexibility to the system vendors? In which case, you can decide to put a high bandwidth memory expansion device behind a single port of CXL. Or you could decide to bifurcate the ports. Those are the capabilities that we provided
Starting point is 00:06:39 from a system flexibility perspective. From a media perspective, CXL by definition is agnostic. So when we were looking at what we can provide, we have solutions that would provide the media type to be either DDR5 or DDR4. So that's giving a lot of TCO advantages for our end customers. They're looking at recouping their investments. So they're looking at, okay, I want to do memory expansion.
Starting point is 00:07:06 Can I put my N-1 DIMMs, for example, DDR4 behind this controller and now provide a memory expansion solution that is very cost effective? So we enabled that. Security continues to be a really important piece. So one of the differentiating things that we provide with Genoa is that all of AMD's Infinity Guard security solutions that are available today for direct attached memory just extend seamlessly over CXL. And as we all know, security is the primary technical pillar for any solution that you want to deploy on a server. With Infinity Guard over CXL, you can just deploy seamlessly.
Starting point is 00:07:49 So that's one of the great things. We support tiering. So when we're bringing CXL devices, the key part about it is that their latency characteristics are different from what they are going to be with direct attached memory. And there have been a lot of developments in the ecosystem related to that, which is the understanding of non-uniform memory accesses, NUMA nodes. And what CXL is really doing is bringing, for the very first time, this concept of a headless NUMA node into the ecosystem. And there are a lot of innovations in that space to first understand how tiered memory systems are working and then optimize for those tiered memory systems. So one of the things that we do from an AMD side is provide all the architectural and
Starting point is 00:08:40 technical hooks on our CPU so that we can improve the performance of a tiered memory system. We also have the ability to enable disaggregated memory systems as a proof of concept, so that we can build systems of the future that enable disaggregated memory, if you will. And then finally, with the AMD EPYC processors, we were able to pull in some of the features that were defined in CXL 2.0. An example of that is persistent memory. So we could enable persistent memory support starting with Gen 1.
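The tiered-memory behavior described here, a CXL expander showing up as a headless NUMA node with higher latency than direct attached DRAM, can be illustrated with a small sketch. This is not AMD's implementation: the latency numbers and the hot/cold placement policy below are invented purely for illustration.

```python
# Minimal sketch of hot/cold page placement across two memory tiers.
# The latency numbers and capacities below are illustrative assumptions,
# not measured figures for any AMD platform.

DRAM_LATENCY_NS = 100  # direct attached DRAM (assumed)
CXL_LATENCY_NS = 250   # CXL-attached expander, roughly a NUMA hop away (assumed)

def place_pages(pages, dram_capacity):
    """Keep the hottest pages in DRAM; spill the rest to the CXL tier.

    pages: dict mapping page id -> access count.
    Returns (dram_pages, cxl_pages) as sets.
    """
    by_heat = sorted(pages, key=pages.get, reverse=True)
    return set(by_heat[:dram_capacity]), set(by_heat[dram_capacity:])

def avg_latency_ns(pages, dram_pages):
    """Access-weighted average latency for a given placement."""
    total_accesses = sum(pages.values())
    total_ns = sum(
        count * (DRAM_LATENCY_NS if page in dram_pages else CXL_LATENCY_NS)
        for page, count in pages.items()
    )
    return total_ns / total_accesses

pages = {"a": 900, "b": 500, "c": 50, "d": 10}   # per-page access counts
dram, cxl = place_pages(pages, dram_capacity=2)  # DRAM tier holds 2 pages
print(dram, cxl)                    # hottest two pages land in DRAM
print(avg_latency_ns(pages, dram))  # ~106 ns with this placement
```

Because most accesses hit the hot pages kept in DRAM, the blended latency stays close to the DRAM figure, which is the intuition behind tiering latency-tolerant data onto CXL.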
Starting point is 00:09:14 So when I look at what we are doing and what we are bringing with the 4th Gen AMD EPYC processors, we're really bringing these six different use cases that are really, really important for our customers. And bringing that on the very first generation of the processor is unprecedented for any technology development that I've seen.
Starting point is 00:09:34 So we're really proud about it and the way we've brought it to the market. Mahesh, you just described an amazing value proposition of new capabilities with CXL. What has the customer response been? I know that the large cloud providers are very deeply involved in the consortium, but how has the broader market response been? And do you feel that enterprises by and large have really understood what is about to be
Starting point is 00:10:03 available to them with their infrastructure? Yeah, I think we're starting to see both, right, from a large cloud provider's perspective. And I think one of the key things is pretty much across the board, right? What are we doing, right? From an AMD perspective, we're at the forefront of driving core scaling, right?
Starting point is 00:10:21 We're bringing more cores, more capabilities into the system. To support those cores, to support the, you know, bandwidth requirement and the capacity requirement, there are certain constraints on what we could do based on the existing memory technologies. So at the very first go, CXL is addressing some of the shortcomings by providing a flexible operation, you know, opportunity to either meet the memory capacity or the memory bandwidth requirement by extending to CXL. Now, it has certain TCO advantages that you can take benefit from. And those things aren't only limited to large cloud providers. For example, on the enterprise, if you're deploying large in-memory database sort of a system,
Starting point is 00:11:04 you can start to take advantage of what CXL has to offer from a TCO perspective. If your applications are targeting high-performance computing, or applications that require more bandwidth, CXL is a way to provide that bandwidth at an effective cost. And one of the things that we're going to see as we deploy more CXL is you would be able to look at your applications and profile them in terms of their performance requirements. And once you understand them, some preliminary results indicate that, you know, 25 to 30% of applications are not latency sensitive. So if you can map those applications into CXL, it now allows you to deploy a solution
Starting point is 00:11:49 where for your most performance-sensitive applications, you're targeting direct attached memory. For other applications, you can target this other tier, right? So it's starting to open up these sorts of discussions, both in the cloud as well as in the enterprise, where people will start to understand the value that this is bringing, and then see how they can make use of it for the applications that they're going to deploy. You know, we're at MemCon and memory is obviously central to CXL and what it can bring to the table. Let's just take a step back for a second and ask the simple question.
Starting point is 00:12:33 Why is memory capacity so important to applications and what's driving that? And where do you see the near term opportunity for CXL to really make a difference with memory? Yeah, I'll start with two fronts on that one. One, I kind of sort of addressed in the previous question, which was really from a core scaling perspective. We're just looking at if we were to not even change the applications that you have. And if you're looking at the amount of cores that we're adding and from a core scaling perspective, we have to have a solution that can keep up with the bandwidth demand and the capacity demand to feed the cores. And the memory technology isn't necessarily keeping up with that.
Starting point is 00:13:12 We have some constraints, either platform constraints, channel constraints, memory technology constraints, that are not scaling at the ramp that we're scaling the cores. So at the get-go, you need a solution that's providing you the flexibility. The second one is from a memory capacity perspective. One of the points that we've understood is, as applications are improving their capabilities, we're seeing the capacity that is needed for an application grow every year, right? So there's a demand for more memory capacity
Starting point is 00:13:46 for a given application. And then new use cases such as AI and ML, the embedding tables that you need, for example, for a recommendation engine and workloads of that sort, they're growing exponentially. So as we look at that growth, it's creating a demand for more capacity, right? And then how do you address capacity?
Starting point is 00:14:08 All of those constraints that I talked about, memory scaling, and then memory is a significant cost of a data center, right? So the price is also increasing. So what are the ways you can optimize for that? And CXL provides you this opportunity to innovate and bring solutions to the market that would meet the application needs for growing capacity, as well as the needs for the system to feed the cores, both in capacity and bandwidth.
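That core-scaling pressure can be put in rough numbers. The sketch below is purely illustrative (the core counts, per-core bandwidth demand, and channel bandwidth are invented, not AMD or DDR5 figures), but it shows how demand from a growing core count outruns a fixed set of direct attached channels, leaving a shortfall that CXL-attached memory could absorb.

```python
# Back-of-the-envelope look at why core scaling outruns direct attached DRAM.
# Every number here is an illustrative assumption, not an AMD specification.

def bandwidth_shortfall_gbs(cores, gbs_per_core, channels, gbs_per_channel):
    """Return how many GB/s of demand direct attached DRAM cannot supply."""
    demand = cores * gbs_per_core
    supply = channels * gbs_per_channel
    return max(0.0, demand - supply)

# A hypothetical socket: 12 memory channels at ~60 GB/s each,
# with each core wanting ~8 GB/s of memory bandwidth.
for cores in (64, 96, 128):
    short = bandwidth_shortfall_gbs(cores, gbs_per_core=8.0,
                                    channels=12, gbs_per_channel=60.0)
    print(f"{cores} cores: shortfall {short:.0f} GB/s")
```

With these assumed numbers, the channel supply covers 64 cores but falls short at 96 and 128, which is the gap a CXL memory expander is positioned to fill in either bandwidth or capacity.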
Starting point is 00:14:36 So that's what's driving that. And MemCon is the perfect place because you've got all of the folks who are focused on memory technology coming together. I do expect a lot of traction on CXL and a lot of talks related to CXL that are going to be at the center of the discussions at MemCon. What are AMD's plans for leadership in this space moving forward? And how do you see the evolution of the technology in terms of deployment in the next few years?
Starting point is 00:15:06 With 4th Gen AMD EPYC, we're leading the space with our processor, with very, very innovative capabilities. And we're really hitting on the six to seven different use cases that our customers are targeting. Some of them are more mature. Others are in the development phase, right? But we see this very sort of a nice roadmap for how these features are going to come out. At the heart of that,
Starting point is 00:15:33 in terms of, you know, what's the leadership? What I keep telling all of the teams that I engage with is, at the forefront, we've got to prove that CXL is functional and performant, right? Which means we would start with memory expansion, direct attached memory expansion with DDR4, DDR5 memory. And we're working with the entire ecosystem, with the controller vendors, on their architecture very, very closely to make sure
Starting point is 00:15:59 that we can bring performance solutions to the market. And it'll start with AMD EPYC and with a lot of our partners. And we're seeing, in this year, these solutions come to the market. We have a production CPU. We're expecting production level devices available in 2023. That would be what would start sort of this adoption of CXL. What follows that is just building on top of these capabilities, right? You bring in direct attached memory expansion, and you extend the capability to, you know, work with the ecosystem to enable these tiered memory solutions, and then the optimized tiered memory solutions with, you know,
Starting point is 00:16:41 developments in the ecosystem to improve performance. And once that's established, we see that setting a stage for disaggregated memory, persistent memory, and lots of the use cases that follow. So that's how we see this as, you know, sort of this crawl, walk, run approach: start with direct attached memory expansion
Starting point is 00:16:59 and then build on top of that. And, you know, it's going pretty well. I'm pretty happy with the sort of progress that we're seeing in the ecosystem. And it takes a village, it takes everybody, right? It takes CPU vendors, the ecosystem, the software development, all of it to get together to lift this technology up. This isn't just one player, it's an ecosystem that'll need to get together to drive it, and events like MemCon and other events are really important because they bring people together and drive the technology forward. Mahesh, when you look at the consortium itself, you've released a 3.0 spec. That's going to take us, you know, through a few years at least before we start
Starting point is 00:17:47 seeing 3.0 solutions at scale. What is next for the consortium in terms of making sure that this technology is adopted well and performs as you and the technical task force intend? I think one of the things, the question is, you know, what is happening with 3.0 and how did it come about? There were all of these use cases and interests that the ecosystem had, and requirements that were coming into the consortium. But we had to look at it and lay that out across spec versions as incremental development. So with CXL 1.0, you bring key features in. With CXL 2.0, you provide some scalability, you add some extensions for what didn't exist in 1.1, persistent memory
Starting point is 00:18:33 as an example. And CXL 3.0 then finishes it by providing you the sort of scaling factor for these capabilities. The direction for the consortium is, now that we've defined that, give it a little bit of space for all of these technologies to mature, the products to come into the market, and then start thinking about the next generation of CXL. So we're going to see some amount of slowness in terms of the next version of the spec. And primarily, it is for us to be able to deploy solutions, get some feedback from what exists and what the experience has been,
Starting point is 00:19:10 and then drive that forward. That doesn't mean the innovation will stop. We will continue to look at CXL 3.0 and beyond for key features that are really important and can't wait for the next generation; those can be brought in as ECNs, which are engineering change notices, things like that. But the whole direction is, now that we've laid it out, what it looks like from an ecosystem
Starting point is 00:19:35 perspective also helps you to kind of look at it and say, what is the end goal in terms of the overall scale-out capability? Where can you start and then build it, right? So it's set up for that crawl, walk, run approach. We're still at the crawl stage from an ecosystem deployment perspective, but the vision is laid out, and the path is there for the ecosystem to go drive together. That's fantastic.
Starting point is 00:20:00 One final question for you. You've put out a lot of information, both on CXL as a technology and AMD's plans. Where would you send folks for more information? Outside of MemCon, the CXL Consortium is a good place. If people want to know more about it, they could reach out to the consortium. The consortium does a fantastic job of releasing webinars, training materials, and tutorials for those who are either new to the technology, or those who are well entrenched in the technology and want to learn more. So all of that material is available.
Starting point is 00:20:44 There are periodic training sessions that the consortium does. If you want to find specific information about what AMD is doing, find me on LinkedIn. Or if you're coming in through a company, you can engage with your AMD rep and they know how to connect to the technical folks.
Starting point is 00:21:01 So that would be the way to sort of get connected with the technology. Fantastic. Thank you so much for being with us today, Mahesh, and giving us this great primer on CXL and how it's going to impact data center infrastructure. I can't wait to see more as we continue moving forward. Yeah, we're really excited about it.
Starting point is 00:21:20 And we're really, really excited to bring this out into the market and, you know, happy to talk to as many people as possible in bringing this technology out. So like I was saying, it's a village that requires all of us to get together to drive this forward. Thanks for being here. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
