In The Arena by TechArena - Efficient and Secure Data Center Computing with AMD's Robert Hormuth and Prabhu Jayanna

Episode Date: November 28, 2023

TechArena host Allyson Klein sat down with AMD’s Robert Hormuth and Prabhu Jayanna at the Open Compute Summit to discuss the advancements AMD EPYC processors have delivered in performance, performance efficiency, and security capability, including AMD’s Infinity Guard technology.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and I'm coming to you this week from the OCP Summit in San Jose, California. I'm so delighted to be joined by Robert Hormuth and Prabhu Jayanna from AMD. Welcome to the program, guys. Thanks, Allyson. Thanks for having us. Thank you, Allyson. So why don't we just start with introductions. Robert, do you want to introduce yourself? Sure, Robert Hormuth, I'm Corporate Vice President of the Data Center Solutions Group Architecture and Strategy team. So we're focused
Starting point is 00:00:54 on finding the yellow brick road to the future and planning the path there. And Prabhu? I'm Prabhu Jayanna, Senior Director, Product Security, Architecture, and Advanced Prototyping. I focus on developing advanced concepts and getting them into products and roadmaps over time. You know, OCP is one of my favorite events of the year. I think it's a great place to get a sense of where the large cloud service providers are aiming for their future infrastructure and what that means specifically for the silicon arena. Forrest Norrod gave a talk earlier today where he talked about the EPYC architecture and how it's made incredible advancements in terms of what these large players can do. When you look at cloud native, and we'll just start with you, Robert, when you look at cloud native workloads and you look at what you've delivered with EPYC, what do you think stands out as the real differentiating capabilities of that architecture?
Starting point is 00:01:52 And what are you hearing from your large customers about what they need as they move forward in this AI era? Well, I think, you know, we need to start a little bit with cloud native as a catchall phrase. It's almost as glorious as cloud. But when I think about cloud native, it's more about a principle of design. It's an architecture. It's a style. It's a replatforming of applications that is really focused on having application scalability, flexibility, and resiliency built into the application, not the infrastructure. It's about observability of the applications, the speed of delivering, updating, managing, deploying.
Starting point is 00:02:33 So it's a whole bunch of things that define cloud native. And, you know, there's a lot of workload changes that have gone on over the last, you know, 40, 50 years. When we think back to the large monolithic physical applications of yesteryear, to the virtualization era, well, that's more of a delivery vehicle, but it's still a very monolithic style of application development and framework. But then as you start to move towards containers and microservices and functions as a service, you know, the implications for the underlying hardware are very profound. When you think about a large monolith, multi-core threading, shared variables in
Starting point is 00:03:18 the caches, all that locality is really important. And, you know, you have maybe one per machine. You pivot to the far contrast of functions as a service, and there's thousands of these little functions. They talk to each other over TCP/IP. It's a write-once, run-anywhere, run-when-needed kind of model. And so it really drives architectural choices at the microarchitecture level because of that drastic difference in terms of density of functions and data locality.
Starting point is 00:03:53 It creates new system bottlenecks in terms of memory bandwidth and branch predictors and data locality. And you think about these functions that maybe run for milliseconds. It's hard to get your branch predictors to train when you have thousands of them running, and they only run for short segments of time. And so that's why, you know, at AMD with the Zen 4 microarchitecture, we were able to take, you know, one base microarchitecture and then optimize for that whole spectrum. Look at the suite of the Zen 4 launches: from Genoa, with big cores, high frequency, and big cache, to Genoa-X, with even bigger cache, to the new Bergamo, with high core density.
Starting point is 00:04:41 We tailored it more towards that cloud native with smaller caches and more cores and more defined, you know, guaranteed memory bandwidth and all of that. And so we are able to take one microarchitecture with a common ISA, which is really key, having one ISA, and tailor the design by changing some of the knobs on the spider charts to go optimize it, all those things. And that's really what Forrest was talking about this morning: our ability to take our microarchitecture, our chiplets, and our packaging to dial in more unique optimizations for the ranging workloads as they change. Large physical monolithic applications are going away, but just like the mainframe
Starting point is 00:05:20 still exists, there are still some mainframes. And so we want to optimize to get the best performance per watt across that spectrum. We want to make those optimizations.
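To make the contrast Robert draws concrete, here is a minimal sketch of the functions-as-a-service model: a tiny, stateless handler that services one short request over TCP/IP and exits. The handler name, port, and payload are illustrative, not any specific FaaS framework's API.

```python
# A minimal sketch of the functions-as-a-service model described above:
# a tiny handler that services one short request over TCP/IP and exits.
# The route, port, and payload are illustrative, not a real FaaS API.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class ThumbnailFunction(BaseHTTPRequestHandler):
    """One of thousands of small, stateless functions: it holds no shared
    state, talks to peers over TCP/IP, and lives for milliseconds."""

    def do_POST(self):
        start = time.perf_counter()
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = {"input_bytes": len(body),
                  "elapsed_ms": (time.perf_counter() - start) * 1000}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 8080), ThumbnailFunction)
    server.handle_request()  # serve a single invocation, then exit
```

Each invocation lives for milliseconds, which is exactly why, as Robert noted above, structures like branch predictors never see any one function long enough to train on it.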
Starting point is 00:05:43 When you look at where we're going in terms of pure performance capability for AI (and it's been a big discussion today, and I'm sure it will be for the entire week), when you look at that workload and you look at training and inference, where do you think CPUs come into play? And how does AMD more strategically address the full continuum of the workloads across AI models with your full portfolio? Well, I think, you know, the one thing that I heard this morning, at least, was in Meta's description of all the different spider charts they showed of the different stages of inference, prefill and decode. And it showed all these unique, different optimizations. And I think that's where the advanced architectures that we have in the works, with our unique packaging to mix and match, come into play to cover all those. Because it's not going to be one size fits all.
Starting point is 00:06:29 Right. There are plenty of inference jobs and applications that are going to run perfectly well on a CPU with AVX-512. There are going to be these trillion-parameter training jobs that need to run on what we call a flagship GPU, which is kind of the high, high end of the GPU range. And you're going to have everything in between.
Starting point is 00:06:51 What's really going to be critical is the compatibility of the software across that spectrum. So for AMD, we're very focused on the CPU side on delivering core scaling with the right ingredients for AI at the CPU level. And then on the Instinct line, we've got the flagship GPUs that we've talked about, the MI300 family that is coming out. That will be very optimized for some of those spider charts that Meta showed around training and large-scale inference, where the CPUs just aren't going to be able to do it. And so we're trying to optimize across that spectrum.
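As a concrete illustration of that split, here is a sketch, assuming a Linux host, of the kind of capability check a deployment script might make before choosing the CPU path. The flag names come from /proc/cpuinfo; the parameter threshold and routing policy are invented for the example.

```python
# A sketch of the capability check a scheduler might run before routing
# an inference job down the CPU path: look for AVX-512 feature flags in
# /proc/cpuinfo. Assumes a Linux host; the 10B-parameter routing
# threshold is made up for illustration.
def cpu_flags(path: str = "/proc/cpuinfo") -> set[str]:
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def pick_inference_target(model_params: int) -> str:
    flags = cpu_flags()
    has_avx512 = "avx512f" in flags       # AVX-512 foundation instructions
    has_vnni = "avx512_vnni" in flags     # int8 dot-product acceleration
    if has_avx512 and model_params < 10_000_000_000:
        return "cpu, int8" if has_vnni else "cpu, fp32"
    return "flagship gpu"                 # trillion-parameter-class jobs

if __name__ == "__main__":
    print(pick_inference_target(7_000_000_000))
```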
Starting point is 00:07:35 Everybody's talking about Meta's spider chart today. I know. That was the best chart I've seen, and it was great. I'm going to have to get a picture of that for the Tech Arena. We're talking about performance, but I want to make sure that we cover other vectors of what you're delivering from a processor standpoint. And I'm just going to start with security. Security is something that is always ranked as a CIO's top priority. Beyond performance, security matters.
Starting point is 00:08:00 And hardware security is really critical. I know that you have baked a tremendous amount of security features into the EPYC product line and take this very seriously. Can you just start with a review of what's in the processor today? What does it deliver in terms of core capabilities? And then we'll get into where you're going to go in the future. Sure, I will take that. At the core of our SoC architecture,
Starting point is 00:08:29 one of the key principles that we have to follow is ensuring the integrity of the very first instruction that executes on our CPUs. At the heart, we have our AMD Secure Processor, which we have built and shipped across many of our portfolio products. That anchors the root of trust, and from there the chain of trust builds all the way up to the runtime environment, to start with. And the critical part is that it is not just the SoC that does all the work.
Starting point is 00:09:07 It's the combination of the SoC, the firmware that runs within and on top of the SoC, and the applications and infrastructure above it, to ensure the end-to-end CIA attributes, confidentiality, integrity, and availability, are secure along the way. Once you have the SoC in that form, you still need a platform to do it. Right.
Starting point is 00:09:31 So fundamentally, we need to have a secure platform to do any of these advanced operations, you know, load a new AI model on it and make sure it is not tampered with, right? And make sure nobody takes that IP away from that platform. How do we secure that? Platform-level attestation to ensure the integrity and the authenticity of
Starting point is 00:09:54 the platform is a key requirement. Then we ask, am I supposed to allow a particular application to run on this platform? Access control attributes come in, authorization comes into play, and once you check all of those, then come the advanced features that we have on our products for the high-end confidential compute type of environments.
Starting point is 00:10:25 The other part that we also need to comprehend in the big picture of that scheme of things is not just adding security features, but making sure you have a high assurance bar on them, and that you have ample transparency capabilities built around them. As you might have heard in this morning's keynote, OCP announced the SAFE program. We are piloting and collaborating with a bunch of our partners in the industry so that as we build our firmware along the lifecycle of the development process, we do constant checks on the quality of the code. That's your starting point before you get into any of the more sophisticated attacks that arise. We ensure that the code is audited by a third party so that our customers feel comfortable
Starting point is 00:11:11 that this is not just us building the code; somebody else has looked at it and endorsed it along the development lifecycle of the product. And that endorsement becomes a key part in the whole platform attestation, where the assurance has a certain score: I'm going to look at this device and then look at the entire platform. Now I can go and say, hey, this is a secure platform that hasn't been tampered with. Now I can unleash a workload on it.
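The chain-of-trust idea Prabhu describes can be sketched in a few lines: measure each boot stage before handing control to it, extend a running digest, and compare the chain against vendor-endorsed golden values. This is a conceptual illustration of measured boot, not AMD's actual Secure Processor implementation.

```python
# A simplified sketch of a measured-boot chain of trust: measure each
# stage before it runs, then compare the cumulative measurements against
# expected values from an endorsed manifest. Conceptual only; not AMD's
# actual Secure Processor flow.
import hashlib

def measure(blob: bytes, previous: bytes = b"") -> bytes:
    # Extend the running measurement, PCR-style: H(previous || H(stage)).
    return hashlib.sha384(previous + hashlib.sha384(blob).digest()).digest()

def attest_boot_chain(stages: list[bytes], manifest: list[bytes]) -> bool:
    """Return True only if every stage's cumulative measurement matches
    the manifest; a single tampered stage breaks the whole chain."""
    running = b""
    for blob, expected in zip(stages, manifest, strict=True):
        running = measure(blob, running)
        if running != expected:
            return False
    return True

if __name__ == "__main__":
    stages = [b"first-instruction ROM", b"boot firmware", b"OS loader"]
    manifest, running = [], b""
    for blob in stages:                 # golden values, computed offline
        running = measure(blob, running)
        manifest.append(running)
    print(attest_boot_chain(stages, manifest))   # True
    stages[1] = b"tampered firmware"
    print(attest_boot_chain(stages, manifest))   # False
```

Because each measurement folds in the previous one, an attacker cannot swap a later stage without invalidating every measurement after the tamper point, which is what lets the platform say "this has not been tampered with" before a workload is unleashed on it.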
Starting point is 00:11:51 So, Prabhu, when you look at security, we know that bad actors are getting smarter and smarter every day. They're finding new ways to try to exploit organizations and get access to applications and data. AI is giving them some really fancy tools to make them even more sophisticated at this. When you look at that challenge, is there anything that the hardware industry, and AMD in particular, is aiming to do in the future to give organizations even more tools to protect themselves? Absolutely.
Starting point is 00:12:22 Back in the day, data at rest was the prime thing one would go after: I need to encrypt the data at rest. Then data in motion. Now, how do we protect data in use? That's where all of our confidential computing capabilities come into play, ensuring whatever is in use is protected from bad actors.
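To pin down the three states Prabhu walks through, here is a sketch using the third-party cryptography package: software can readily encrypt data at rest and in motion, but the in-use state, the moment of decryption, is what hardware features like the memory encryption in AMD's Infinity Guard exist to protect. The Fernet calls are real; treating them as stand-ins for storage encryption and TLS is the simplification.

```python
# A sketch of the three data-protection states. At rest and in motion are
# solvable in software; data *in use* is what confidential computing
# protects in hardware, which no library call below can substitute for.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()
f = Fernet(key)

# 1. Data at rest: encrypt before it touches storage.
ciphertext = f.encrypt(b"customer record")
# 2. Data in motion: carried over an authenticated, encrypted channel
#    (in practice, TLS); reusing Fernet here just keeps the sketch short.
wire_blob = f.encrypt(ciphertext)
# 3. Data in use: the moment we decrypt to compute on it, the plaintext
#    sits in RAM. Protecting *this* state is the hardware's job: a
#    confidential-computing VM keeps guest memory encrypted so that even
#    a compromised hypervisor sees only ciphertext.
plaintext = f.decrypt(f.decrypt(wire_blob))
assert plaintext == b"customer record"
```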
Starting point is 00:12:56 When we look at a platform, there's something about balance and sustainability that's important. OCP has a huge initiative around compute sustainability, and I know that AMD takes energy efficiency and delivering sustainable platforms very seriously. Robert, what advancements have been made on the platform to drive increased energy efficiency? And how do you look at that challenge when you're working with your architects on what are the core capabilities to think about? Is it driving better performance per watt?
Starting point is 00:13:20 Is it really about getting to idle faster through pure performance? How do you ensure that you're delivering the most sustainable product to the customer? Yeah, so for us, it really starts at the microarchitecture, if you think about our core design, and we're trying to make a highly efficient and performant core. And Forrest talked about it this morning, about how if you look at our core size versus the competition, there's a significant delta, right, between the two. And that's because we're focused on providing core scaling,
Starting point is 00:13:58 where everybody benefits from core scaling. And we do believe in accelerators in the compute. We have a whole GPU lineup, but we don't believe that you should inundate your core with things that aren't used a lot, because that just adds transistors and sinks power, and they don't tend to have the biggest bang for the buck when they're buried deep inside the core. They're better off outside the core; a truly optimized domain-specific architecture is always going to be the best performance per watt.
Starting point is 00:14:26 So we start with a highly efficient core, give you a lot of them, keep them lean and mean, give you a great balanced SoC with high memory bandwidth and high IO, and deliver platform-level features to guarantee quality of service between all those cores so that they don't interfere with one another. So we're really focused on driving the performance per watt. And as Forrest talked about with some of the advanced packaging technologies,
Starting point is 00:14:58 our ability to drive beyond the reticle limit of silicon comes from using chiplets. If you saw his charts, you saw kind of the reticle limit of a monolithic die versus how much silicon we can put in a package using chiplets: it's an order of magnitude more silicon that we can put in. And the more silicon we put in that is collaborating to deliver performance per watt, the more energy we save on transferring data back and forth between chips and so forth. And so that's really the best lever we have today to drive energy efficiency: deeper levels of integration, core scaling, and breaking the reticle limit on silicon
Starting point is 00:15:42 design, so that ultimately, yes, we are driving much higher performance, but we're moving the performance-per-watt knob up significantly. Versus if all we delivered was a monolithic die at the reticle limit, the performance-per-watt improvement would be very, very small. By breaking free of that reticle limit and advancing Moore's law, we can really move the needle on performance per watt, which is the best way to drive data center efficiency and consolidation.
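A quick back-of-the-envelope example shows why the performance-per-watt knob, rather than raw performance alone, is what drives consolidation. All figures below are invented for illustration; none are AMD benchmarks.

```python
# A back-of-the-envelope worked example of the consolidation argument.
# All numbers here are made up for illustration; none are AMD figures.
import math

old_servers = 100
old_perf_per_server = 1.0          # normalized throughput
old_watts_per_server = 400

new_perf_per_server = 2.5          # higher perf AND better perf/watt
new_watts_per_server = 500

# Servers needed to deliver the same total throughput:
new_servers = math.ceil(old_servers * old_perf_per_server / new_perf_per_server)

old_power = old_servers * old_watts_per_server
new_power = new_servers * new_watts_per_server
print(f"{old_servers} old servers -> {new_servers} new servers")
print(f"fleet power: {old_power} W -> {new_power} W "
      f"({100 * (1 - new_power / old_power):.0f}% reduction)")
# 100 old servers -> 40 new servers
# fleet power: 40000 W -> 20000 W (50% reduction)
```

Note that each new server draws more power than an old one; the fleet still halves its consumption because performance per watt, not per-box wattage, sets how far you can consolidate.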
Starting point is 00:16:15 As an industry, if you turn the clock back in time and look at the big picture, we've been consolidating from the mainframe for the last 50 years, right? From these big monolithic machines, we've been consolidating towards these highly integrated devices with the right functions and features. And that is the best way to drive performance efficiency. Thank you guys so much for being with me today. I know it's a busy conference, and you took out some time to be on the Tech Arena. One final question for you both. Where can folks engage to learn more about the EPYC product portfolio and engage with your teams to talk about their own needs and potentially deploying some EPYC servers in their data centers?
Starting point is 00:16:54 We have our AMD booth on the Expo floor. Come drop us a contact card. We'll get hold of you and walk you through what we need to do to get you on board and running on EPYC. Yeah, and to add on to that, I think one of the big principles of AMD's culture that Lisa has really instilled in the company is that we are customer inspired. We listen to our customers. We engage our customers.
Starting point is 00:17:20 We try and figure out what problems they're trying to solve. We don't have all the answers, nor do we understand all the problems. So the best thing we can do is engage the at-scale customers and understand their at-scale problems. And I can tell you from my experience, the at-scale problems are different from the enterprise problems. And we have to listen to that whole range of customers and optimize our products, our security, and our solutions based on the direction that we're getting from our customers, and then come up with innovative solutions. And that's really the heritage of AMD: a very advanced architecture and engineering company out to solve the world's biggest problems. You can only do that by listening.
Starting point is 00:18:01 Well, thanks, guys, for being on today. That was really inspirational. Thanks for the time. Thank you. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
