In The Arena by TechArena - Inside AMD’s Vision for AI and Data Center Evolution

Episode Date: October 23, 2024

In this episode of Data Insights by Solidigm, Ravi Kuppuswamy of AMD unpacks the company's innovations in data center computing and how they adapt to AI demands while supporting traditional workloads.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein. Now let's step into the arena. Welcome to the arena. My name is Allison Klein, and we are coming to you from the OCP Summit in San Jose, California. And it's another Data Insight Series podcast, which means Janice Horowski is with me. Welcome back to the program, Janice. Oh, thank you, Allison. It's great to be back. So it's been a great day at OCP Summit. What have the highlights been thus far? Gosh, so many great things. Power efficiency, power efficiency all day, liquid cooling.
Starting point is 00:00:51 But I'm really, really excited today to have the opportunity to sit down with Ravi Kuppuswamy, who is the Senior Vice President of AMD Server Systems. And Ravi, welcome to the show. Thank you, Janice. And thank you, Allison, for having me on this show. Ravi, we have spent a lot of time on the program discussing compute requirements for the AI era. But data centers are still looking at conventional workloads. I mean, you wouldn't know it at OCP, but we still have the need to service the full continuum
Starting point is 00:01:28 of workloads within the environment. Can you just start the conversation with an introduction about your group, how it relates to AMD's full portfolio, and how you envision computing evolving from this generational disruption? Oh, thank you, Alison. I'd be happy to. My name is Ravi Kukuswamy. I actually run AMD's server product and engineering divisions. Now, if you talk about servers, people usually talk about the CPUs. And my job is essentially to go ahead and instate the new roadmap for AMD for server CPUs. And then carry that all the way to the time when we deliver that to the customer and have them ramp. So thank you for mentioning the general compute or conventional workload service. And because that often gets overlooked
Starting point is 00:02:25 in today's immense focus on generational AI and broad market models, et cetera. Especially in the Epic business, which is our branding for server CPUs, we have been very consistent in recognizing the new demands of AI-enabled applications. But we remain steadfast in making sure Epic continues to offer leading performance
Starting point is 00:02:52 in traditional general compute workloads, such as HPC, database, cloud native applications, collaboration systems, finance, and more. This has allowed us to see these traditional applications are indeed also adapting and adding elements of AI into their application environments. You can look at a wide array of apps from Microsoft, Oracle, SAP, etc.,
Starting point is 00:03:20 and see them adding AI-enhanced tools, such as recommendation engines, chatbbots and stuff into their application. For that reason alone, while massive AI models are indeed a significant step, but functional and disruptive, the vast majority of real-world applications still are focused on, are more evolutionary, are focused on general compute. Awesome. Gosh, you know, thank you for bringing up a lot on that, Ravi. You know, AMD has delivered some really incredible products,
Starting point is 00:03:55 specifically at the show, even this week. So let's talk a little bit about starting with Epic and how you've extended that to the MI300 solutions and then Yon. Can you tell us a little bit how your customers can apply this portfolio, you know, given the overall evolution? Thank you, Janice. I would. This question does come up a lot.
Starting point is 00:04:17 Okay. Of course, we can be somewhat agnostic because we believe we want to offer leadership products for whatever the customer wants. We'd like to let the customer needs guide the discussion. Why do I say that? There are certain companies who only have a certain portion of the portfolio, but AMB has all the different aspects of it, whether it be CPUs, GPUs, AI NICs, and so on and so forth. We want to essentially say it's not a hammer all the time because there isn't just a nail. We have a diverse set of portfolio that we can actually run.
Starting point is 00:04:59 How is the AI element interacting with traditional workloads? We mentioned previously in the earlier question, we want to make sure we are non-disruptive to existing, run-the-business applications that largely live on CPUs today. Similarly, scoping out the size of the models and how much training needs to be done, this will also significantly impact the CPU or CPU plus GPU's choice. Morals over 13 billion parameters or heavy training needs might likely need a combination like a CPU plus GPU. And then there
Starting point is 00:05:34 are myriad other considerations, power, data center, footprints, latency needs, et cetera, and budgets. Some people may want to do X, but they only have a certain amount of budget. So they may have to go choose something at the lower end of the portfolio. Being able to have the performance and efficiency metrics for various CPU and CPU and GPU configurations is super critical. So bottom line, we want to let our customers choose based on their needs, their TCO requirements from the broad portfolio there is. Now, this week we're at the OCP Summit. I know that you were at a major launch event last week, which was so fun to watch. You're focusing here on the leading edge of hyperscale design.
Starting point is 00:06:21 When you think about OCP configurations, do you need to design in a different way and how do you meet these specific customer domains? No, thank you, Alison, again. I know yesterday Forrest Norrod made a keynote and he actually showed a picture. I don't know how, if people had a chance to see it. He had a picture of a hardware stack that goes all the way from the chip all the way up to the system rack and so on and so forth. And similarly, you had a software stack that goes all the way up as well. And the main point essentially is both system and data center design need to evolve and continue to support business demands and software architectures. The evolution of systems and data center design is a major reason that technology-related global energy consumption has risen so much.
Starting point is 00:07:17 And it's so much more slowly than the amount of data that is created and distributed. The infrastructure has become more and more efficient. So you need to take that system black level design down, incorporate it into the chip or incorporate it into the lowest levels of software to get an optimal, energy efficient, flexible and open design. So in my mind, for industry standard service,
Starting point is 00:07:47 this includes having a breadth of offerings optimized for the market needs and entry-level systems to edge-optimized platforms to a scalable line powered by either our Zen4 and Zen5 C cores. And of course, we deliver all of this using an open ecosystem. AMD at the heart of it embraces open like no other across all our teams and all our product portfolio.
Starting point is 00:08:14 Love that. I love that. So if we want to take a step back and compare hyperscalers to enterprise, how do you really see AI influencing infrastructure on-prem? And then how does this differ from the other big guys? Thank you, Denise. The learnings, technologies, and techniques implemented by the scale-out hypervisors have
Starting point is 00:08:37 really provided broad benefits beyond their own infrastructures. If you look at it, very few enterprises will ever get to the scale of our big hyperscale cloud vendors. Enterprises have seen that most things, I would say, in technology, definitely,
Starting point is 00:08:56 starts at the highest and the people with the most scale and then waterfalls its way down into other markets and everybody sees the value of it as that technology becomes more widely adopted and becomes more economical for use so enterprises have seen that type of energy efficiency management resource efficiencies and developmental practices initially pioneered by hyperscaters as targets to help them drive better utilization and
Starting point is 00:09:26 impact from their IT environments. Of course, cloud has also given them another arrow in this IT quiver, and such that when they are straining on-prem options, they can leverage the cloud. So you build an infrastructure for what you want on premises, but suddenly if your scale goes beyond, then you can leverage the same by using the cloud. So I do think there is a huge dependency. And one huge difference to contrast is budget and for these big hyperscalers have immense amount of resources that they can deploy.
Starting point is 00:10:05 Very few companies have that. So leveraging them to utilize their technology is super important. Now, everything that you've talked about and everything that the collective customer base is doing depends on data. And data stores are increasingly distributed across edge and cloud environments. How is AMD working with your customers and ecosystems to help provide availability to extract value across this continuum? The distribution of computational power, data, and intelligence is ongoing and is inevitable. In business and personal life, we have come to expect
Starting point is 00:10:47 almost what I tell my kids all the time, instant gratification. You want to see the value. You mean instant and virtually free access to data and services. These are delivered in very increasingly personal ways. This tectonic shift has made our broad portfolio of solutions increasingly practical and important. And we can certainly deliver optimized compute engines from the cloud to the edge to the endpoints
Starting point is 00:11:18 to enable efficient processing. And virtually any form factor, any constraint that we have at that time if you look at our own portfolio from the epic 9005 to the data center epic 8004 and embedded pc which is the top of our line all the way down to the edge we are able to go provide a wide variety of you know ryzen ai and virtual powered endpoints that we can provide data and services wherever people want and most value
Starting point is 00:11:50 with the greatest efficiency. All right, so we got to switch gears a little bit. So Ravi, can you tell us a little bit about the announcement that's gone out this week? I know everyone wants to hear more about that. I would love your perspective. Oh, great. Okay. First and foremost, let me just say OCP has played an important role in driving initiatives around form factor, management, connectivity, and so much more reduced risk for customers. A role that I want to just up front
Starting point is 00:12:22 say AMD strongly supports. Innovation can be much easier if you're willing to drive down proprietary paths, but it is often detrimental to long-term health of customers and markets. So we prefer to drive and support innovation with the backing of standards. Getting to it, the big announcement of the x86 ecosystem advisory group is yet another example of how important we think this is. We can join with one of our biggest competitors. I think I've seen phrases, I think Forrest used the word, if pigs can fly. I've seen other people in the press say things like, hell, freeze it over,
Starting point is 00:13:05 and statements that have been made. But my point is, we can join with one of our biggest competitors to get agreement on common standards that is your compatibility, interoperability, and feature sets for developers and customers is vital. In my mind, the x86 ecosystem is so rich, it only allows the two big players in it to join hands
Starting point is 00:13:33 and actually show how customers can take advantage of it. This new initiative builds on existing AMD-led and OCP-supported initiatives. It's not a standalone, right? Ultra-invent, ultra-accelerator, these are all initiatives that AMD supports, and it is specifically targeting open ecosystem. It was an incredible thing for somebody that, you know, I spent over 20 years at Intel. I never thought that I would see that happen. Yeah, exactly. So,
Starting point is 00:14:01 you know, I think that this is a moment where, you know, I love industry and innovation. I love open standard innovation. And you guys just leaned in in a way that I never thought you would. Congratulations on that. And you have done a tear. You released Turin to the market. You released your new MI300 series to the market. You released the first ultra-Ethernet adapter to the market. That was incredible. Not expected by me. And then you followed it up by this incredible leadership from AMD. I want you to take a look forward and talk about 2025
Starting point is 00:14:38 and what you're expecting from data center computing on a macro level and how does AMD plan to play a role in that? Well, thank you, Alison, again. This is an exciting time in our industry. There's a lot of change, and change means opportunity for all of us. As you know, first there was a pandemic and the economic environment and last few years, many businesses with a lot of old and inefficient IT infrastructure
Starting point is 00:15:09 that needs to be replaced. In general, I am sure, Alison, you probably know too, that people have infrastructure that they change every three to five years. Okay? So if you look at a lot of the infrastructure that's there today, it's about four years on an average old. And that infrastructure, if you replace it with today's, you're talking about current, our fifth generation epic, the numbers that we've looked
Starting point is 00:15:36 at is if you took the top of stack four years ago, CTUs, and you have a thousand of them, you can do the same amount of work with 131 of today's fifth generation epic. Yeah, and that is reduced, we talked about energy efficiency, energy efficiency, that's a lot less power, lot less space. Now, if people want to, you know, fill it with AI compute, because that's the hardest thing, then want to fill it with AI compute, because that's the hardest thing, then you can fill it with AI
Starting point is 00:16:07 compute. If you want to fill that space and power with general compute, like others may need it, you can fill it with general compute. And I think the rough math is with this thing you saved, you can add 1.1 million tops of AI compute if you want it.
Starting point is 00:16:24 Wow, that's incredible. It's pretty cool. I just wanted to just maybe round it off by saying our broad portfolio, commitment to openness and standards, and Chimplus-based architecture has helped us differentiate
Starting point is 00:16:39 products and then make it very appealing for customers to use AMD. And then they are increasingly engaging us early because, roughly speaking, just to, you know, as you correctly said, about six, seven years ago, AMD virtually had less than 1% market segment share in this space. As of the first half of this year, we were at 34% and moving up.
Starting point is 00:17:11 So meaning, apart from the fact it's great news for AMD, meaning we have a great responsibility in this ecosystem to go ahead and provide energy efficient compute. And I think AMD is taking that very seriously. Awesome. Well, with all the excitement and news here at OSCP, right, it's such a big splash. But can you tell us, Ravi, where can your audience go to look for more information on AMD?
Starting point is 00:17:39 Thank you, Denise. We are always publishing new data. You know, we are also always putting our blogs, studies, and I'm here in a podcast and we continue to do podcasts like this too. And talking about customer success stories, okay, featuring AMD's Epic products. But if you're seeking to learn more, AMD.com slash Epic is actually a great starting point. You know, you'll be able to go ahead and follow AMD on X or LinkedIn to get notifications of new and upcoming announcements. For those interested in a deeper view than just my Cliff Notes version
Starting point is 00:18:18 or what I just provided here or the data center, the video of my session would be posted to the OCP site. I did one yesterday and, you know, we'd be happy to go ahead and have people look at that. And I'm also open. There are always people
Starting point is 00:18:36 joining me on LinkedIn and I'm open to sharing more data in that space as well. Thanks so much for being on the show, Robbie. I know this is a hugely busy week for you, and we really appreciate you spending time with us here on Tech Arena. So thank you, Alyssa and Denise. Thank you for having me on your show.
Starting point is 00:18:54 Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyrighted by The Tech Arena.
