In The Arena by TechArena - Cornelis Networks & AMD on Scaling AI Without the Chaos

Episode Date: January 20, 2026

From CPU orchestration to scaling efficiency in networks, leaders reveal how to assess your use case, leverage existing infrastructure, and productize AI instead of just experimenting...

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome in the arena. My name's Allyson Klein, and we are coming to you from OCP Summit in San Jose, California. I have two amazing guests with me. The first one is Lisa Spelman, CEO of Cornelis Networks. And the second is Ravi Kuppuswamy, back again on Tech Arena, Senior Vice President and GM of Server Product and Engineering at AMD. Welcome to the program to both of you.
Starting point is 00:00:41 Thank you. Thank you. Yeah, good to be here. Yeah. Now, I want to just start with a question about AI infrastructure. There's often what Lisa has called, on previous episodes of Tech Arena, a brute force approach to building out and maximizing GPUs. From your perspectives, what's the most complete story that enterprises need to understand
Starting point is 00:01:02 about building AI infrastructure that delivers results? I'll go first. Yeah, let me start by saying thank you for having me over. Oh, yeah, for sure. Okay, so yes, brute force is clearly not the right way to go. Most use cases here need to have the right type of compute and the right size of compute. What do I mean by that? Whether it be CPUs, GPUs, or accelerators, you don't want to have mostly one
Starting point is 00:01:32 and little of the other; you need to have the right mix and the right size, okay? That's number one. Number two is: how do you make sure there's end-to-end synchronization? Meaning it's not just about compute; it's also about memory bandwidth, it's about network sizing.
Starting point is 00:01:50 It is about how you have all of this mishmash together in a way in which it's truly giving the output that you desire. And then of course, above all, TCO rules all, okay? In any measure, you can have what you want to have, but your use case needs to have the right mix and the right type so that you are able to get the best TCO. I think that the brute force comment often applies to, think of, a hyperscaler or a neocloud, the pursuit of the frontier model. When you think about an enterprise, you're delivering tied to a specific business case, and so you might be doing inference or you might be doing some fine-tuning, but you actually have an opportunity to build
Starting point is 00:02:32 your infrastructure and think through that a little bit more in an end-to-end way for your business. So that's why I do think the brute force exists. I just think enterprises actually have a little bit more space and time to build something more attuned to their actual requirements and what they need. And we do see this happening, where people spend maybe the first year, or 18 months, of an AI journey pursuing their workloads in the cloud for that flexibility. Then they learn how their product team or their design team or their go-to-market team or their HR team actually is using these tools. And so you can take actual data and information about the use case and then work with partners to build the infrastructure
Starting point is 00:03:17 out. So I think you actually have a little bit more time in an enterprise to study the problem and then be thoughtful about it. Now, we know that from the compute side, as you scale, so grows complexity. Lisa, I have a question for you on this: how do you address that in terms of the way that you build out the network to handle that complexity? So I do view our job as a network solution provider. Our goal is to be the simplest, most performant portion of your stack. So we really see ourselves tying everything together. So whatever compute solution you choose, whether you go CPU-based, GPU-based, a combination, whatever it is, we want to have the solution that's ready to go and be adapted to whatever
Starting point is 00:03:57 has been chosen. And then if you look at just the core of our architecture and what we're building, one of our foundational, fundamental base design factors is how much scale can you add before you start seeing performance tail off? Sure. In a lot of current clusters, I've said this before, you can start to see scaling problems in as little as an 8-GPU cluster. Obviously, it gets worse when you get to 100 and 1,000 and 10,000, but it can get bad really fast. So we've put a tremendous amount of effort into the design and the architecture to ensure that you have what we call scaling efficiency. You add more and more compute to your problem as it gets bigger, and your network just keeps up.
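(Editorial aside: the "scaling efficiency" idea, how close an N-GPU cluster comes to N times the throughput of one GPU, can be sketched with a toy model. The numbers and the log-shaped communication term below are illustrative assumptions for the sketch, not Cornelis or AMD figures.)

```python
import math

# Toy model of scaling efficiency: all constants are hypothetical.
def step_time(n_gpus, compute_ms=100.0, comm_base_ms=2.0, comm_per_hop_ms=1.5):
    """Per-iteration time: compute divides across GPUs, but all-reduce-style
    communication grows roughly with log2(n_gpus) and does not divide."""
    comm = 0.0 if n_gpus == 1 else comm_base_ms + comm_per_hop_ms * math.log2(n_gpus)
    return compute_ms / n_gpus + comm

def scaling_efficiency(n_gpus):
    """Speedup over one GPU divided by GPU count; 1.0 means perfect scaling."""
    return step_time(1) / (n_gpus * step_time(n_gpus))

for n in (1, 8, 64, 1024):
    print(f"{n:5d} GPUs -> efficiency {scaling_efficiency(n):.3f}")
```

In this toy model, efficiency already sags at 8 GPUs and collapses at larger counts; keeping the communication term flat as the cluster grows is what "the network just keeps up" means.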
Starting point is 00:04:40 So we actually spend a really significant amount of time designing for this and building for that so that we remove complexity for our customers. Ravi, let's dive into compute. You know, scaling with compute is not easy. Tell me how you're tackling that at AMD. You know, that's great. If you really look at it, the CPU has by far been the most important compute element there has been for a long time, until recent times.
Starting point is 00:05:07 So what has happened essentially is it's taken on this role in which it does this end-to-end synchronization. What I mean by that is it looks at every aspect of the platform and determines essentially what aspects need to be operated where. In recent times, we've got accelerators and GPUs now, but most of the time, it has taken upon itself to go ahead and be the arbiter and solver of the overall problems that happen in the compute space. Now, because of that, what has happened is there's been a software ecosystem, built on the x86 foundation, which essentially has helped it do this in a more
Starting point is 00:05:52 efficient manner. There are lots and lots of applications that have been developed on top of it, where you don't have to go in and rewrite them; you can make minor changes, and multiple millions of users everywhere are able to get the scaling they desire. And then, above all, over time, it has also hosted the ability to go ahead and have a secure environment. It has all of the control functions and the logic in it that help us scale this everywhere, so that we are operating not just from a functional standpoint or a performance
Starting point is 00:06:25 standpoint, we are also operating from a secure standpoint. It's interesting that you're introducing the CPU as the orchestrator, and one of the things that I think about is the added complexity of increased heterogeneity, if I can say that word: different types of accelerators are being deployed, different types of workloads in different parts of the data center. How does the CPU tackle that? So if you really look at it, we talked about this in the conversation on brute force, and people think when you need to do a certain acceleration job, oh my God, I need to get myself
Starting point is 00:06:57 a ton of GPUs. Not really. Okay. What you really need is the right mix. Yeah. Okay. What I mean by that is what we said when we started: hey, if you have a certain use case and a workload, what is the right mix of compute? What is the associated network that you need to have for it? What do you need to have with respect to memory bandwidth and your access to resources that can actually get it? And when you put all of this together, you need the CPU to play an orchestration job, meaning like a traffic cop, that essentially is going to go ahead and say, hey, this is fine-tuning, this is coordination activity, I'll take care of it.
Starting point is 00:07:39 This is deep math. I'm going to send this to the GPU, and it does that really, really well. Or I'm going to send it to some accelerator, and it does that really, really well. And oh, by the way, I'm just finding out that I have memory close by that I can utilize to do, I'll call it, even inferencing jobs
Starting point is 00:08:01 quicker than sending this over to the GPU and then getting the data back and then reporting what the answer is going to be. To me, if you get this right balance together, the CPU can really operate in a way in which you can scale systems. I think you're talking about a much more real-world environment versus how we might set things up in a benchmark. Yeah, yeah, you can always prove that something's faster than something else, but you're talking about this real world where you might not take the time, you might not take the power to go do these other things. There are so many environments, especially again on the enterprise side, where instantaneous response is not always the requirement, but fitting within the power envelope of the data center infrastructure that I have and within
Starting point is 00:08:38 my opex constraints is a top-of-mind concern. So you can do a lot of balancing there across your whole compute, storage, and network system. Now, Lisa, I know that you know compute well from your history, but when you look at the network, how does the network assist with the compute challenge here? Yeah. So when we think of network architecture, we think of it as servicing workloads: the network bandwidth, very important, the message rate, the latencies, the overlap of collectives and communications. How does that all come together to service the specific workload? The unique and fun thing about the enterprise environment is that you can be running AI in a variety of areas through your enterprise, and each one of those is pulling on it a little bit differently. And so there are certain parts of high performance computing or AI workloads that are very latency sensitive or very message rate sensitive.
Starting point is 00:09:38 And for others, the thing they care the most about is how fast you can get to half bandwidth. When you put together a network architecture that can handle all of those, our capability allows the application to pull what it needs most. Right. And so that's, I think, again, a balance that we provide. And if you look at our hiring, I mean, yeah, we're a growing startup, so we're hiring for lots of different things, but one of the biggest things is this solution architecture space. We completely recognize that solution architects who are focused in on the enterprise use cases and workloads are not the same people who are going and working on a frontier model problem for a hyperscale customer.
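(Editorial aside: the latency-versus-bandwidth sensitivity Lisa describes, including "how fast you can get to half bandwidth," is commonly captured with the classic alpha-beta transfer model: time = fixed latency + bytes / bandwidth. The latency and bandwidth figures below are illustrative assumptions, not measurements of any product.)

```python
# Alpha-beta cost model for one network message: hypothetical numbers,
# not measurements of any specific fabric.
def transfer_time_us(message_bytes, latency_us=2.0, bandwidth_gbps=400.0):
    """Fixed per-message latency plus serialization time on the wire."""
    bytes_per_us = bandwidth_gbps * 1e9 / 8 / 1e6  # gigabits/s -> bytes/us
    return latency_us + message_bytes / bytes_per_us

def latency_fraction(message_bytes, latency_us=2.0, bandwidth_gbps=400.0):
    """Share of the transfer spent on fixed latency rather than moving bytes."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_gbps)

def n_half_bytes(latency_us=2.0, bandwidth_gbps=400.0):
    """Message size at which achieved bandwidth reaches half of peak,
    i.e. where serialization time equals the fixed latency."""
    return latency_us * (bandwidth_gbps * 1e9 / 8 / 1e6)
```

With these toy parameters, a small collective message is almost entirely latency, a multi-megabyte transfer is almost entirely bandwidth, and the half-bandwidth point is where the two balance; different workloads sit at different points on that curve.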
Starting point is 00:10:16 And so we've actually separated that out, because the requirements are very different. Even if you're using the same compute, same storage, same network, you're putting it together very differently. Now, I know that we're at OCP Summit, and the OCP organization put out an open letter saying that we need to move faster on open industry innovation. I was like, wow. You need to move slower? Yeah, exactly. It's like we're sprinting, but, like, bring it on. Now, both of your companies, in all seriousness, are very committed to open industry engagement.
Starting point is 00:10:47 Tell me what that looks like in this age. Ravi, do you want to take it first? Sure. Let's just start by saying open means innovation acceleration. There are just more people working on it, and more people are going to be innovating, and innovating together. Okay, so clearly openness is attributed directly to that. Then I think, secondly, it provides customer choice. At the end of the day, no matter how good a compute provider, an accelerator provider, or a network provider may be, the choice
Starting point is 00:11:19 is going to be dictated by the customer. So an open ecosystem is going to provide them the choice, and it evolves to a better space over time. And then there's the long-term sustainability of an open architecture. It has been proven time and time again that an open architecture reaches what you would call a more optimized space much faster than anything else. I honestly think we're a living, breathing example of the power of OCP's ethos of the open standard. So I don't know if everyone knows, but Cornelis Networks originally
Starting point is 00:11:55 was built on top of a proprietary network architecture that has phenomenal performance. But it is proprietary. And we have shifted our company strategy. So now we're using the advantages of that architecture, but we are building Ethernet and Ultra Ethernet compatible and compliant products that are interoperable. So our SuperNICs will interoperate with another Ethernet switch, and vice versa. And so we've recognized that these problems that we're facing today are way too big for anyone to go it alone. And I do mean anyone. Right. No matter how big or powerful the company. So it is taking the whole industry. It is taking the whole ecosystem. And we made a fundamental choice: we are going to be a part of it. And we're going to use all our learnings from our base architecture
Starting point is 00:12:39 in all those areas I talked about, the message rate, the latency, all of that, to then accelerate the progress that happens through standards. So that's been a really exciting journey for us as a company, and to be able to start bringing out these Ethernet, Ultra Ethernet ready products is great. While we're in plug mode, let me also do one just to help here. What Lisa said makes a lot of sense. And in that vein, if you really look at AMD, with its promotion of UALink, or whether it be UEC, the Ultra Ethernet Consortium, and all of these standards around which we need to build these really complex high-performance data centers collectively, through the power of the innovation that happens across all of the brilliant people around the world, I think that is what is going to make
Starting point is 00:13:27 us all successful. Yep. That's fantastic. And I think that what's interesting is how standards are being redefined in terms of how quickly standards have to come together, and what scope of standard needs to be defined for different applications. It's fascinating to see, having been in this space for a couple of decades. One question that I have for you: we're talking about enterprise, and, you know,
Starting point is 00:13:51 I can just imagine being a CTO trying to navigate this space right now. I'm sure that there is a tremendous amount coming at them in terms of what they need to do: priorities from inside the company, infrastructure providers talking to them about what they need to buy. What is the one principle that you would suggest, and I'm going to ask both of you to give your perspectives on this, that would actually cut through the noise and guide them to a successful journey of adoption and proliferation of AI? Okay, that's a tall order: one piece of advice that's going to cut through all the noise.
Starting point is 00:14:23 Yeah. Okay, go for it. I have high expectations of both of you. Okay, man, well, I hope I don't disappoint, because I was maybe going to go for one bar lower than perfection. But I was thinking about it, you know, from their perspective. So aside from being the CEO at Cornelis, I actually spent five years in IT infrastructure, literally providing infrastructure to a 100,000-person company, and I had a lot of learnings from that. But my biggest thing is you need to be acting fast to support your business, whatever your business units need, because if you don't, they will start looking everywhere and causing chaos around you. But the more specific you can be about biting off your problem, solve for that, and then move to the next use case. Ravi and I have had a lot of good conversation around this. You're not going to get anywhere with your AI deployments inside of a company if you're trying
Starting point is 00:15:11 to do everything, everywhere, all at once, just like the movie. And so it's like demonstrating massive amounts of success in an area and then moving to the next and having that rolling thunder. I think that's the only way to do it. And I say that both as having a history of owning and providing the infrastructure, and I also say it from the perspective of inside our company: we're incredibly high-growth users of AI ourselves to design our products, and that's how we've tackled it. The theme is largely the same, but I'll maybe say it in a different way. Number one, you want to assess your use case. What's good for X, Y, Z may not be good for you. Okay? So first, you should assess your use case. And then keep
Starting point is 00:15:54 it simple with your existing infrastructure, and see how much of your existing infrastructure actually can solve the bulk of the use case. And then look at, understand, and leverage your partnerships. If there are ecosystem partners you've been working with already, leverage those partnerships to determine the incremental amount you have to do to get your use case optimized. That's what I would say. By then you're crawling, or maybe even walking, and then you can start running shortly after. I do like that, because with this concept, you can always find a way to be more performant or cost-effective, but first prove you can do it, and don't worry; you'll refine it and make it better over time. So you both did well on that
Starting point is 00:16:36 one. So we're going to raise the bar one more time. Now, if you look ahead, we are in this massively changing landscape of how AI is being adopted and what's coming next. So what do you see as the next major inflection in this space that's going to drive infrastructure requirements even further? Lisa, I'm going to ask you to go first. Okay. See, again, you do these high-pressure questions of, like, the next singular most. Okay. So I think that we have so much more to go still on this adoption curve. To us, we're in the industry; we're, like, living and breathing it. We're with 11,000 other people that are doing the same thing here at OCP. So we think this is the whole world. I would like to inform all of us: it is not. There is so much more to be done as far as adoption, use cases, you know, all of that. So I've talked about it before: I think we're coming upon the rise of the enterprise, and actually finding high-productivity use cases
Starting point is 00:17:37 that deliver fundamentally different business outcomes for people. And that's where I see it going next. I think the frontier model stuff will continue to astound us all as to what's being built there. But the actual, fundamental finding of better human outcomes is still yet to come. My take is essentially that what we've seen so far, I think, is experimentation. So I think scale for productizing AI is going to be super
Starting point is 00:18:12 critical, because everybody has experimented, building this, building that, and then seeing what they can actually do. I can train a bunch of models; now you need to go productize AI. And to productize AI, it is no longer just going to be innovation in compute infrastructure. It's going to be innovation also in data movement infrastructure. When I say data movement infrastructure: it's no longer just about, hey, the best compute unit is a GPU; it's how do you get the data in to feed it, and get it out?
Starting point is 00:18:37 You need innovation, AI innovation, in that space. And then it's going to diversify into the edge; not everything is going to be centralized, and there are going to be different means of doing this. So if you want to look at it: productizing AI with innovation not just in compute, but also in data movement, and also diversified compute towards the edge, closer to the data. This interview did not disappoint.
Starting point is 00:19:03 Not surprising, given the all-star panel I've got next to me. Thank you both sincerely for spending some time with Tech Arena today. One final question: where can folks engage to find out more about the solutions that you're offering in this space and to engage your teams? Well, for us, you can visit us at cornelisnetworks.com and find us on LinkedIn. And you can find me on LinkedIn, too.
Starting point is 00:19:27 You always know you're going to reach us at AMD.com, and you can certainly find me on LinkedIn as well. Thanks so much, Lisa and Ravi. It was a pleasure. All right. Thanks, Allyson. Thank you. Thanks for joining Tech Arena.
Starting point is 00:19:39 Subscribe and engage at our website, Techorina.AI. All content is copyright by Tech Arena.
