In The Arena by TechArena - Cornelis Networks & AMD on Scaling AI Without the Chaos
Episode Date: January 20, 2026
From CPU orchestration to scaling efficiency in networks, leaders reveal how to assess your use case, leverage existing infrastructure, and productize AI instead of just experimenting.
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome in the arena. My name's Allyson Klein, and we are coming to you from OCP Summit in San Jose, California.
I have two amazing guests with me. The first one is Lisa Spelman, CEO of Cornelis Networks.
And the second is Ravi Kuppuswamy, back again on Tech Arena.
Thank you.
Senior Vice President and GM of Server Product and Engineering at AMD.
Welcome to the program to both of you.
Thank you.
Thank you.
Yeah, good to be here.
Yeah.
Now, I want to just start with a question about AI infrastructure.
There's often what Lisa has called, on previous episodes of Tech Arena,
a brute force approach to building out and maximizing GPUs.
From your perspectives, what's the most complete story that enterprises need
to understand about building AI infrastructure that delivers results?
I'll go first.
Yeah, let me start by saying that thank you for having me over.
Oh, yeah, for sure.
OK, so yes, brute force is clearly not the right way to go.
Most things here need the right type of compute and the right size of compute.
What do I mean by that?
Whether it be CPUs, GPUs, or accelerators, you shouldn't have mostly one
and little of the other; you need to have the right mix
and the right size, okay?
That's number one.
Number two is, how do you make sure
there's end-to-end synchronization,
meaning it's not just about compute,
it's also about memory bandwidth,
it's about network sizing.
It is about how you have all of this meshing together
in a way in which it's truly giving
the output that you desire.
And then of course, above all,
TCO rules all, okay?
And in any measure, you can have what you want to have, but your use case needs to have the right mix and the right type so that you are able to get the best TCO.
I think that the brute force comment often applies to, think of, like, a hyperscaler or a neocloud, the pursuit of the frontier model.
And when you think about an enterprise, you're delivering tied to a specific business case, and so you might be doing inference or you might be doing some fine-tuning, but you actually have an opportunity to build
your infrastructure and think through that a little bit more in an end-to-end way for your business.
So that's why I do think the brute force exists. I just think for enterprises, they actually
have a little bit more space and time to build something more attuned to their actual
requirements and what they need. And we do see this happening where people spend maybe the first
year of an AI journey, or whatever, a year, 18 months, and they might run their workloads in the
cloud for that flexibility, then they learn how their product team or their design team or their
go-to-market team or their HR team actually is using these tools. And so you can take actual
data and information about the use case and then work with partners to build the infrastructure
out. So I think you actually have a little bit more time in an enterprise to study the problem
and then be thoughtful about it. Now, we know that from the compute side, you know, as you scale,
so grows complexity. Lisa, I have a question for you on this.
How do you address that in terms of the way that you build out the network to handle that
complexity? So I do view our job as a network solution provider. Our goal is to be the simplest,
most performant portion of your stack. So we really see ourselves tying everything together.
So whatever compute solution you choose, whether you go CPU-based, GPU-based combination,
whatever it is, we want to have the solution that's ready to go and be adapted to whatever
has been chosen. And then if you look at just the core
of our architecture and what we're building, one of our foundational, fundamental
base design factors is how much scale can you add before you start seeing performance tail off?
Sure. In a lot of current clusters, I've said this before, you can start to see scaling problems
start in as little as an 8-GPU cluster. Obviously, it gets worse when you get to 100 and 1,000 and 10,000,
but it can get bad really fast. So we've put a tremendous amount of effort into the design and the
architecture to ensure that you have what we call scaling efficiency.
You add more and more compute to your problem as it gets bigger and your network just keeps up.
So we actually spend a really significant amount of time designing for this and building for that
so that we remove complexity for our customers.
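The early tail-off Lisa describes can be made concrete with a toy cost model. This is a hypothetical sketch with made-up numbers, not Cornelis data or methodology: assume each training step pays a fixed compute cost plus a ring all-reduce whose communication cost grows with cluster size, and watch throughput per GPU fall.

```python
def step_time(n_gpus, compute_s=0.10, data_gb=1.0, link_gbps=50.0, latency_s=5e-6):
    """Seconds per training step: compute plus a ring all-reduce (assumed numbers)."""
    if n_gpus == 1:
        return compute_s
    # A ring all-reduce moves 2*(N-1)/N of the gradient data over each link...
    xfer_s = 2 * (n_gpus - 1) / n_gpus * data_gb * 8 / link_gbps
    # ...and pays per-hop latency for each of its 2*(N-1) steps.
    lat_s = 2 * (n_gpus - 1) * latency_s
    return compute_s + xfer_s + lat_s

def scaling_efficiency(n_gpus):
    """Throughput per GPU relative to a single GPU (1.0 = perfect scaling)."""
    return step_time(1) / step_time(n_gpus)

for n in (1, 8, 128, 1024):
    print(f"{n:5d} GPUs: efficiency {scaling_efficiency(n):.2f}")
```

In this model, efficiency already drops well below 1.0 at 8 GPUs and keeps sliding as the cluster grows; a faster, lower-latency network raises the whole curve, which is the "scaling efficiency" design target.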
Ravi, let's dive into compute.
You know, scaling with compute is not easy.
Tell me how you're tackling that at AMD.
You know, that's great.
If you really look at it, the CPU has by far
been the most important compute circuit for a long time, until recent times.
So what has happened essentially is it's taken on this role in which it does this end-to-end
synchronization. What I mean by that is it looks at every aspect of the platform and determines
essentially what aspects need to be operated where. And most of the time, in recent times,
we've got accelerators and GPUs now, but most of the time, it's taken
upon itself to go ahead and be the arbiter, the solver of overall problems that happen
in the compute space.
Now, because of that, what has happened is there's been a software ecosystem,
created on the x86 foundation, that essentially has helped it do this in a more
efficient manner.
There are lots and lots of applications that have been developed on top of it, so you don't
have to go in and rewrite them; you could make minor changes, reach multiple millions of users everywhere,
and they'd be able to get the scaling that somebody desires.
And then above all, over time, it has also hosted the ability to go ahead and have a secure
environment.
It has all of the control functions and the logic in it that help us scale this
everywhere, so that we are operating not just from a functional standpoint or a performance
standpoint; we are also operating from a secure standpoint.
It's interesting that you're introducing the CPU as the orchestrator, and one of the things
that I think about is the added complexity of increased heterogeneity, if I can say that
word: different types of accelerators are being deployed, different types of workloads
in different parts of the data center.
How does the CPU tackle that?
So if you really look at it, we talked about this conversation on brute force,
and people think when you need to do a certain acceleration job, oh my God, I need to get myself
a ton of GPUs. Not really. Okay. What you really need is the right mix. Yeah. Okay. What I mean by that
is, we started by talking about, hey, if you have a certain use case or a workload,
what is the right mix of compute? What is the associated network that you need to have for it?
What do you need to have with respect to memory bandwidth and your access to resources that can
actually get it? And when you put all of this together,
You need the CPU to play an orchestration job, meaning like a traffic cop,
that essentially is going to go ahead and say,
hey, this is fine-tuning, this is coordination activity, I'll take care of it.
This is deep math.
I'm going to send this to the GPU, and it does that really, really well.
Or I'm going to send it to some accelerator.
It does that really, really well.
And oh, by the way, I'm just finding out that I have memory close by that I can utilize
to do this, I'll call it, even inferencing jobs,
quicker than sending this over to the GPU and then getting the data back and then reporting
what the answer is going to be. That to me, if you get this right balance together, the CPU
can really operate in a way in which you can scale systems. I think you're talking about
much more of a real-world environment versus how we might set things up in a benchmark. Yeah,
yeah, you can always prove that something's faster than something else, but you're talking about
this real world where you might not take the time, you might not take the power to go do these
other things. Or there are so many environments, especially,
again, on the enterprise side, where instantaneous response is not always the requirement,
but fitting within the power envelope of the data center infrastructure that I have and within
my opex constraints is a top concern. So you can do a lot of balancing there across your whole
compute storage network system. Now, Lisa, I know that you know compute well from your history,
but when you look at the network, how does the network assist with the compute challenge here?
Yeah. So when we think of network architecture, we think of it as servicing workloads with bandwidth, very important, the network bandwidth, the message rate, the latencies, the overlap of collectives and communications.
And how does that all come together to service the specific workload? And the unique thing and fun thing about the enterprise environment is that you can be running AI in a variety of areas through your enterprise.
and each one of those is pulling on it a little bit differently.
And so there are certain parts of high performance computing or AI workloads
that are very latency sensitive or very message rate sensitive.
And others, the thing they care the most about is how fast you can get to half bandwidth.
When you put together a network architecture that can handle all of those,
our capability allows the application to pull what it needs most.
Right.
And so that's, I think, again, an offer of balance that we provide.
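The "how fast you can get to half bandwidth" idea can be sketched with the classic linear cost model; this is an illustration with assumed numbers (a 400 Gb/s link, 2 microsecond latency), not Cornelis product specs. A message of s bytes costs latency plus s over peak bandwidth, and n_half is the message size at which achieved bandwidth reaches half of peak.

```python
PEAK_BPS = 400e9 / 8   # assumed 400 Gb/s link, in bytes/s
LATENCY_S = 2e-6       # assumed 2 microsecond end-to-end latency

def effective_bandwidth(msg_bytes, latency_s=LATENCY_S, peak_bps=PEAK_BPS):
    """Achieved bytes/s for one message under time = latency + size/peak."""
    return msg_bytes / (latency_s + msg_bytes / peak_bps)

def n_half(latency_s=LATENCY_S, peak_bps=PEAK_BPS):
    """Message size where achieved bandwidth is half of peak.

    Solving s / (L + s/B) = B/2 gives s = L * B, the bandwidth-delay product.
    """
    return latency_s * peak_bps
```

Under this model, small messages are dominated by latency and message rate, while only messages well above n_half approach peak bandwidth, which is why different workloads "pull" on different network properties.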
And if you look at our hiring, I mean, yeah, we're a growing startup, so we're hiring for lots of different things.
But one of the biggest things is this solution architecture space.
So we completely recognize that solution architects that are focused in on the enterprise use case and workloads are not the same people that are going and working on a frontier model problem for a hyperscale customer.
And so we've actually separated that out because the requirements are very different.
Even if you're using the same compute, same storage, same network, you're putting it together very differently.
Now, I know that we're at OCP Summit, and the OCP organization put out an open letter saying that we need to move faster on open industry innovation.
I was like, wow.
You need to move slower.
Yeah, exactly.
It's like, we're sprinting, but okay, bring it on.
Now, both of your companies in all seriousness are very committed to open industry engagement.
Tell me what that looks like in this age.
Robbie, do you want to take it first?
Sure.
Let's just start by saying open means innovation acceleration.
There's just more people working on it, and more people are going to be innovating and innovating together.
Okay, so clearly openness contributes directly to that.
Then I think, secondly, it provides customer choice.
At the end of the day, it doesn't matter how good a compute provider, an accelerator provider, or a network provider may be; what wins
is going to be dictated by the customer as to what choice they want to make.
So an open ecosystem is going to provide them the choice,
and it evolves to a better space over time.
And then there's the long-term sustainability of an open architecture.
It has proven time and time again that an open architecture reaches a place
which gets you to what you would call a more optimized space much faster than anything else.
I honestly think we're a living, breathing example of the power of OCP's
ethos of the open standard. So I don't know if everyone knows, but Cornelis Networks originally
was built on top of a proprietary network architecture that has phenomenal performance. But it is
proprietary. And we have shifted our company strategy. So now we're using the advantages of that
architecture, but we are building Ethernet and Ultra Ethernet compatible and compliant
products that are interoperable. So our SuperNICs will interoperate with another Ethernet switch. And
vice versa. And so we've recognized that these problems that we're facing today are way too big for
anyone to go it alone. I do mean anyone. Right. No matter how big or powerful the company. So it is
taking the whole industry. It is taking the whole ecosystem. And we made a fundamental choice.
We are going to be a part of it. And we're going to use all our learnings from our base architecture
in all those areas. I talked about the message rate, the latency, all of that, to then accelerate
the progress that happens through standards. So that's been a really exciting journey for us as a
company and to be able to start bringing out these Ethernet, Ultra Ethernet Ready products is great.
While we're in plug mode, let me also do one just to help here. What Lisa said makes a lot of sense.
And in that vein, if you really look at AMD with its promotion of UALink, or whether it be UEC, the
Ultra Ethernet Consortium, and all of these standards around which we need to build these really
complex high-performance data centers collectively through the power of the innovation that
happens across all of the brilliant people around the world, I think is what is going to make
us all successful.
Yep.
That's fantastic.
And I think that what's interesting is how standards are being redefined in terms of how
quickly standards have to come together.
What scope of standard needs to be defined for different applications?
It's fascinating to see having been in this space for
a couple of decades. One question that I have for you, we're talking about enterprise, and, you know,
I can just imagine being a CTO trying to navigate this space right now. I'm sure that there is a
tremendous amount coming to them in terms of what they need to do, priorities from inside the
company, infrastructure providers talking to them about what they need to buy. What is the one
principle that you would suggest? And I'm going to ask both of you to give your perspectives on this
that would actually cut through the noise and guide them to a successful journey of adoption and
proliferation of AI.
Okay, that's a tall order, one piece of advice.
It's going to cut through all the noise.
Yeah.
And, okay, go for it.
I have high expectations of both of you.
Okay, man, well, I hope I don't disappoint because I was maybe going to go for one bar lower
than perfection, but I was thinking about it, you know, from their perspective.
So aside from being the CEO at Cornelis, I actually spent five years in IT infrastructure.
So literally providing infrastructure to a 100,000-person company.
And I had a lot of learnings from that. But my biggest thing is you need to be acting fast and quickly to support your business, whatever your business unit is, your business infrastructure, because if you don't, they will start looking everywhere and causing chaos around you. But the more specific you can be about biting off your problem, solve for that and then move to the next use case. And Ravi and I have had a lot of good conversation around this. You're not going to get anywhere with your AI deployments inside of a company if you're trying
to do the entire everything, everywhere, all at once, just like the movie. And so it's like
demonstrating massive amounts of success in an area and then moving to the next and having that
rolling thunder. I think that's the only way to do it. And I say that both as having a history
of owning and providing the infrastructure. And I also say it from the perspective of inside our
company: we're incredibly high-growth users of AI ourselves to design our products. And that's how
we've tackled it. The theme is largely the same. But I'll
maybe say it in a different way. Number one, you want to assess your use case. What's good for
X, Y, Z may not be good for you. Okay? So first, you should assess your use case. And then keep
it simple with your existing infrastructure and see how much of your existing infrastructure
actually can solve a bulk of the use case. And then look and understand and leverage your
partnership. If there are ecosystem partners you've been working with already,
leverage your partnerships to determine the incremental amount you have to do to go optimize your
use case. That way, you're crawling, or maybe even walking, by
that time, and then you can start running shortly. I do like that, because with this concept, you can
always find a way to be more performant or cost-effective, but first prove you can do it, don't
worry about it, and you'll refine it and make it better over time. So you both did well on that
one. So we're going to raise the bar one more time. Now, if you look ahead,
We are in this massively changing landscape of how AI is being adopted and what's coming next.
So what do you see as the next major inflection in this space that's going to drive infrastructure requirements even further?
Lisa, I'm going to ask you to go first.
Okay.
See, again, you do these high pressure questions of like the next singular most.
Okay.
So I think that we have so much more to go still on this adoption curve. So to us, we're in the industry. We're living and breathing it. We're with 11,000 other people that are doing the same thing here at OCP. So we think this is the whole world. I would like to inform all of us: it is not. There is so much more to be done as far as adoption, use cases, you know, all of that. So I've talked about it before. I think we're coming upon the rise of the enterprise and actually finding high-productivity use cases
that deliver fundamentally different business outcomes for people.
And that's where I see it going next.
I think the frontier model stuff will continue to astound us all
as what's being built there.
But the actual fundamental finding better human outcomes is still yet to come.
My take is essentially that what we've seen so far, I think, is experimentation.
So I think scale for productizing AI is going to be super
critical, because everybody has experimented by building this, building that, and then seeing
what they can actually do.
I can train a bunch of models.
Now you need to go productize AI.
Now, to productize AI, it is no longer just going to be innovation in compute infrastructure.
It's going to be innovation also in data movement infrastructure.
When I say data movement infrastructure, it's no longer about, hey, the best compute unit
is a GPU.
How do you get the data in to feed it, and get it out?
You need innovation, AI innovation in that space.
And then it's going to diversify into the edge,
and not everything is going to be centralized,
and there are going to be different means of doing this.
So if you want to look at it: productizing AI, with innovation
not just in compute, but also in data movement,
and also diversified compute towards the edge, closer to the data.
This interview did not disappoint.
Not surprising given the all-star panel I've got next to me.
Thank you both sincerely for spending some time with Tech Arena today.
One final question.
Where can folks engage to find out more about the solutions that you're offering in this space
and to engage your teams?
Well, for us, you can visit us at cornelisnetworks.com and find us on LinkedIn.
And you can find me on LinkedIn, too.
You always know you're going to reach us at AMD.com.
And you can certainly find me on LinkedIn as well.
Thanks so much, Lisa and Ravi.
It was a pleasure.
All right.
Thanks, Allison.
Thank you.
Thanks for joining Tech Arena.
Subscribe and engage at our website,
techarena.ai.
All content is copyright by Tech Arena.
