In The Arena by TechArena - Cornelis and OCP Examine Why AI Infrastructure Must Evolve
Episode Date: January 16, 2026
From the OCP Global Summit, hear why 50% GPU utilization is a “civilization-level” problem, and why open standards are key to unlocking underutilized compute capacity....
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome in the arena. My name's Allyson Klein. We are coming to you from the OCP Summit in San Jose,
and I am so delighted because we are with Lisa Spelman, CEO of Cornelis Networks, and Zane Ball, CTO of OCP,
for a fun and delightful conversation about everything around AI computing.
Welcome to the program, guys.
Thank you.
Yeah, thanks for having us.
OCP Summit is all about the future of AI.
When you think about leading versus following in AI infrastructure,
what does this distinction mean to you?
Maybe I'll go first on this one.
But in the world of AI, I feel like, gosh, I hate to say that there almost is no follow.
You have to be leading, you have to be foot forward.
and you have to be solving not just today's pressing problem,
but tomorrow's pressing problem.
The hardest part, I think, that we're all facing
is seeing that future in a world that moves so fast.
Anyone can think of two years ago,
what we were talking about a year ago, six months ago,
and that pace just keeps accelerating.
So for those of us working on long-term projects,
Silicon Innovation to support AI,
it really is a big thinking challenge
to stay on that front foot and be leading.
I think the follow is hard.
From my perspective, it feels like up until now,
in the infrastructure world, AI has been kind of inflicted on everyone.
Yeah, exactly.
I was talking in the keynote yesterday about how, you know, we've run out of words,
because for years, people have come to these conferences and talked about unprecedented, exponential.
And then this is like a hundred times bigger than what we were describing.
So, yeah, you've been saying unprecedented for years.
But actually, no, it's really unprecedented.
And it's like civilizational infrastructure building out.
And it's the degree of it: it's really hard to put a supercomputer on the ground every week at some of these companies.
And leading on AI means getting on top of this for a while.
It's like, you know, especially at the facility and data center level, we've got to change our minds.
No, we've got to think very differently about how this is done.
And instead of running around with our hair on fire, it's time to get really proactive and lead on these things.
and do it in collaboration.
Yeah, I like what Zane said,
but I think there's so much pressure
when you're an operator in these roles.
So separating out from the hair on fire,
which has led to amazing advancements,
even, like I said, over the last six months
in what we're able to do,
and finding the time and the brain power
to step back from that
and think about it more systemically
is a real challenge
because the hair on fire doesn't stop.
And so, yeah,
you're trying to create space
for that type of thinking
inside of an incredibly constrained talent world as well.
So it's a big civilization challenge.
I wanted to ask you this question because I know that you worked very closely at a previous company,
building foundational technology with the industry and delivering this broad proliferation
of data center capability.
When you look at this moment, what are we doing right and what do we still need to work on in terms of
showing up for this moment in civilization, if you will?
The way I can think about it, like when we go through big technological changes,
they're almost always built on top of the last one.
Like, you know, we didn't just get automobiles.
They would have never been possible without railroads.
But the ecosystem had to evolve very significantly to do that,
and you wouldn't have had the ability to build roads or finance things
had you not gone through that one, and you couldn't have done mobile had you not done
PCs. And I don't think you could do AI if you hadn't built out cloud infrastructure before.
But that's not to say that it's at all the same. AI infrastructure is really quite different
than cloud infrastructure. And so while we build on all the smart things we did in the past,
we have to solve a bunch of new problems. But we wouldn't even be where we are today if we
weren't using the supply chain and the ecosystem that has already been built and the problems
that have already been solved. We're doing things right
today that we did right before, because we're still doing a lot of those same things.
We just have a bunch of new problems that we've got to solve on top of it.
I reflect back on our time working together to build out that cloud infrastructure.
And the pace felt immensely challenging.
It pushed us.
It pushed our teams every single day to stay on top of it.
And it pales in comparison to what we're doing today.
I remember working on design cycles for silicon, trying to get them within
five years from concept through to production. And then I look at Cornelis now: much smaller
team, much more focused in a singular area of networking. And our design cycles are 12 to 18 months.
And so it's a whole pace change that goes along with it. But I do have to agree with what Zane
said about the foundations of each prior experience being what give you the launching pad for the
next one, even if the technological problem of a scale-out, respond-to-anything model versus a highly
parallel computing challenge is different architecturally. There's a lot of foundational work in there.
Now, you mentioned that you're working in networking. I think that one of the most interesting
things about AI compute is that it really takes the concept of balanced computing and dialed in
performance equilibrium between compute, network, and storage to the next level. Why is the network becoming so
critical at this point? And how do you see the industry working on that together to deliver
core capabilities within the network? Yeah, if I'm being fair, when I was really focused on
compute, I really said compute was the center of the system. And now that I spend all my days
in networking, I say networking's the center of the system. So that's either a me problem or that's
actually really a part of what's happening. But I just think that in the world we're living in,
compute advancements have had a tremendous amount of attention and resourcing and dollars
and innovation put towards them and have shown just massive scale improvements, like X factors of
improvements in order to keep up with the challenge. But networking, a lot of it's still running
on architectures that were built earlier. They're super proficient, but were designed for totally different
problems. And so this parallel computing opens up a new set of requirements
that we haven't fully adapted to.
And that's why you see all these systems
with 50% utilization of the compute.
And for me, when you step back from that,
it's a business challenge, it's an ecosystem challenge.
But actually, it is a fundamental civilization challenge
to build again on what Zane said,
because we simply can't afford from a power perspective,
from an infrastructure perspective,
from a supply chain perspective, from a minerals perspective,
we can't afford to leave that much
capacity underutilized, not helping solve these problems. So the thing that gets me excited
every day is working with this ecosystem to actually get after it and solve that and drive
that utilization of compute forward. And I see the network as one of the biggest unlockers
of that potential. I think the physics is on the side of the comms network becoming the center
of it. So it's not just me. Okay. Cool. Cool. It's like, how much power is in the network versus in
the compute, right? I mean, it's enormous, of course.
Exactly. People throw out like 30%. Yeah, it's a lot. But if you look at the very nature of the
transformer model and how you train and how you do inference, you have multiple networking problems
you have to solve. You have a scale-up network where you have to get all these GPUs acting
like they are one machine, fully coherent. That's a very difficult challenge, but it's also
different than a scale out problem where we have to still maintain some level of coherence,
but in a more relaxed latency domain. And, you know, it's like, okay, that's
getting to be a bigger and bigger set of challenges.
And then now we realize, oh, now I need to think about
the data hall and multi-hall networking.
And now I'm actually looking at scale-across
so that I can marshal global compute resources
to solve one problem and make that problem incredibly large.
And those connections, I think,
are a little bit like neural connections, like, you know, in your brain;
your brain is a connectivity problem.
And AI is fundamentally a connectivity problem.
If you can do the networking better,
you're going to save tons of power.
You're going to get more interesting AI possibilities.
So it's just a very natural place for the center of gravity of the industry to be, right?
I think so many people don't realize how small a scale
the challenge of scale actually shows up at in AI systems and in networking.
And you hear in the news about people putting together 500,000-GPU systems,
building to a million GPU systems.
But the real problems at scale literally start once you put eight together.
And it only gets worse from there.
It's shocking how small it starts at and how much opportunity there is to improve and the complexity of scale up as well.
So, yeah, it's an exciting problem to be a part of.
And one of the things I talk about, too, not just with my team at Cornelis but with all of us
in this domain and space and area we play in: like, you know your work matters.
You have that significance of we can really make a difference and impact here by improving the scalability of these AI systems.
I know that you guys understand market segmentation really well.
You understand the difference between enterprise and hyperscale intimately.
Enter the neocloud, providing new challenges in terms of the ways they operate
and the ways they want to engage with tech vendors.
How do you look, from your perspectives,
at serving these different deployment models, these different types of customers,
and the broad proliferation of AI adoption that's coming in the enterprise?
as well. And what are the key points that you think about in terms of delivering infrastructure
capabilities that address those markets? I think the neoclouds, and it's an interesting term because
what is OpenAI? Is that really a neocloud? Is that a hyperscaler now? Or what is that definition
anymore? But it's interesting having been in the industry for as long as we have, that no matter
how big the incumbents are, there's always a challenger coming with a slight take, a pivot
off of both the business model as well as the underlying infrastructure model. And I think that's
what we've seen with the neoclouds. Like what happens when you are building infrastructure first
and foremost and solely for this problem, for this application, not saying, oh, I have this
massive book of business and how do I also do this? Right. Right. There's a focus thing there that is
interesting because you're not trying to bolt a solution onto a bunch of existing infrastructure
and then only partially optimize it. It's very focused, and we see that a lot. The enterprise is
a bit different in that what we're seeing a lot with our customer base is a lot of folks that have
HPC applications in their business model. So whether that's oil and gas or automotive and drug
discovery, all those types of things, those are the ones that are moving most quickly towards
also having AI integrated into how they deliver their business.
And I think some of it, yeah, comes from their history of using compute as a fundamental
resource to drive their business. And so they have that mindset built in. And that's been an
interesting transformation as a lot of those companies are choosing to actually drive that
portion of their business in an on-prem or colo model because of the power
of the data.
I think from an OCP perspective,
it's a community.
We have lots of members of the community come together to work out solutions
in a lot of different segments, including enterprise,
which surprises people sometimes.
At an engineering level, I think what's being deployed in neoclouds
and what's being deployed in hyperscale increasingly look very similar.
And as we saw with the open letter call to action yesterday,
it's like there's a real desire out there to standardize elements of these really large-scale,
liquid-cooled, DC-voltage-distribution, massive systems.
I don't think you're going to see lots of those systems in some small enterprise data
center, right?
That's like a completely different thing.
And it doesn't mean there isn't going to be AI there because AI is just like a very big
thing.
It sort of divides into the frontier model world and other AI applications.
I think there's still a lot of ink to be written on what that tail of applications is and
what role smaller models play, what kind of inference solutions are going to be out there closer to
deployment and lower latencies to end users and those kinds of things. And I think we'll see a lot more
experimentation. I feel like enterprises are moving from a lot of experimentation more into like,
no, I'm really getting business value in the next 12 months. So I think we will see that segmentation
come into a little bit more clarity. But in the short term, you're still going to
read about massive gigawatt data centers. And it could be built by a neocloud. It could be built by
Oracle, it could be built by Google, but you may see people deploying their infrastructure
in all kinds of different modes.
We're already seeing that, obviously, and I think we'll see more and more of that, because
finding the power is the thing and finding a place where you can deploy and scale up or scale
down your capacity.
People are going to want lots of options for doing that.
And people, I don't know if they fully realize how much of the hyperscale and neocloud
is a shared business model.
I mean, the neoclouds are being used to provide excess capacity
and capabilities to try out new infrastructure styles.
I do think as an industry,
we are relying a lot more on standards than ever before.
OCP, the Ultra Ethernet Consortium, UALink, things like that,
coming together.
It's like these problems are too big to go out alone.
And so bringing people towards standards
and much more open, open source, open methodologies
will allow for faster solving and scaling of these problems.
So I think that trend will continue.
You know, it's interesting. I was preparing for our interviews this week, and I was thinking about
what was OCP Summit like last year versus this year? And there were seeds of exciting things
happening in the industry. I recently had a conversation with you, Lisa, where you said, we're doing
this with brute force. We need to do it more elegantly. That line has stuck with me. You mentioned
the open letter of needing to move faster. So as we look forward, and there's this strong imperative
to move faster as an industry, to work together,
to utilize the things that we've started
to actually drive broad proliferation of technology.
What do you think we're going to be talking about in 26?
I've called this wrong pretty much every year.
And he's the CTO.
I know what we're not.
It does feel like every year has a little zeitgeist to it, right?
A lot of it, like the open letter,
feels like that's kind of the zeitgeist
of this show.
I feel like we're going to decisively pivot
from a training-dominated ecosystem
conversation to an inference dominated ecosystem.
And I think that's going to create new products,
new opportunities, new standards that need to be developed.
And we're going to see a lot more conversation about that, because that's going to
become the first-order problem.
You already see people with workload splitting and some of the cool
technical advances where people are really architecting for that already.
And I think that's going to move center stage.
And then like I mentioned in my last comment, I think we're really going to see a lot
more enterprise use cases and people really making money with AI.
That's going to go hand in hand with that inference,
and that may surface new problems and opportunities.
Hopefully, we're celebrating that building out this extraordinary infrastructure
has gotten so much easier because of the collaboration across the different companies
that everybody seems so eager to do.
Yeah.
He gave two and he was supposed to give one.
So I'll share the one I have in mind, which I think in a year we are going to be talking
about: the rise of the enterprise.
And that's an area where I see so many advancements.
And I look at, again, just even at Cornelis: we're not just encouraging employees to try out new AI tools.
We've redefined the workflow of how work gets done.
And so that utilization of AI is a part of it.
And so, yeah, we're a silicon and a system and a solution vendor, but we're an enterprise ourselves.
And I see that.
And I see the way in the last even, I'll say six months, we've reshaped our workforce, our workforce expectations.
and how the actual work gets done.
And I think that we can probably move a bit faster
because we are smaller,
but the enterprise is right behind us in general.
It has to happen and it will.
I want to add a bonus question in here.
You guys have talked about the move from training to inference.
How do we look at agentic computing within that context?
And where do you think it fits in terms of changing workload dynamics?
Maybe I'm going to say something contrarian,
but actually Agentic, I see it as a continuation of this momentum that's already started
versus a complete redo where you used to have your entire workflow without AI in it,
and then you added it.
And then Agentic goes on top.
So I don't think it's as big of a step function change as no AI to AI built into everything you do.
So I see it as additive.
Like, we're already seeing adopters taking steps there.
Okay.
Yeah.
So much possibility,
but I think it's more of an evolution than a revolution.
I tend to agree, but I think one implication that it's going to have,
and I don't know if this is really going to be true or not,
but in my gut, I think that we've maybe neglected a little bit
the front side of the machine, the front-end network, the CPU, you know,
as we've engineered this incredible high-bandwidth fabric around the backside,
the GPU side of the machine.
You know, we have things like RAG, and we have all kinds of interaction
between that machine and traditional databases.
And I think as the agentic world builds out,
there's going to be more bandwidth on the front end.
Security is going to be really challenged.
And I think we're going to have to re-engineer the front end of the machine.
And I don't know, just a faster NIC
feels to me like an insufficient response to the moment.
I think we'll rethink the architecture of the front end of the machine quite a bit.
And probably smart people are working on that.
But I haven't seen a lot of discussion in the open yet.
But I think it's going to demand more of that.
And it's going to create interesting opportunities for that side of the
equation, in that part of the ecosystem, which means some players that maybe haven't been in that
LLM training ecosystem are going to get some opportunities.
Yep.
The demand profile is changing.
So, yeah, it'll be interesting.
So I can't wait to start our interview next year with a summary of these prognostications.
Only if we were right.
If I score too badly, I might not be able to.
So one final question for both of you.
I'm sure that people online want to engage with you and learn more about what you're doing
and how they can get involved in things that you're doing.
How can they engage with you and learn more?
Just ping me on LinkedIn.
Okay.
And for me, find us at cornelisnetworks.com or also on LinkedIn, both the company and myself;
we welcome the conversation on building the future of AI.
Well, Lisa and Zane, it's been a pleasure.
Thanks so much.
All right.
Thank you.
Thanks for joining Tech Arena.
Subscribe and engage at our website, TechArena.ai.
All content is copyright by TechArena.
