In The Arena by TechArena - Roy Chua of AvidThink on AI, Networking, and Sustainability at OCP

Episode Date: October 18, 2024

Allyson Klein and Roy Chua, founder & principal of AvidThink, explore AI-driven networking, sustainability challenges, energy efficiency, and more in this insightful episode, recorded at OCP Summit 2024.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the arena. My name is Allyson Klein, and we're coming to you from OCP Summit in San Jose, California. I'm delighted to be joined by Roy Chua, founder of AvidThink. Welcome to the program, Roy.
Starting point is 00:00:35 Thank you, Allyson. Always a pleasure. So we haven't talked in a while. I think the last time we were on a podcast was for MWC. Is that about right? That feels about right. Yeah, a couple of months ago. Actually, many months ago. So there's a lot of dense technology to unpack from what's happened since then. One of the things that I'm seeing is a ton of network innovation happening in the data
Starting point is 00:00:54 center space. Why is that so important right now? Yeah, and I think fundamentally network innovation in data centers specifically are usually driven by the workloads and the size of those workloads. And it's no surprise, the system called AI and machine learning, it's not that it wasn't there before. We had HPC, we had big data analytics, but I think the scale is different and I think the hype is different. It's even more hype than it used to be previously. And so I think given the size of the workloads, given that we're having data sets that are 100 times, 1,000 times, 100,000 times what we had to deal with previously, and the fact that it's not just a capacity and throughput, then latency actually counts.
Starting point is 00:01:35 Because in your training, your pre-training or inference, that low latency is useful and helpful and valuable. And so I think that's what's driving the new wave of innovation within the data center. I think that's part of it, the workload itself. And along with that workload, we have this new focus on energy efficiency. And so whether it's AI or even the cloud workloads previously, there was a desire for more energy-efficient capabilities within the data center. So that's no surprise. The other element that we did see as well is that some of the cooling needs in the data center, AI and otherwise,
Starting point is 00:02:09 are driving some of the racks further apart. And so you need high throughput, low latency networking over a slightly longer distance. So not within the rack itself, but inter-rack. And so that also drives that network innovation. And finally, all the scalability and all these other things that typically we see.
Starting point is 00:02:26 So I think that's part of what we're seeing in the data center. Now, when you look at scale-up and scale-out fabric control in the hyperscale space, there's been a tremendous move to industry standards, first with Ultra Ethernet and now with UALink. What do you make of this? And are these technologies going to actually give NVIDIA a run for its money? The interesting thing is, I think, from an NVIDIA standpoint, they're basically hedging both sides, right? We saw NVIDIA, I wouldn't say secretly, not so secretly, right? They joined the Ultra Ethernet Consortium
Starting point is 00:02:56 some time ago. And from their perspective, I expect them to have product lines across both sides, right? And certainly they make premiums from InfiniBand, for instance. And so from an NVIDIA standpoint, I think you'll see an ongoing investment in proprietary inter-GPU connect technologies like NVLink. So I don't think that's going to go away. Very proprietary, very specialized, super high throughput. I think it was 1.8 terabytes per second. Right.
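For readers keeping score, the rack-scale number that comes up next follows from simple arithmetic. A back-of-envelope sketch: only the 1.8 TB/s per-GPU figure and the 72-GPU rack scale come from the conversation; the 800G Ethernet comparison is added here for context.

```python
# Back-of-envelope check of the NVLink numbers quoted in the conversation.
per_gpu_tbps = 1.8    # NVLink bandwidth per GPU, TB/s (figure quoted in the episode)
gpus_per_rack = 72    # GPUs in a rack-scale system (figure quoted in the episode)

aggregate_tbps = per_gpu_tbps * gpus_per_rack
print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.1f} TB/s")  # ~129.6, i.e. "a 130 TB/s switch"

# For context: one 800 Gb/s Ethernet port expressed in TB/s
ethernet_800g_tbps = 800 / 8 / 1000
print(f"One 800G Ethernet port: {ethernet_800g_tbps} TB/s")
```

The gap between the two printed numbers is why the in-rack interconnect stays proprietary while the inter-rack fabric is where the Ethernet standards battle plays out.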
Starting point is 00:03:20 That's crazy. Right. So yeah. Yeah, it is crazy actually. And with 72 of those, that's a 130-terabyte-per-second switch. Crazy. Anyway, so I think that's going to stay proprietary for some time. But I think outside of that, InfiniBand versus Ethernet: as everyone
Starting point is 00:03:33 says, don't bet against Ethernet, even though Ethernet doesn't look like what Ethernet used to be, because we keep redefining what Ethernet is. So it can be anything you want. I think standards-based, in terms of inter-server, inter-rack, will probably end up winning. And I think NVIDIA is just going to play both sides and extract value as long as they can. I think that's the reality that I see. But I think eventually standards-based will win. And people like Meta are driving that forward.
Starting point is 00:03:59 Some of the other folks out there building very large-scale GPU as a service clusters are likely going to go down that route as well. When you think about scale up and scale out, how do you differentiate those requirements and what kind of topologies are being deployed today in data centers? Within the data center, I think what initially last year was maybe not fully understood and this year you see it everywhere is within the rack itself, within the CPU complex, the GPU complex itself, or the AI accelerator, because they're not always GPUs per se, there's going to be some level of proprietary technology
Starting point is 00:04:35 like NVLink, right? And scaling up your GPU or AI acceleration cluster is going to involve some level of that. There's been CXL touted. We'll see what happens with some of those elements, but I think likely we'll see that. And then within the rack or within the training clusters, you'll see 800 gig becoming sort of dominant.
Starting point is 00:04:56 And that's the backend network, right? Expensive. And architecturally, we'll see some different topologies potentially, but a lot of it is still your standard Clos architectures, some of those things that you see typically. I don't know if we'll get that weird toroidal thing that Google has. I think there's some more esoteric ones that are coming and we'll see some of that. But I think typically 800 gig back end for the training cluster. And then the front end will be a typical 400 to 100 gig, you know, standard multi-tier Clos architecture, because they'll probably end up being co-located in some cases or adjacent
Starting point is 00:05:31 to your cloud workloads and your typical architecture. So I think we're seeing that. And this front-end, back-end network seems to be the choice of architecture that I'm seeing now on the show floor. Last year, we were seeing early parts of it. And this year, we're seeing pretty much, I think, that's becoming standard, if you look at a lot of that.
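The multi-tier Clos fabrics mentioned here can be sized with simple arithmetic. A minimal sketch, assuming a non-blocking two-tier leaf-spine design with a uniform switch radix; the port counts below are illustrative assumptions, not figures from the conversation.

```python
def size_leaf_spine(servers: int, radix: int) -> dict:
    """Size a non-blocking 2-tier Clos (leaf-spine) fabric.

    Half of each leaf's ports face servers, half face spines,
    giving 1:1 oversubscription.
    """
    down = radix // 2                      # server-facing ports per leaf
    leaves = -(-servers // down)           # ceil(servers / down)
    uplinks = leaves * (radix - down)      # total leaf-to-spine links
    spines = -(-uplinks // radix)          # ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines}

# Example: 1,024 servers on hypothetical 64-port switches
print(size_leaf_spine(1024, 64))  # {'leaves': 32, 'spines': 16}
```

The same arithmetic extends to three tiers by treating each two-tier pod as a "server" of the super-spine layer, which is roughly how the front-end and back-end networks described above scale out.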
Starting point is 00:05:50 Now, another thing that always gets talked about at OCP Summit is SONiC. Yes. And I've heard about it this week as well. What do you make of SONiC? And does that have legs to grow beyond hyperscale? Possibly. I would say that when it first came on, it was very interesting, right?
Starting point is 00:06:07 So you had Microsoft and some of the other folks, Meta, you know, driving some of that. And then there was a slight hiatus. It wasn't clear what it was going to do. This was many years back, right? And some of the more commercial providers of technology were saying, look, SONiC's a toy, right? It's not real.
Starting point is 00:06:22 It's not robust. And even some of the other open-source NOSes were pointing fingers: ah, not robust, not scalable. Fast forward a couple of years later, I think when Broadcom embraces it as the vehicle to use to test its latest chips,
Starting point is 00:06:36 when Juniper and Arista was there relatively early because it was pushed by Microsoft to do it. And when Cisco, Juniper and Arista offer Sonic loads on firmware on their switches, then it's more mainstream. And when companies like Walmart and Target and start embracing it at scale, then you're like, well, maybe it's got more likes than we've participated.
Starting point is 00:06:56 So I would say in the last 18 months, SONiC's gotten a lot more traction, not only with the large proprietary switch manufacturers, not only the Broadcom support, but you have Marvell support as well. So I think it has finally reached a tipping point where you actually have multiple startups in the space that will either support the Broadcom SONiC load or the community SONiC. And it's not just the hyperscalers anymore. I'm seeing a lot more enterprises be open to taking a white box approach, you know,
Starting point is 00:07:24 pick your favorite white box switch or pick your... Right. And then loading SONiC on it and having one of these third parties, startups even,
Starting point is 00:07:35 providing the orchestration and the management of these. And I would say that Dell, to some extent, has helped that ecosystem by anointing a couple of those startups and also with Dell Enterprise Sonic. So supporting it as well.
Starting point is 00:07:48 So I would say, I think 2025, I would see Sonic going more mainstream for data center workloads. And now there's talk of the Edge Sonic. And so whether we'll see it in the campus and remote locations remains to be seen, but it's now a force to contend with finally. Yeah, I think it's an exciting update and I think it's great for the networking space too.
Starting point is 00:08:08 I think that that's going to be great for competition, ultimately great for innovation. What other updates seem foundational to what folks will be seeking from R&D efforts to deployments as we turn to the second half of the decade? What caught your eye here? Unfortunately, energy-related and heat dissipation stuff. And that's part of me that says that's kind of cool. Well, cool, okay. It's a bad pun. But I'm seeing a lot of liquid cooling on the show floor.
Starting point is 00:08:35 And what used to be exotic is now more mainstream or seems to be more acceptable. And here, as you walk the expo floor, there's also chatter of new ways of powering data centers, right? So nuclear, small modular reactors being talked about. And so it appears that given this move towards AI ML workloads, a lot of people are realizing that one of the biggest bottlenecks is energy availability. And it seems like we are willing in the pursuit of this AI goal to pause our sustainability goals that we stated before.
Starting point is 00:09:07 I'm seeing a lot of R&D around how you design data centers, at least on the show floor here, for cooling, for power. Immersion cooling is no longer exotic. It's now being talked about. And there's multiple liquid cooling approaches, right? So you've got Coldplay, you've got immersion,
Starting point is 00:09:21 you've got all these things that are still being worked out. A lot of things that have to be standardized, right? How do you do an attachment and so on and so forth. So that's all being talked about. I expect to see a lot of some physical improvements, heat exchange improvements, what to do with all the excess heat as well. And then energy sources, right? I expect to see a lot of investment in terms of hydrogen.
Starting point is 00:09:40 There's actually data center nearby, ECL, that's doing hydrogen power. And then of course, nuclear pilots. I mean, when we reactivate Female Island, you know that clearly we have an energy shortage, right? Because we're willing to put aside all those fears in the pursuit of this AI and now go, I worry that we're maybe overdoing it. We're not thinking through the impacts. We're not. The externalities to humans are not being affected into any of these decisions, as far as I can tell. And when people are willing to push their goals from 2025 to 2030 or whatever it is, their original sustainability goals, you've got to wonder, right? Every time you look at the Gemini
Starting point is 00:10:14 ads and Google create me or Gemini create me this cool cat video, right? Every time you create a cat video, I raise the temperature of the earth by 0.00001 degrees Celsius. And none of us do that and we'll die laughing at cat videos as we burn up. Not the best outcome, I'd say. It's an interesting challenge. I did a panel on this topic earlier this week, and I published my Compete Sustainability Report for 2024. And it's an interesting moment because even just in the last two weeks, we saw hurricanes ripping through the East Coast
Starting point is 00:10:45 and like unprecedented. Obviously, climate change influenced those moments. Yes. But at the same week, we saw the DeepMind team get the first Nobel Prize for AI delivered work in a really important field in medical research and protein folding and all of the things that will come out of that. And one question that I have is, does the opportunity for AI deliver so many tools to us in terms of helping mitigate climate issues that the electricity consumed is worth it? And I'm on the fence on this topic. I'm not proposing that. But I do think that there needs to be a massive improvement in the efficiency of platforms where you're absolutely right. We will kill the planet before we can save it. And so I think that everybody has energy efficiency on the mind. This was, you know, sustainability a few years ago felt like a nice to have. Now it feels like
Starting point is 00:11:42 an imperative. And I'm really hoping that we see the right action from industry, starting with silicon design and moving forward into how we write code more efficiently and power and cool more efficiently. There's so much that can be done. I agree. I mean, last year, Andy Bechtelstein was here. He gave a talk. It was 10 minutes only.
Starting point is 00:12:03 Yeah. Talking about his aversion towards liquid cooling because he says, look, there's other ways of doing it. We should be redesigning the chips better, as he just pointed out. I'm sure there's still that thread that's ongoing. But what's more dominant is find new energy, right? Right. Liquid cooling.
Starting point is 00:12:19 Let's just make the heat go somewhere else, right? And I do worry. And you're right. AI leaders are telling us like, oh, AI could help with climate change. We could use AI to help us make data centers more efficient. AI to help us design chips that are more energy efficient. AI to come up with solutions to climate change. And I would like to believe that, but I worry that we're basically saying, again, AI is the panacea. AI will solve all. And therefore, we should, again, it's just AI is the penance here. AI will solve all. And therefore, we should, again, slave everything to building AI overlords so that they can help save us.
Starting point is 00:12:51 I'm not sure that's sound thinking. Appropriate or rational thinking. I worry. I do worry. It goes back to that moment in Jurassic Park of you were so focused on if you could, you didn't think about it, you shit. Yes. And Roy, you and I can postulate about that as they continue to build because nothing's going to slow down.
Starting point is 00:13:11 They wouldn't stop them. Right. They wouldn't stop them. There's a competition at stake. You have FOMO at the sovereign nation level. NVIDIA and Jensen is the ultimate salesperson is amazing. It's close around and convinces every nation that need their own GPU data centers and clusters and the new unit of computer data center.
Starting point is 00:13:28 So everyone needs more of that. That's not going to stop. Everyone's going to build up. It's like an arms race. And so we need a disruptor to come in and say, you can do it at a hundredth of the power. I'm hoping. Yeah. I haven't seen an OCP yet, but I'm hopeful.
Starting point is 00:13:42 There's analog computing, reversible computing, and all these different approaches. It's still too early, but maybe in the next five to 10 years, something happens. Now, I want to take it back into the networking space, because you just recently wrote a paper on AI and networking. So is AI going to solve networking too? And what are we looking at for network administrators from the hype perspective? Is AI and networking real? Is this going to actually solve part of the problems for overseeing networks? I think AI and networking is real.
Starting point is 00:14:12 Again, as always, the vendors tend to overhype, and we've spot patterns that have been there before and guiding them towards faster root cause analysis. So I would say root cause suggestions from the AIs I've seen actually being helpful. So that I agree with. I've seen it. It actually does work. I think AI associated with that sort of customer service and the carrier space, trying to figure out what's going on with the network or even being predictive and proactive about it, I've seen some elements of success already. So that I think is reasonable. The AI planning, design, configuration, I think is a work in process. I'm seeing promising early results
Starting point is 00:15:02 for small scale networks where you basically say, look, here are my goals. Here's my intent. Can you translate that to a configuration that looks like this? And I'm seeing examples where it is 90% correct or even 100% correct in typical sort of scenarios. So I'm seeing some of those elements in terms of that. Large scale design of networks, that still unclear, but AI monitoring, AI troubleshooting, definitely. AI optimization of existing networks, yes, as well.
Starting point is 00:15:32 I'm seeing that. And we've had that in the rank for some time. It's not unusual, right? Right. So we've had that for some time. That continues. So I would say promising. And across day one, day two operations, I think we're in production, in limited
Starting point is 00:15:44 production, seeing value, day zero sort of design, I think we're in production, in limited production, seeing value. Day zero sort of design, I think in process. Okay. Well, this is a space that I think I'm going to be closely monitoring. And I can't wait to see how that conversation is updated once we get to MWC next year. Yep. Last time you were on, we talked about a lot of chatbots trying to help people and force fitting additional functionality. I want to see what's real next time. One final question for you. Avid Think,
Starting point is 00:16:11 how can folks engage with you and where are you going next with your publications and the work that you're doing? Yeah, sure. No, thank you. Avidthink.com, A-V-I-D-T-H-I-N-K.com is the easiest way to find us. We have a couple of reports all in play. There's one on Gen AI, no surprise, just like everyone else. And then as well, we have a Telco Cloud on Edge that still exists. One on Karen NAS, Network as a Service. And then one on Datacenter Networking for AI and cloud workloads. No surprise. Thanks so much for being on.
Starting point is 00:16:39 It's always so fun to talk to you. I hope you have a great week at OCP. Likewise. Thanks, Alison. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
