SemiWiki.com - Podcast EP343: How Ethernet is Enabling Advances in AI with Dr. Mohan Kalkunte

Episode Date: April 24, 2026

Daniel is joined by Dr. Mohan Kalkunte, Vice President of Architecture & Technology in the Core Switch Products group at Broadcom, where he leads architecture for Ethernet switching and NIC products across data center, enterprise, and service provider markets. With over 35 years of industry experience, his previous stints include AT&T Bell Labs, AMD, and Nortel Networks.

Transcript
Starting point is 00:00:07 Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals. Welcome to the Semiconductor Insiders podcast series. My guest today is Dr. Mohan Kalkunte, Vice President of Architecture and Technology in the Core Switch Products group at Broadcom, where he leads architecture for Ethernet switching and NIC products across data center, enterprise, and service provider markets. With over 35 years of industry experience, his previous stints include AT&T Bell Labs, AMD, and Nortel Networks. He has over 150 patents, was named a Broadcom Fellow in 2009, elected IEEE Fellow in 2013, and elected to the National Academy of Engineering in 2025 for his contributions to Ethernet switching. This is quite an honor. Welcome to the podcast, Mohan. Thank you. First, let's start out. Can you tell me
Starting point is 00:00:59 what brought you to Broadcom? Yeah, so I joined a startup back in 1998 called Maverick Networks. At that time, we were working on a switch chip that was 24 ports of Fast Ethernet, which is 100 megabits per second. So if you look collectively, that was 2.4 gigabits per second for the entire chip. That was the bandwidth back in early 2000. So that was the interesting part, because nobody was doing that kind of integration back then. So that's what attracted me to the startup company, which eventually was bought by Broadcom. Now, if you fast forward to today, we have Tomahawk 6, which is 102.4 terabits of bandwidth. Now, just imagine, compare this to where I started.
Starting point is 00:01:59 That's like five orders of magnitude. You know, it's unbelievable, and we have come a really long way, right? So what attracted me to Broadcom was really the engineering excellence, the commitment to excellence, the execution part of it. And over the years, the switching group has consistently executed on the switching silicon. And that's one of the reasons we've been successful out in the marketplace today.
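To put numbers on that claim, here is a quick back-of-the-envelope check in Python. The two figures are the ones quoted above (24 ports of 100 Mb/s Fast Ethernet then, 102.4 Tb/s for Tomahawk 6 now); the rest is plain arithmetic:

```python
import math

# Figures quoted in the conversation.
maverick_bps  = 24 * 100e6   # 24 ports of Fast Ethernet (100 Mb/s) = 2.4 Gb/s total
tomahawk6_bps = 102.4e12     # Tomahawk 6: 102.4 Tb/s of switching bandwidth

growth = tomahawk6_bps / maverick_bps
print(f"growth factor: {growth:,.0f}x")                  # ~42,667x
print(f"orders of magnitude: {math.log10(growth):.1f}")  # ~4.6, i.e. roughly five
```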
Starting point is 00:02:33 Yeah, I remember Maverick Networks. I was here in Silicon Valley at the time. So what is the current typical architecture of AI networks? I mean, we're talking about AI all the time. What does a typical architecture look like? Yeah, so before we get into the architecture, we should understand why AI networking is different. So first of all, let me go back a little bit to the evolution of the last two decades leading to where we are today, right?
Starting point is 00:03:03 So back in those days, the focus was enterprise. Merchant silicon was still in its infancy. Customers were mainly OEMs. So you do the silicon, the OEMs take the silicon, vertically integrate it into their rack, chassis, or pizza-box designs, they have their own operating systems, and that's what they would sell to the end customers. Then came the hyperscalers, right, or the cloud data centers. And they wanted to move at a much faster cadence. Now, with merchant silicon and the OEMs providing these solutions, that would typically take three to four years. So if a cloud data center wanted some new feature, it would take like three to four years.
Starting point is 00:03:54 And slowly, with the cloud data centers, the whole concept of ODMs, the white boxes, the disaggregation of the hardware, the disaggregation of the NOS, the whole ecosystem started developing. Now we get into AI networking, which is kind of like that on steroids, so to speak, right? So before we get into the architecture, you know, AI changed the traffic pattern
Starting point is 00:04:20 before it changed the bandwidth requirements. So what do I mean by that? So if you look at AI training, the traffic consists of very large bandwidth flows, and they're highly synchronized. There are deterministic phases. There are large flows and small, latency-sensitive flows. And in this case, the GPUs are very expensive.
Starting point is 00:04:45 So you do not want to keep them idle. So what that means is that the network, and we call it "the network is the computer," you must have heard that term, which interconnects all these GPUs, you want to keep at a very high utilization. And for the latency of this network, latency variation matters more than the average latency.
Starting point is 00:05:11 You don't want to have any kind of congestion collapse. So therefore, the network architecture must be, I would call it, intentional. It's not a generic network architecture that you would take; rather, you need to design the network for these AI workloads.
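His point about latency variation is worth making concrete: in a synchronized phase such as an all-reduce, the step finishes only when the slowest flow finishes, so step time tracks the tail of the latency distribution, not the mean. A toy simulation (the flow counts and latencies here are illustrative, not from the episode):

```python
import random

random.seed(0)

def step_time_us(n_flows: int, mean_us: float, jitter_us: float) -> float:
    """A synchronized step completes when the SLOWEST flow completes,
    so step time is the max over flows, not the mean."""
    return max(random.gauss(mean_us, jitter_us) for _ in range(n_flows))

def average(xs):
    return sum(xs) / len(xs)

# Same 10 us mean latency across 1,024 flows; only the jitter differs.
low  = average([step_time_us(1024, 10.0, 0.5) for _ in range(200)])
high = average([step_time_us(1024, 10.0, 3.0) for _ in range(200)])
print(f"low-jitter fabric:  step ~ {low:.1f} us")   # stays close to the 10 us mean
print(f"high-jitter fabric: step ~ {high:.1f} us")  # tail-dominated, much slower
```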
Starting point is 00:05:34 So going back to your question, from that point of view the network architecture has three tiers. The first one is what I call scale-up. Now scale-up, this is what I call a tightly coupled domain. Usually you have a rack of GPUs or XPUs, and these are very tightly coupled, with very high I/O bandwidth between these GPUs. In essence, what you're trying to do is have this entire rack of GPUs
Starting point is 00:06:05 behave as one gigantic GPU with one gigantic memory. Now, if you look at it, each of these GPUs has its own HBM memory, so the memory is all distributed, but you want it to behave as if it's one gigantic memory, so that a GPU can access the memory attached to another GPU, all at very high bandwidth and low latency. So that is the scale-up domain, typically limited to one rack, now going to two racks and so on. And the domain sizes are usually like 64 or 128; we see 72 going to 144 and so on. Now you take multiple of those racks, and then you create what's called a scale-out cluster. So here you are taking multiple of those
Starting point is 00:07:05 pods and interconnecting them so that you can build several thousands of these GPUs interconnected together, and typically this is within a building. This is called the scale-out architecture, and that's where things like load balancing, congestion management, and all of that stuff come into play. Now beyond this, if you want to scale, then you have to go across buildings, and that is called scale-across, which is regional or inter-data-center connectivity, especially if you want to build very large training clusters and so on. So that's scale-across. So you have the typical architecture: you have scale-up, which is high bandwidth, primarily with proprietary interconnects, but that is moving to the open
Starting point is 00:07:45 ecosystem now. Then you have the scale out, which is largely Ethernet, and then you have scale across, which is also Ethernet Network. So those are the three typical AI networking evolution that we are seeing right now. Okay, good background. So what is Ultra Ethernet and how does it address the needs of AI networking? Yeah, so when we talk about AI network today, it's no longer about talking traditional data center traffic. Right. So we are talking about massive accelerator clusters, just like the way I talked about, right,
Starting point is 00:08:23 Okay, good background. So what is Ultra Ethernet, and how does it address the needs of AI networking? Yeah, so when we talk about AI networking today, it's no longer about traditional data center traffic. We are talking about massive accelerator clusters, just like the way I talked about, going from a few tens of GPUs within the scale-up domain to thousands, to hundreds of thousands, within a building. The speeds are going up, like 400 gig to 800 gig. You want full fabric utilization and so on. So historically, if you look, Ethernet was not designed for those kinds of workloads. So Ultra Ethernet is the industry's answer to: how can I redefine Ethernet to handle these gigantic AI
Starting point is 00:09:05 workloads? So that's what the initiative is: things like, you know, how do you do advanced congestion control, what kind of transport would you have, how do you run this at large scale, that kind of stuff. And it does not replace Ethernet. It modernizes Ethernet. It's about making Ethernet behave like a purpose-built AI fabric. So that's the intent of Ultra Ethernet. Okay. And one more. What is the UEC? Yeah, so UEC stands for
Starting point is 00:09:39 Ultra Ethernet Consortium. So this is a consortium, a standards body, and it is a full-stack standards body. Unlike, you know, traditional standards that usually focus on a single layer, UEC spans multiple areas: all the APIs and the software interfaces, and the transport protocol, so it completely redoes the transport protocol, the congestion control mechanisms, how the NIC should behave, how the switch should behave. So in effect, it is completely defining what you would call a complete AI networking stack over Ethernet. Now, within UEC, there are a couple of main working groups. I mean, there are like eight or so working groups, but the transport working group is the one that redefines the transport protocol for AI workloads. This
Starting point is 00:10:40 is for scale-out: congestion control mechanisms, like what kind of congestion control, sender-based, receiver-based, and so on. So UEC is the consortium that drives the standard so that you have a complete definition of the AI networking stack for scale-out Ethernet. Now I would like to add a couple more standards efforts here that are happening alongside UEC. In the OCP, there are a couple of efforts going on. One is called SUE, the Scale-Up Ethernet transport. And more recently, there is ESUN, Ethernet for Scale-Up Networking, where quite a few companies, Broadcom included, came together and said, hey, we need to define how Ethernet operates for the scale-up domain, including things like optimized headers, link-layer reliability, flow control, and system-level behavior. So if you think about it, UEC is a full-stack architecture for scale-out. Starting point is 00:11:53 ESUN and SUE are more of a specialization of Ethernet for scale-up networking. And together, you know, they are aligning towards a common goal: making Ethernet viable across the entire AI network stack.
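The sender-based versus receiver-based distinction above can be sketched in a few lines. This is a toy receiver-driven, credit-based model in the spirit of that discussion, not the actual UEC transport specification:

```python
class ReceiverCredits:
    """Toy receiver-driven congestion control: the receiver hands out credits
    against its own link capacity, so senders are paced at the source instead
    of piling packets into fabric queues."""

    def __init__(self, capacity_pkts: int):
        self.credits = capacity_pkts

    def grant(self, requested_pkts: int) -> int:
        granted = min(requested_pkts, self.credits)
        self.credits -= granted
        return granted

    def drain(self, delivered_pkts: int) -> None:
        self.credits += delivered_pkts  # capacity frees up as packets land

rx = ReceiverCredits(capacity_pkts=8)
print(rx.grant(6))  # 6 -- first sender gets most of the link
print(rx.grant(6))  # 2 -- second sender is paced, not dropped
rx.drain(4)         # receiver consumes four packets
print(rx.grant(6))  # 4 -- credits return as the receiver drains
```

A sender-based scheme inverts this: each source adjusts its own rate from congestion signals, such as ECN marks or delay, fed back by the network.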
Starting point is 00:12:38 Right. So how do alternatives such as NVLink and UALink compare for scale-up? You know, how do factors like open standards and time to market influence these choices? Actually, that's a very good question. I think it's best framed along two axes: one is vertical integration versus openness, and the second one is time to market. And this is where I believe things get interesting, because it's really a tradeoff between optimization and openness. With vertically integrated scale-up fabrics, obviously you get something extremely optimized: you know, you have tight hardware-software co-design, high bandwidth density, and so on. But then the tradeoffs are vendor lock-in, roadmap coupling, ecosystem concentration, and so on.
Starting point is 00:13:12 Versus the open approach, right: you have broader ecosystem participation, long-term supply flexibility, and you will have competition because you have multiple vendors. Because it's open, you know, there are multiple vendors that want to compete. So the Ethernet-based approaches, through UEC, ESUN, and the SUE transport, take a different path. While they are slightly less tightly coupled at the outset, they are rapidly improving performance. And more critically, it's open and interoperable. So the real advantage of Ethernet is, you know, time to market and scalability. Instead of waiting for a new proprietary fabric to scale, Ethernet evolves incrementally, but very quickly. Just to give an example here, the ESUN effort started in late November, December of last year, and we already have an ESUN 1.0 specification that was announced in the last few weeks.
Starting point is 00:14:28 So that's the speed at which things are moving, right? And the performance gap is also narrowing, to a point where for many deployments, you know, Ethernet can deliver comparable efficiency with far greater flexibility. Yeah, I saw some of this at GTC a couple of weeks ago. So a final question, Mohan: what is the current state of scale-out and scale-up silicon availability, you know, at 100T and 50T?
Starting point is 00:14:59 I mean, how does a 12-month cadence for next-generation products reflect the accelerating spin cycles and broader industry trends that we're seeing today? Actually, we're at a very fascinating inflection point. So we are seeing both 50T and 100T, both in scale-up and scale-out. For instance, Tomahawk Ultra, a 50T-class device, delivers high throughput, low latency, and optimized headers. That's already there; that's used for HPC and AI scale-up. And we announced Tomahawk 6, which is 100-terabit-class switching.
Starting point is 00:15:37 And this device has extremely high radix. It has a radix of 512, massive bisection bandwidth, advanced congestion control. And interestingly enough, this is used both for scale-up and scale-out. And we have the Jericho family of devices, which is used in scale-across, with its deep buffers, congestion control, telemetry, and a host of other features.
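Why radix matters can be seen from standard folded-Clos (fat-tree) arithmetic; this is a generic sketch, not a Broadcom-specific formula. At full bisection bandwidth, radix-R switches reach about R^2/2 endpoints in two tiers and R^3/4 in three:

```python
def clos_endpoints(radix: int, tiers: int) -> int:
    """Max endpoints of a full-bisection folded-Clos (fat-tree) fabric:
    R^2/2 for 2 tiers, R^3/4 for 3 tiers (half of each lower-tier
    switch's ports face down, half face up)."""
    return radix ** tiers // (2 ** (tiers - 1))

print(clos_endpoints(512, 2))  # 131072 endpoints from radix-512 switches
print(clos_endpoints(128, 3))  # 524288 -- a lower radix needs a third tier
```

So a radix-512 part keeps a 100,000-GPU-class cluster at two tiers, where a lower-radix part would need a third tier: more hops, more optics, more power, and more latency.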
Starting point is 00:16:17 So historically, if you have seen, network silicon has followed a two-to-three-year cycle. Now that is getting very compressed. And the reason is there is a faster cadence of bigger models, which in turn drives the AI workloads. There's also competitive pressure across the industry. There are advances in process technology and packaging. And it's not only just the raw bandwidth; you're also seeing innovations in co-packaged optics and linear pluggables. Recently, we announced an MSA for an optical compute interface. This is a slow-and-wide interface, and this is for scale-up, to reduce power. So it's not just faster chips alone; it is system-level
Starting point is 00:17:00 integration. So if you look at it, networking is inheriting the tempo of AI compute. In fact, networking has to be ready ahead of compute. We are even hearing six months from some of our customers. So this is a major structural industry shift that we are seeing. In other words, you do not want networking to slow compute down. When the compute is there, you want the best networking available so that you can take maximum advantage of the compute, not slow the compute down. So this is why the networking cycle is accelerating to more like a 12-month cycle. Every 12 months, we'll have to come up with newer products to satisfy this AI compute industry. Yeah, it's just an
Starting point is 00:17:56 amazing time in the semiconductor industry. AI has just, you know, turned us upside down. It's good, you know, it's disruptive, but it's amazing to see the solutions you guys are coming up with. I'm just blown away by it. And I want to thank you for taking the time out and spending some time with us today. And hopefully you'll come back and we can continue this conversation. Yeah, my pleasure. I think this is one of the most exciting transitions happening today, and thank you for the time today. I agree completely. Thank you. That concludes our podcast. Thank you all for listening, and have a great day.
