Semiconductor Insiders - Podcast EP337: The Importance of Network Communications to Enable AI Workloads with Abhinav Kothiata
Episode Date: March 27, 2026. Daniel is joined by Abhinav Kothiata, a principal product manager for the Synopsys Ethernet IP portfolio. He has over 12 years of experience across engineering and product management, spanning SoC design, functional verification, and building wireless connectivity platforms and IoT products. He also holds two patents in circuit design.
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals.
Welcome to the Semiconductor Insiders podcast series.
My guest today is Abhinav Kothiata, a principal product manager for the Synopsys Ethernet IP portfolio.
He has over 12 years of experience across engineering and product management, spanning SoC design, functional verification, and building wireless connectivity platforms and IoT products.
He holds two patents in circuit design.
Welcome to the podcast, Abhinav.
Thanks, Dan. Glad to be here.
So first, can you tell us what brought you to Synopsys?
Do you have a story you can share?
Sure. I came into this through a pretty hands-on path.
I did my bachelor's in electronics engineering because I was always fascinated by physical devices.
As a teenager, I would be taking apart consumer electronics and trying to understand what's inside them and what makes them tick.
I then started my career as a digital design engineer at Qualcomm.
I started off right in the semiconductor industry and have been in that since.
I later went on to do my master's in engineering at Cornell and worked in multiple engineering roles before moving into program management and then product management. That shift was driven by a desire to understand how businesses operate at scale and how product-level, customer-centric thinking can drive value creation. What really motivated me over time was realizing that technologies like accelerators, interconnects, and computing infrastructure are really the foundation that shapes how innovation reaches people's lives, whether that's through AI, cloud services, or the products we interact with every day. So Synopsys felt like a place where I could take my longstanding interest in electronics and semiconductors and be very close to that future-shaping layer of technology. Working on these foundational building blocks that enable large-scale AI systems felt like a way to connect deep engineering with real long-term impact.
Great story.
So let's start with the big picture.
AI workloads have exploded in scale,
and they stress the data center in very different ways
than traditional cloud workloads.
From your perspective, why is conventional Ethernet
no longer sufficient, especially for communication inside
accelerator-dense AI pods?
Sure.
If you zoom out, traditional Ethernet was designed
as a best effort network.
It's very good at moving large volumes of data efficiently,
but it fundamentally assumes that occasional packet loss,
retries, and variable delay are acceptable.
That works well for most general cloud workloads,
but AI workloads are very different.
Training and inference jobs are highly synchronized:
hundreds or even thousands of accelerators
compute in parallel, exchange data,
and then wait for everyone to finish
before moving to the next step.
In that world, the network isn't really judged by its average performance; it's judged by the slowest packet.
This is what people mean by tail latency.
Even if 99% of the data arrives quickly, a single delayed packet can stall the entire cluster,
leaving accelerators sitting idle and significantly slowing training or inference time.
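To make the tail-latency effect concrete, here is a minimal simulation sketch (the latency distribution and all numbers are invented for illustration, not measurements): each synchronized step finishes only when the slowest of N messages arrives, so even a 0.1% chance of a slow message dominates once thousands of workers are involved.

```python
import random

def step_time(n_workers, p_slow=0.001, fast_ms=2.0, slow_ms=50.0):
    # A synchronized step completes only when the slowest message lands.
    return max(slow_ms if random.random() < p_slow else fast_ms
               for _ in range(n_workers))

random.seed(0)
for n in (8, 256, 4096):
    avg = sum(step_time(n) for _ in range(1000)) / 1000
    print(f"{n:5d} workers: avg step time {avg:5.1f} ms")

# Even though 99.9% of messages take 2 ms, a 4096-worker step almost
# always waits on the 50 ms tail, idling every accelerator in the group.
```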
Packet loss makes this even worse.
It leads to retransmissions, which stretch out the overall job completion time.
That's why AI workloads tend to demand lossless or near-lossless behavior and much more
deterministic latency than traditional Ethernet was ever designed to provide.
Conventional Ethernet can be pushed in that direction, but it wasn't built from the ground up for
these communication patterns.
That's why the industry has started looking at more specialized scale-up fabrics that are optimized
for treating the cluster as a single tightly coupled system.
Right.
So given those constraints, what does the network need to do so that a multi-node AI cluster
behaves more like a single tightly coordinated machine rather than a loose collection
of servers?
The simplest way to think about scale-up is that the network has to entirely disappear
from the programmer's point of view.
In a true scale-up system, accelerators shouldn't feel like they're talking over a network
at all.
They should behave as if they're part of one unified machine, with communication that feels
local, not like every interaction has to be packaged, transmitted, and then unpacked again. To make that
possible, the network needs to have extremely low and predictable latency. It's not enough to be fast
on average; the communication has to be deterministic enough that all accelerators make forward
progress together. Efficiency is another key piece. Bandwidth efficiency directly impacts power,
cost, and scalability. If the network wastes bandwidth or forces retries, you have to compensate by
adding more links and more power, which quickly becomes unsustainable at rack scale. So scale-up networking
is really about enabling ultra-fast, deterministic, and bandwidth-efficient communication at scale, in a way
that lets thousands of accelerators operate like a single logical system. That's what allows
rack-scale designs to deliver performance gains that go far beyond what you can achieve from
chip-level improvements alone.
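As a back-of-the-envelope illustration of that efficiency point (all numbers below are invented for the sketch), header overhead and retransmissions both subtract directly from the usable bandwidth of each link, and every lost gigabit has to be bought back with more links and more power:

```python
def goodput_gbps(line_rate_gbps, payload_bytes, header_bytes, retry_rate):
    # Usable bandwidth after protocol overhead and retransmitted traffic.
    efficiency = payload_bytes / (payload_bytes + header_bytes)
    return line_rate_gbps * efficiency * (1.0 - retry_rate)

# Hypothetical 800G link: bulky headers plus 2% retries vs. compact
# headers on a lossless fabric.
lossy    = goodput_gbps(800, payload_bytes=256, header_bytes=64, retry_rate=0.02)
lossless = goodput_gbps(800, payload_bytes=256, header_bytes=16, retry_rate=0.0)
print(f"lossy fabric:    {lossy:.0f} Gb/s usable")    # ~627 Gb/s
print(f"lossless fabric: {lossless:.0f} Gb/s usable") # ~753 Gb/s
```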
Okay. So to meet these performance demands, new networking approaches and open standards such
as UALink and ESUN are emerging. Can you explain what problems these standards are trying
to solve and how they differ? Absolutely. At a high level, both are trying to solve the same
core problem: how do you make an open, multi-vendor fabric that lets hundreds or even thousands of
accelerators operate as a single entity without relying on proprietary, vertically integrated solutions?
However, they have very different starting points. ESUN begins with Ethernet and asks how far we can
push Ethernet to support scale-up workloads. It focuses on evolving Ethernet behavior,
adding things like lossless delivery, efficient headers, and more predictable switching so that
Ethernet can reliably support scale-up traffic patterns. It also takes a broader view of traffic flow,
such as cases where traffic needs to transition seamlessly between scale-up and Ethernet-based
scale-out domains. It's also a good fit for environments that value the reuse of existing
Ethernet infrastructure. UALink, on the other hand, is designed from the
ground up for workloads that want the pod itself to behave like a single tightly coupled machine.
It provides memory-semantic communication, allowing accelerators to directly read and write each
other's memory. That makes it especially well suited for synchronized AI training workloads
where fine-grained coordination and deterministic behavior are critical to performance and efficiency.
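The UALink wire protocol and software interface are defined by the consortium's specification; purely as a conceptual sketch (every class and method name below is invented for illustration and is not the UALink API), memory-semantic communication replaces the pack/send/receive/unpack cycle with direct reads and writes into a peer's memory:

```python
# Conceptual model only; these names are invented, not the UALink API.

class MessagePassingPeer:
    """Every exchange is packaged, transmitted, and unpacked again."""
    def __init__(self):
        self.inbox = []
    def send(self, payload: bytes):
        self.inbox.append(len(payload).to_bytes(4, "big") + payload)  # packetize
    def receive(self) -> bytes:
        return self.inbox.pop(0)[4:]  # strip the header back off

class MemorySemanticPeer:
    """Communication looks like a local load/store into a peer's memory."""
    def __init__(self, size: int):
        self.memory = bytearray(size)
    def store(self, addr: int, data: bytes):          # remote write
        self.memory[addr:addr + len(data)] = data
    def load(self, addr: int, length: int) -> bytes:  # remote read
        return bytes(self.memory[addr:addr + length])

peer = MemorySemanticPeer(1024)
peer.store(0, b"gradient shard")   # one fine-grained write, no packing step
print(peer.load(0, 14))            # b'gradient shard'
```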
In practice, these approaches are often complementary. Ethernet-based scale-up networking
provides operational familiarity and an easy transition to Ethernet-based scale-out networks,
while memory-semantic fabrics like UALink enable tightly coupled accelerator-to-accelerator
communication within a pod. Together, they reflect how the industry is matching different
scale-up technologies to different types of AI workloads and deployment environments.
Okay, so from your perspective, what role do IP providers play in accelerating adoption of these new fabrics
and in helping the ecosystem move toward real deployment?
IP providers play a critical translation role in the ecosystem.
Defining a new fabric or framework specification is one thing,
but turning it into something teams can actually build, integrate, and deploy at scale is another.
From an IP provider's perspective, the job is to take the architectural intent behind these new scale-up fabrics
and turn it into pre-verified, silicon-ready building blocks, such as high-speed PHY IP or controller IP, that vendors can adopt and integrate with confidence without any proprietary lock-in.
That work typically starts early in close collaboration with standards bodies and ecosystem partners so that when a specification is finalized, silicon-ready IP and validation collateral are already in place.
This early alignment is what allows ecosystems to move quickly once the standard is published.
Interoperability is another key piece.
Open fabrics only succeed if implementations from different vendors work together predictably and seamlessly.
IP providers help enable that by aligning with ecosystem partners, participating in compliance initiatives,
and making sure designs converge on common, interoperable behavior before deployment in the data center.
In that sense, IP providers help bridge the gap between innovation and real deployment,
accelerating time to market while giving system designers the flexibility to mix and match
components in an open, multi-vendor ecosystem.
Great. So looking ahead, what capabilities do you think will matter most for future scale-up
fabrics as AI data centers continue to grow in scale?
Looking ahead, a few themes consistently show up as scale-up fabrics
evolve. One is the need to go faster: higher per-lane physical-layer bandwidth, moving from 200-gigabit
lane speeds to 400-gigabit lane speeds and beyond, as training workloads demand more
performance and efficiency from the underlying fabric. The second is going wider. We're seeing
a push toward larger scale-up domains, with more accelerators participating in a single fabric,
as AI model sizes continue to grow. That puts pressure
on building higher-radix switch designs with more lanes.
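For a rough sense of why radix matters (a sketch assuming a simple non-blocking folded-Clos topology, not any particular product), a single-tier domain is capped at the switch radix, while each added tier buys reach at the cost of extra hops:

```python
def max_endpoints(radix, tiers):
    # Approximate non-blocking endpoint count for a folded-Clos fabric:
    # one tier uses every port as an endpoint; deeper tiers split each
    # switch's ports half-down, half-up.
    if tiers == 1:
        return radix
    return 2 * (radix // 2) ** tiers

for radix in (64, 128, 256):
    print(f"radix {radix:3d}: 1 tier -> {max_endpoints(radix, 1):6d}, "
          f"2 tiers -> {max_endpoints(radix, 2):6d}")
```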
The third is going farther. Scale-up is no longer confined to a single rack. As scale-up domains
grow in size, physical reach starts to matter a lot, which is why you see
increasing interest in optical interconnects and new packaging technologies to
extend the reach of scale-up fabrics beyond traditional copper limits. Finally, there's a
growing focus on moving intelligence into the fabric itself.
One example of this is in-network collectives, also sometimes referred to as INC, where operations
are partially executed in the network rather than entirely at accelerator nodes.
This way, the fabric can cut down on redundant data movement, which keeps accelerators busy and
minimizes wasted compute and idle time.
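A simplified model of why switch-side aggregation helps (the fan-in of 32 and the 100 MiB of gradients are invented numbers): if a leaf switch reduces its nodes' vectors before forwarding, upstream traffic drops by the fan-in factor instead of carrying every copy.

```python
def upstream_bytes(data_bytes, nodes_per_leaf, in_network):
    # Bytes a leaf switch forwards upward during one reduction step:
    # without in-network compute it relays every node's full vector;
    # with it, it reduces locally and forwards a single vector.
    return data_bytes if in_network else nodes_per_leaf * data_bytes

D = 100 * 2**20  # 100 MiB of gradients per node (illustrative)
for inc in (False, True):
    mib = upstream_bytes(D, nodes_per_leaf=32, in_network=inc) / 2**20
    print(f"in-network collective = {inc}: {mib:6.0f} MiB forwarded upstream")
```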
All of these trends directly influence how future
AI data centers will be built. The network is no longer a background connectivity
layer, but instead a critical part of the compute architecture that shapes how accelerators
scale, synchronize, and deliver value.
The most successful designs will be the ones that treat the fabric as a co-design decision,
balancing performance, efficiency, and openness as AI workloads continue to evolve.
Great conversation. Thank you very much. It's a pleasure to meet you and hopefully we can
talk again sometime.
Thank you, Dan. Likewise. It was great being here and
talking about these open standards such as UALink and ESUN
and how they are shaping the AI infrastructure world.
Yeah, you know, I can't think of a more exciting time in the semiconductor industry.
I've been here for over 40 years.
You are very lucky to be in this position because so many things are going to happen
in the next 10 to 20 years.
It's going to be incredible.
Yeah, I totally agree. I totally agree.
That concludes our podcast.
Thank you all for listening and have a great day.
