Semiconductor Insiders - Podcast EP337: The Importance of Network Communications to Enable AI Workloads with Abhinav Kothiata
Episode Date: March 27, 2026. Daniel is joined by Abhinav Kothiata, a principal product manager for the Synopsys Ethernet IP portfolio. He has over 12 years of experience across engineering and product management, spanning SoC design, functional verification, and building wireless connectivity platforms and IoT products. He also holds two patents in circuit design.
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals.
Welcome to the Semiconductor Insiders podcast series.
My guest today is Abhinav Kothiata, a principal product manager for the Synopsys Ethernet IP portfolio.
He has over 12 years of experience across engineering and product management, spanning SoC design, functional verification, and building wireless connectivity platforms and IoT products.
He holds two patents in circuit design.
Welcome to the podcast, Abhinav.
Thanks, Dan. Glad to be here.
So first, can you tell us what brought you to Synopsys?
Do you have a story you can share?
Sure. I came into this through a pretty hands-on path.
I did my bachelor's in electronics engineering because I was always fascinated by physical devices.
As a teenager, I would be taking apart consumer electronics and trying to understand what's inside them and what makes them tick.
I then started my career as a digital design engineer at Qualcomm.
I started off right in the semiconductor industry and have been in that since.
I later went on to do my master's in engineering at Cornell and worked in multiple engineering roles before moving into program management and then product management. That shift was driven by a desire to understand how businesses operate at scale and how product-level, customer-centric thinking can drive value creation. What really motivated me over time was realizing that technologies like accelerators, interconnects, and computing infrastructure are really the foundation that shapes how innovation reaches people's lives, whether that's through AI, cloud services, or the products we interact with every day. So Synopsys felt like a place where I could take my longstanding interest in electronics and semiconductors and be very close to that future-shaping layer of technology. Working on these foundational building blocks that enable large-scale AI systems felt like a way to connect deep engineering with real long-term impact.
Great story.
So let's start with the big picture.
AI workloads have exploded in scale,
and they stress the data center in very different ways
than traditional cloud workloads.
From your perspective, why is conventional Ethernet
no longer sufficient, especially for communication inside
accelerator-dense AI pods?
Sure.
If you zoom out, traditional Ethernet was designed
as a best effort network.
It's very good at moving large volumes of data efficiently,
but it fundamentally assumes that occasional packet loss,
retries, and variable delay are acceptable.
That works well for most general cloud workloads,
but AI workloads are very different.
Training and inference jobs are highly synchronized:
hundreds or even thousands of accelerators
compute in parallel, exchange data,
and then wait for everyone to finish
before moving to the next step.
In that world, the network isn't really judged by its average performance; it's judged by the slowest packet.
This is what people mean by tail latency.
Even if 99% of the data arrives quickly, a single delayed packet can stall the entire cluster,
leaving accelerators sitting idle and significantly slowing training or inference time.
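To make the tail-latency effect concrete, here is a minimal simulation sketch (the latency distribution and all numbers are invented for illustration, not measurements): each synchronized step finishes only when the slowest of N messages arrives, so even a 0.1% chance of a slow message dominates once thousands of workers are involved.

```python
import random

def step_time(n_workers, p_slow=0.001, fast_ms=2.0, slow_ms=50.0):
    # A synchronized step completes only when the slowest message lands.
    return max(slow_ms if random.random() < p_slow else fast_ms
               for _ in range(n_workers))

random.seed(0)
for n in (8, 256, 4096):
    avg = sum(step_time(n) for _ in range(1000)) / 1000
    print(f"{n:5d} workers: avg step time {avg:5.1f} ms")

# Even though 99.9% of messages take 2 ms, a 4096-worker step almost
# always waits on the 50 ms tail, idling every accelerator in the group.
```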
Packet loss makes this even worse.
It leads to retransmissions, which stretch out the overall job completion time.
That's why AI workloads tend to demand lossless or near-lossless behavior and much more
deterministic latency than traditional Ethernet was ever designed to provide.
Conventional Ethernet can be pushed in that direction, but it wasn't built from the ground up for
these communication patterns.
That's why the industry has started looking at more specialized scale-up fabrics that are optimized
for treating the cluster as a single tightly coupled system.
Right.
So given those constraints, what does the network need to do so that a multi-node AI cluster
behaves more like a single tightly coordinated machine rather than a loose collection
of servers?
The simplest way to think about scale-up is that the network has to entirely disappear
from the programmer's point of view.
In a true scale-up system, accelerators shouldn't feel like they're talking over a network
at all.
They should behave as if they're part of one unified machine, with communication that feels
local, not like every interaction has to be packaged, transmitted, and then unpacked again. To make that
possible, the network needs to have extremely low and predictable latency. It's not enough to be fast
on average; the communication has to be deterministic enough that all accelerators make forward
progress together. Efficiency is another key piece. Bandwidth efficiency directly impacts power,
cost, and scalability. If the network wastes bandwidth or forces retries, you have to compensate by
adding more links and more power, which quickly becomes unsustainable at rack scale. So scale-up networking
is really about enabling ultra-fast, deterministic, and bandwidth-efficient communication at scale, in a way
that lets thousands of accelerators operate like a single logical system. That's what allows
rack-scale designs to deliver performance gains that go far beyond what you can achieve from
chip-level improvements alone.
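As a back-of-the-envelope illustration of that efficiency point (all numbers below are invented for the sketch), header overhead and retransmissions both subtract directly from the usable bandwidth of each link, and every lost gigabit has to be bought back with more links and more power:

```python
def goodput_gbps(line_rate_gbps, payload_bytes, header_bytes, retry_rate):
    # Usable bandwidth after protocol overhead and retransmitted traffic.
    efficiency = payload_bytes / (payload_bytes + header_bytes)
    return line_rate_gbps * efficiency * (1.0 - retry_rate)

# Hypothetical 800G link: bulky headers plus 2% retries vs. compact
# headers on a lossless fabric.
lossy    = goodput_gbps(800, payload_bytes=256, header_bytes=64, retry_rate=0.02)
lossless = goodput_gbps(800, payload_bytes=256, header_bytes=16, retry_rate=0.0)
print(f"lossy fabric:    {lossy:.0f} Gb/s usable")    # ~627 Gb/s
print(f"lossless fabric: {lossless:.0f} Gb/s usable") # ~753 Gb/s
```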
Okay. So to meet these performance demands, new networking approaches and open standards such
as UALink and ESUN are emerging. Can you explain what problems these standards are trying
to solve and how they differ? Absolutely. At a high level, both are trying to solve the same
core problem: how do you make an open, multi-vendor fabric that lets hundreds or even thousands of
accelerators operate as a single entity without relying on proprietary, vertically integrated solutions?
However, they have very different starting points. ESUN begins with Ethernet and asks how far we can
push Ethernet to support scale-up workloads. It focuses on evolving Ethernet behavior,
adding things like lossless delivery, efficient headers, and more predictable switching so that
Ethernet can reliably support scale-up traffic patterns. It also takes a broader view of traffic flow,
such as cases where traffic needs to transition seamlessly between scale-up and Ethernet-based
scale-out domains. It's also a good fit for environments that value the reuse of existing
Ethernet infrastructure. UALink, on the other hand, is designed from the
ground up for workloads that want the pod itself to behave like a single tightly coupled machine.
It provides memory-semantic communication, allowing accelerators to directly read and write each
other's memory. That makes it especially well suited for synchronized AI training workloads
where fine-grained coordination and deterministic behavior are critical to performance and efficiency.
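The UALink wire protocol and software interface are defined by the consortium's specification; purely as a conceptual sketch (every class and method name below is invented for illustration and is not the UALink API), memory-semantic communication replaces the pack/send/receive/unpack cycle with direct reads and writes into a peer's memory:

```python
# Conceptual model only; these names are invented, not the UALink API.

class MessagePassingPeer:
    """Every exchange is packaged, transmitted, and unpacked again."""
    def __init__(self):
        self.inbox = []
    def send(self, payload: bytes):
        self.inbox.append(len(payload).to_bytes(4, "big") + payload)  # packetize
    def receive(self) -> bytes:
        return self.inbox.pop(0)[4:]  # strip the header back off

class MemorySemanticPeer:
    """Communication looks like a local load/store into a peer's memory."""
    def __init__(self, size: int):
        self.memory = bytearray(size)
    def store(self, addr: int, data: bytes):          # remote write
        self.memory[addr:addr + len(data)] = data
    def load(self, addr: int, length: int) -> bytes:  # remote read
        return bytes(self.memory[addr:addr + length])

peer = MemorySemanticPeer(1024)
peer.store(0, b"gradient shard")   # one fine-grained write, no packing step
print(peer.load(0, 14))            # b'gradient shard'
```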
In practice, these approaches are often complementary. Ethernet-based scale-up networking
provides operational familiarity and an easy transition to Ethernet-based scale-out networks,
while memory-semantic fabrics like UALink enable tightly coupled accelerator-to-accelerator
communication within a pod. Together, they reflect how the industry is matching different
scale-up technologies to different types of AI workloads and deployment environments.
Okay, so from your perspective, what role do IP providers play in accelerating adoption of these new fabrics
and in helping the ecosystem move toward real deployment?
IP providers play a critical translation role in the ecosystem.
Defining a new fabric or framework specification is one thing,
but turning it into something teams can actually build, integrate, and deploy at scale is another.
From an IP provider's perspective, the job is to take the architectural intent behind these new scale-up fabrics
and turn it into pre-verified, silicon-ready building blocks, such as high-speed PHY IP or controller IP, that vendors can adopt and integrate with confidence without any proprietary lock-in.
That work typically starts early in close collaboration with standards bodies and ecosystem partners so that when a specification is finalized, silicon-ready IP and validation collateral are already in place.
This early alignment is what allows ecosystems to move quickly once the standard is published.
Interoperability is another key piece.
Open fabrics only succeed if implementations from different vendors work together predictably and seamlessly.
IP providers help enable that by aligning with ecosystem partners, participating in compliance initiatives,
and making sure designs converge on common, interoperable behavior before deployment in the data center.
In that sense, IP providers help bridge the gap between innovation and real deployment,
accelerating time to market while giving system designers the flexibility to mix and match
components in an open, multi-vendor ecosystem.
Great. So looking ahead, what capabilities do you think will matter most for future scale-up
fabrics as AI data centers continue to grow in scale?
Looking ahead, a few themes consistently show up as scale-up fabrics
evolve. One is the need to go faster: higher per-lane physical-layer bandwidth, moving from 200-gigabit
lane speeds to 400-gigabit lane speeds and beyond, as training workloads demand more
performance and efficiency from the underlying fabric. The second is going wider. We're seeing
a push toward larger scale-up domains, with more accelerators participating in a single fabric,
as AI model sizes continue to grow. That puts pressure
on building higher-radix switch designs with more lanes.
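For a rough sense of why radix matters (a sketch assuming a simple non-blocking folded-Clos topology, not any particular product), a single-tier domain is capped at the switch radix, while each added tier buys reach at the cost of extra hops:

```python
def max_endpoints(radix, tiers):
    # Approximate non-blocking endpoint count for a folded-Clos fabric:
    # one tier uses every port as an endpoint; deeper tiers split each
    # switch's ports half-down, half-up.
    if tiers == 1:
        return radix
    return 2 * (radix // 2) ** tiers

for radix in (64, 128, 256):
    print(f"radix {radix:3d}: 1 tier -> {max_endpoints(radix, 1):6d}, "
          f"2 tiers -> {max_endpoints(radix, 2):6d}")
```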
The third is going farther. Scale-up is no longer confined to a single rack. As scale-up domains
grow in size, physical reach starts to matter a lot, which is why you see
increasing interest in optical interconnects and new packaging technologies to
extend the reach of scale-up fabrics beyond traditional copper limits. Finally, there's a
growing focus on moving intelligence into the fabric itself.
One example of this is in-network collectives, also sometimes referred to as INC, where operations
are partially executed in the network rather than entirely at accelerator nodes.
This way, the fabric can cut down on redundant data movement, which keeps accelerators busy and
minimizes wasted compute and idle time.
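A simplified model of why switch-side aggregation helps (the fan-in of 32 and the 100 MiB of gradients are invented numbers): if a leaf switch reduces its nodes' vectors before forwarding, upstream traffic drops by the fan-in factor instead of carrying every copy.

```python
def upstream_bytes(data_bytes, nodes_per_leaf, in_network):
    # Bytes a leaf switch forwards upward during one reduction step:
    # without in-network compute it relays every node's full vector;
    # with it, it reduces locally and forwards a single vector.
    return data_bytes if in_network else nodes_per_leaf * data_bytes

D = 100 * 2**20  # 100 MiB of gradients per node (illustrative)
for inc in (False, True):
    mib = upstream_bytes(D, nodes_per_leaf=32, in_network=inc) / 2**20
    print(f"in-network collective = {inc}: {mib:6.0f} MiB forwarded upstream")
```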
All of these trends directly influence how future
AI data centers will be built. The network is no longer a background connectivity
layer, but instead a critical part of the compute architecture that shapes how accelerators
scale, synchronize, and deliver value.
The most successful designs will be the ones that treat the fabric as a co-design decision,
balancing performance, efficiency, and openness as AI workloads continue to evolve.
Great conversation. Thank you very much. It's a pleasure to meet you and hopefully we can
talk again sometime.
Thank you, Dan. Likewise. It was great being here and
talking about these open standards such as UALink and ESUN
and how they are shaping the AI infrastructure world.
Yeah, you know, I can't think of a more exciting time in the semiconductor industry.
I've been here for over 40 years.
You are very lucky to be in this position because so many things are going to happen
in the next 10 to 20 years.
It's going to be incredible.
Yeah, I totally agree. I totally agree.
That concludes our podcast.
Thank you all for listening and have a great day.
