SemiWiki.com - Podcast EP272: An Overview How AI is Changing Semiconductor and System Design with Dr. Sailesh Kumar
Episode Date: January 31, 2025. Daniel Nenni is joined by Dr. Sailesh Kumar, CEO of Baya Systems. With over two decades of experience, Sailesh is a seasoned expert in SoC, fabric, I/O, memory architecture, and algorithms. Previously..., Sailesh founded NetSpeed Systems and served as its Chief Technology Officer until its successful acquisition by Intel. Sailesh…
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor
professionals. Welcome to the Semiconductor Insiders podcast series. My guest today is
Dr. Sailesh Kumar, CEO of Baya Systems. With over two decades of experience, Sailesh is a seasoned
expert in SoC, fabric, I/O, memory architecture, and algorithms.
Previously, Sailesh founded NetSpeed Systems and served as its chief technology officer until its successful acquisition by Intel.
Sailesh is also a prolific author with more than two dozen highly cited papers and over 150 patents.
Welcome to the podcast, Sailesh.
Thank you, Daniel.
By the way, congratulations on your recent funding round. That's quite impressive.
Yeah, thanks a lot. I mean, this was quick. I keep saying that at NetSpeed, raising money was one of the most difficult parts of the journey. And this time, raising money has been the easiest part of our journey, both in Series A as well as Series B.
No, that's great. So let's start with what first brought you
to the semiconductor industry.
Yeah, sure.
So I started off as an engineer.
I went to IIT Kanpur.
At that time, I was not really focused
on becoming a semiconductor engineer,
but I ended up in the EE department.
So that's how I got introduced to electrical engineering, chip design, CAD tools.
At that time, I think most of the EE folks used to end up either in chip design or in
CAD companies.
So I decided to get into the chip design space.
Then I did my PhD in networking, specifically in networking hardware design. I went
to Cisco, and at Cisco I was working on network algorithms. Within one and a half years,
fortunately, I got an opportunity to lead the highest-performance network processor project
at a company called Huawei. And I took that opportunity.
And that's how I started to really design chips, high-performance chips.
And that journey is still ongoing in some ways, because during that design process I
realized that the on-chip network is a critical problem.
And that's how the NetSpeed idea started.
And since then, you know, I have been
focused on building high-performance chips, high-performance fabrics, and high-performance
I/O and memory architectures.
Great. Yeah, I remember we met before at NetSpeed, and I guess we were at the Chiplet Summit last week, right?
Yes.
So, in your view, what is the biggest challenge
facing the semiconductor industry right now?
Yeah. So as you know, Daniel, AI is changing the entire semiconductor architecture.
The way compute is done is being totally disrupted.
Historically, for the last 50 years,
compute has been centered on CPU-based architectures, where you take a general-purpose
programming model, compile it to an instruction set, and then run it across high-performance CPUs.
And the way scaling was done was through transistor scaling, of course, but also through multi-core scaling.
But the core concept was a general-purpose programming model.
As you know, over the last three or four years, AI has really taken center stage, and a large number of application workloads are being realized through LLMs. And this trend is only accelerating as LLMs become more powerful.
It's very clear that as compute gets disrupted, AI will take center stage and LLMs will run practically most of the workloads.
And traditional CPU-based compute will become more of a control processor, which will manage all the LLMs and all the AI agents that will be running.
So as this transformation happens, the semiconductor industry has to transform as well, because this new type of compute architecture requires a complete transformation of how chips are designed and how systems are constructed.
As you know, today GPUs have taken center stage: they have a very high degree of parallelism and can therefore run these LLMs at significantly higher speeds and performance levels. But as we make more progress, I think the architecture is going to shift further and become highly heterogeneous, where different types of compute will be used for different types of workloads. For training, you may need one specialized architecture; for inference, you may need a different type of architecture. And when you build a system, you need to mix and match the best type of compute for the various types of workloads.
This transformation is already happening right now. And as we change the entire compute stack and semiconductor architecture, one of the critical bottlenecks is data movement in this architecture.
As you scale the system, as you build more parallel systems, as you grow your LLM models, you have a lot of data in high-performance memories and a lot of parallel compute capability. But how you move data efficiently between memory and compute is where the bottleneck is going to be.
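The memory-versus-compute tradeoff Sailesh describes is often reasoned about with a roofline-style calculation: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) is below the hardware's FLOPs-to-bandwidth ratio. A minimal sketch of that idea; all hardware numbers below are illustrative placeholders, not figures from the episode:

```python
# Roofline-style check: is a kernel limited by compute or by data movement?
# Attainable throughput = min(peak compute, memory bandwidth * intensity).

def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """FLOP/s achievable given peak compute, bytes/s, and FLOPs-per-byte."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

peak = 100e12  # 100 TFLOP/s of compute (illustrative)
bw = 2e12      # 2 TB/s of memory bandwidth (illustrative)

# A kernel doing 4 FLOPs per byte moved is memory-bound on this machine:
low = attainable_flops(peak, bw, 4.0)     # 8 TFLOP/s, far below peak
# A kernel doing 200 FLOPs per byte is compute-bound:
high = attainable_flops(peak, bw, 200.0)  # hits the 100 TFLOP/s peak

print(low, high)
```

The gap between the two cases is exactly why data movement, not raw compute, becomes the scaling bottleneck for large parallel systems.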
And that is actually one of NVIDIA's big strengths.
If you look at the NVIDIA architecture, I think their biggest strength is really twofold.
One is, of course, the CUDA monopoly they have on the software side. But the other advantage is that they have built an amazing system that can scale to a very large number of nodes while maintaining high performance and low latency.
So they have really mastered data movement, and that's their biggest strength.
So as other people catch up and start to build high-performance systems that can exceed NVIDIA's performance levels, they need to solve the data movement problem as well.
So I believe these are some of the big problems.
And, you know, as we solve these problems, there are new technologies coming in to help.
Chiplets are one way to scale the number of transistors you can pack into a single package, so chiplets are a vehicle to scale the system, which is definitely going to help with data movement. There are going to be challenges around packaging as we build these high-performance systems. 3D is another opportunity: when you're building these high-performance, highly scalable systems, you have to think about 3D stacking of transistors, because that gives you a new dimension of scaling. So these are some of the opportunities and challenges I think we will see as we start to transform computing.
And what are the biggest developments you see happening this year?
So this year, I believe most of the innovation is focused on AI and chiplets, and a lot of innovation will continue to be focused there.
And as we scale these systems, there are going to be innovations around how you build packages with a very large number of chiplets and how you connect those chiplets efficiently.
3D is another area where I think there is going to be
quite a bit of innovation and disruption.
At Baya, we believe that as we scale these systems,
the data movement will still be the most central problem.
And I do expect a lot of innovation
in the data movement architectures as well.
As you know, new standards are coming in to assist with building high-performance, scalable data movement.
For example, NVIDIA has historically used NVLink, which is a custom, proprietary protocol.
But now more open protocols are coming in.
UALink has become somewhat of a standard for the scale-up architectures people are using.
Ultra Ethernet is another one, for scale-out architectures.
So I do see a lot of innovation around these types of data movement solutions using standard protocols like UALink and Ultra Ethernet.
So that's another opportunity that I see where there's going to be a lot of innovation. And we are really focused on using all these new standards,
taking on these opportunities and challenges,
and building the best data movement solution
as we are designing these new types of architectures.
I agree completely.
So I forgot to ask, but can you give us a brief overview of Baya Systems
and how you fit in the industry?
Yeah, so as I said earlier, at Baya Systems our mission is to solve the data movement problem in the industry,
because we truly believe that as the systems are scaling, data movement is one of the central problems.
So when we started, that was our mission.
And then we kind of looked at what's the best way
to solve this problem.
So we obviously leaned on the prior experience
and knowledge that we had.
And we are building a platform that allows you to describe the data movement requirements at a high level of abstraction, at the system level. You can specify all the agents you have in the system, you can specify their interfaces, and you can specify what kind of bandwidth and latency they need to operate efficiently. Once you describe that system-level specification, we have built an advanced software stack that can take all of those requirements and translate them into low-level constructs, microarchitecture constructs.
So you can basically start to design the system at a much higher level of abstraction. And then an automated platform translates that abstraction into a low-level design that satisfies all of your performance requirements while making sure that the system is correct. Because it's an automated process, we have added formal methods, various graph algorithms, and various correctness methods to make sure that whatever results we produce are always correct by construction: the design satisfies the requirements and doesn't have any deadlocks or livelocks. So we do all of that. And if the requirements can't be met within the constraints you have provided, then we guide the user to refine the requirements so we can realize a solution that can be built. So that's the software stack we have built.
And the product we create from the software stack today is the data movement IP. You specify your requirements and protocols, and then we build the data movement solution and productize it as an IP produced from our software stack. Our IP can support non-coherent flows, cache-coherent flows, and high-performance data movement flows. So all protocols, different types of flows, all of those are supported.
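The flow described above (specify agents, interfaces, and bandwidth/latency needs, and have the tool flag requirements that cannot be met) can be pictured with a toy feasibility check. The spec schema below is invented for illustration and is not Baya's actual input format:

```python
# Toy system-level spec check: does each shared link have enough capacity
# for the agents routed over it? All names and numbers are illustrative.

spec = {
    "links": {"noc0": 512.0},  # link capacity in GB/s (made up)
    "agents": {
        "cpu": {"link": "noc0", "bw": 128.0},
        "gpu": {"link": "noc0", "bw": 256.0},
        "dma": {"link": "noc0", "bw": 64.0},
    },
}

def check_bandwidth(spec):
    """Return a list of (link, demand, capacity) for over-subscribed links."""
    demand = {}
    for agent in spec["agents"].values():
        demand[agent["link"]] = demand.get(agent["link"], 0.0) + agent["bw"]
    violations = []
    for link, cap in spec["links"].items():
        if demand.get(link, 0.0) > cap:
            violations.append((link, demand[link], cap))
    return violations

# 128 + 256 + 64 = 448 GB/s of demand fits in 512 GB/s of capacity:
print(check_bandwidth(spec))  # []
```

A real tool would of course reason about topology, routing, and latency as well; the point is that an infeasible specification can be detected and reported back to the user before any low-level design is generated.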
And there are a lot of details in the IP that allow it to be physically aware, be PD-friendly, and support different types of design methodologies.
So all of those are baked into the design flow and into the IP.
And those are somewhat table stakes nowadays anyway.
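One concrete example of the "correct by construction" graph checks mentioned above: a standard way to prove an interconnect deadlock-free is to build a channel dependency graph (an edge means "a packet holding channel A may wait for channel B") and verify it is acyclic. A toy sketch of that idea; the graphs and names are hypothetical, not Baya's actual implementation:

```python
# Toy deadlock check: a routing scheme is deadlock-free if its channel
# dependency graph contains no cycle. Graphs below are illustrative only.

def has_cycle(graph):
    """Iterative DFS cycle detection over a dict-of-lists adjacency graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for nxt in neighbors:
                if color[nxt] == GRAY:   # back edge: cycle, so possible deadlock
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:             # all neighbors done, retire this node
                color[node] = BLACK
                stack.pop()
    return False

# A dimension-ordered (e.g. XY) routing restriction yields an acyclic graph:
acyclic = {"c0": ["c1"], "c1": ["c2"], "c2": []}
# An unrestricted ring of channel waits can deadlock:
cyclic = {"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]}

print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```

Production flows combine checks like this with formal methods and performance analysis, but the cycle test captures the core safety argument.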
So that's the solution we have today: we have built a platform that allows you to design the architecture, analyze the architecture, and build the best organization of chiplets and the best topology.
At that abstraction level, you design and optimize the system, which gives you time-to-market advantages and acceleration.
And then from that abstraction, our software stack can go all the way down and generate all the collateral required to design the actual chiplets and packages.
So that's the high-level definition of the product we have built. I think we were talking about the Baya name earlier. The baya is a weaver bird, and our products are inspired by that. So our software is called WeaverPro and our IPs are called WeaveIPs.
And our IP can really scale from a single legacy SoC fabric all the way to an extremely large-scale multi-package network that we call NeuraScale IP.
So we have different tiers of products, and all of these are realized through the same software flow.
So what is your take on UALink and NVLink?
You mentioned them earlier, and other standards.
And more importantly, do you think they'll continue to be adopted?
Yeah, that's a great question.
So protocols have a very interesting history in the compute world.
So today, if you look at the protocol and interface standards, practically most of the IP ecosystem talks ARM standards. ARM has a non-coherent standard called AXI and a coherent standard called CHI, and most of the IP ecosystem talks those standards.
When you go to external interfaces, that's where a lot of change has happened over the years. When Intel was dominant in the data center, there were protocols like UPI and QPI, and those were the scale-up protocols in use. Intel more recently came up with the CXL protocol, which is a combination of three protocols: CXL.cache for coherent flows, CXL.mem for non-coherent memory flows, and CXL.io for PCIe flows. That had some traction a few years ago. And with all the Intel struggles, recently AMD, Google, and a few other large companies came up with new protocol standards called UALink and Ultra Ethernet, new protocols for scale-up and scale-out systems.
So these are protocols that I think have a lot of backing right now, and that's where a lot of new design starts are happening.
In parallel with these open standards, NVIDIA has its own proprietary standard called NVLink.
NVLink is, again, a suite of standards; non-coherent, coherent, all of that is supported.
And it's a proprietary standard that NVIDIA uses.
So that's where we are today: for external interfaces, NVIDIA has a custom standard that is obviously used in NVIDIA systems, and for the rest of the industry, UALink and Ultra Ethernet are new standards that are seeing wide adoption. For on-chip interfaces, ARM remains the most dominant, coherent and non-coherent, AXI and CHI. And for chiplet interfaces, UCIe has become somewhat of a standard today. UCIe basically defines the physical and link layers, and you can run any protocol on top of it.
So you can translate and transport AXI or CHI or even UALink across UCIe.
So that's the protocol landscape today.
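The layering just described, where UCIe provides the physical and link layers and AXI, CHI, or UALink rides on top, can be pictured as plain encapsulation: the upper protocol's transaction is packed into fixed-size link-layer flits and reassembled on the far side. A deliberately simplified sketch; the field names and the 8-byte flit size are made up for illustration and are not from the UCIe specification:

```python
# Toy encapsulation: carry an upper-layer transaction (e.g. an AXI-style
# read request) over a flit-based link layer. Sizes/fields are illustrative.

from dataclasses import dataclass

FLIT_PAYLOAD_BYTES = 8  # made-up flit payload size

@dataclass
class Transaction:
    protocol: str   # e.g. "AXI", "CHI", "UALink"
    payload: bytes

def to_flits(txn: Transaction) -> list[bytes]:
    """Split a transaction payload into fixed-size flits (last one padded)."""
    data = txn.payload
    flits = []
    for i in range(0, len(data), FLIT_PAYLOAD_BYTES):
        chunk = data[i:i + FLIT_PAYLOAD_BYTES]
        flits.append(chunk.ljust(FLIT_PAYLOAD_BYTES, b"\x00"))
    return flits

def from_flits(flits: list[bytes], length: int, protocol: str) -> Transaction:
    """Reassemble the original transaction from its flits."""
    return Transaction(protocol, b"".join(flits)[:length])

req = Transaction("AXI", b"READ 0x1000 len=64")
flits = to_flits(req)
rebuilt = from_flits(flits, len(req.payload), req.protocol)
assert rebuilt == req  # the link layer is agnostic to the upper protocol
```

The key design point this illustrates is that the link layer never interprets the payload, which is why one chiplet interface can carry multiple upper protocols.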
And I think protocols are very important because they allow the ecosystem to be built: an ecosystem of IPs, an ecosystem of chiplets, an ecosystem of verification IPs. So the fact that all these protocols have become somewhat of a standard today is definitely helping us a lot in building our product. We support every protocol that we talked about, all of these except NVLink, which is a proprietary protocol.
Right. Yeah, I think standards are going to be even more important with chiplets, right? I want to ask you about DeepSeek. It's all in the news. It looks like they're taking a less data-heavy, lower-cost approach to LLMs. What can the semiconductor industry do to facilitate the development of more models like that, given the current costs of chips, you know, GPUs like you mentioned, NVIDIA?
Yeah, yeah, definitely.
I mean, DeepSeek has really created quite a stir in the industry.
So I believe it's definitely great progress in improving the efficiency of AI compute and LLMs.
They have shown that you can get the same level of fidelity and quality with 100x fewer resources, which is amazing. Having said that, I believe there is always a need for the most capable AI. We are way behind, for example, the AGI-level capabilities that humans have; AI is still far behind that. So there's always a need to expand the capabilities of AI, which basically means increasing the model size, the model accuracy, and so on. So the fact that we can run the current models with 100x fewer resources is actually amazing, because it gives you an opportunity to build 100x bigger models with the same hardware we have today.
So I believe there was quite a bit of a stir recently, but very quickly people will realize, oh, now I can build 100x bigger models.
So I don't think it is going to change the trajectory.
We are still on the same trajectory: building bigger systems, more scale-up, more scale-out, more compute, more data, more memory. There's always a need for that type of application.
At the same time, there's going to be, I think, growth in edge compute and edge AI, where we need more efficiency and we don't need unlimited, infinite intelligence; we need very focused intelligence. That's where this efficiency will have the most impact, because now you can have intelligence practically everywhere: intelligence in your equipment, intelligence in your car, intelligence in every edge device. So making AI 100x more efficient is amazing for that entire segment.
So I believe this is only going to accelerate AI, because once you make AI 100x more efficient, it's going to accelerate the edge, but also accelerate core data center AI, because we are way behind where we ideally need to be to get to AGI. So I look at this as a great opportunity.
And to be honest, I was quite surprised that when this news came out, the stock market plummeted.
I was surprised because, as you know, as compute becomes more efficient, compute only grows; it doesn't decline.
So that's my view on this whole change.
Yeah, I agree completely. I think it was interesting that they were so open about this and open source, but I think there's a lot more investigation we're going to need to do. You know, the final question is: I was looking at your website, and you have an impressive set of investors and board members. I mean, Jim Keller is on your board. Can you talk a little bit about the importance of relationships as an entrepreneur in the semiconductor industry?
Yeah, I think it's super important. As a founder, as a startup guy, I think in this whole Silicon Valley there's obviously a lot of innovation, a lot of talent, all of that, but there's also a strong network effect. The fact that engineers, innovators, investors, board members, and customers are all connected is really what keeps this machine going.
So it's absolutely critical, I think.
That's the reason it's very difficult to recreate or reproduce Silicon Valley: the network effect we have here. Jim is absolutely an amazing guy. I
first met Jim back at NetSpeed when he was still at Tesla, and he chose to use the NetSpeed fabric in the Tesla Autopilot silicon. And he was so pleased with that decision that when he went to Intel, he immediately put in an offer to acquire NetSpeed. So that's how we ended up there at Intel. I worked very closely with Jim for about four years, and we really got along; I really liked the way he thinks. Jim had a lot of ideas, by the way, to save or fix Intel. Unfortunately, he did not get to execute a lot of those things. But that's where I saw the depth and insight that Jim has. And that's the reason that the moment Jim called and asked me to start this company, I literally jumped that day and started to collect the right set of people and started the whole journey.
So absolutely, the kind of people you have in your network is super important for success. Stan is also a key person in this company; he was the first investor, and he's a GP at Matrix.
So he's absolutely amazing. And as you probably heard, as part of Series B,
we are expanding our board. We are bringing in a new set of investors. And these are not just
investors. These are the people who are very seasoned execs who understand how the system
works. And they are going to help us and guide us in growing the company.
That's great advice. Thank you for your time, Sailesh, and I look forward to talking to you again soon.
Yeah, absolutely. Pleasure talking to you. Thanks, Daniel.
That concludes our podcast. Thank you all for listening and have a great day.