SemiWiki.com - Podcast EP272: An Overview How AI is Changing Semiconductor and System Design with Dr. Sailesh Kumar
Episode Date: January 31, 2025. Daniel Nenni is joined by Dr. Sailesh Kumar, CEO of Baya Systems. With over two decades of experience, Sailesh is a seasoned expert in SoC, fabric, I/O, memory architecture, and algorithms. Previously..., Sailesh founded NetSpeed Systems and served as its Chief Technology Officer until its successful acquisition by Intel. Sailesh…
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor
professionals. Welcome to the Semiconductor Insiders podcast series. My guest today is
Dr. Sailesh Kumar, CEO of Baya Systems. With over two decades of experience, Sailesh is a seasoned
expert in SoC, fabric, I/O, memory architecture, and algorithms.
Previously, Sailesh founded NetSpeed Systems and served as its chief technology officer until its successful acquisition by Intel.
Sailesh is also a prolific author with more than two dozen highly cited papers and over 150 patents.
Welcome to the podcast, Sailesh.
Thank you, Daniel.
By the way, congratulations on your recent funding round. That's quite impressive.
Yeah, thanks a lot. I mean, this was quick. I keep saying that at NetSpeed, raising money was one of the most difficult parts of the journey. And this time, raising money has been the easiest part of our journey, both in Series A as well as Series B.
No, that's great. So let's start with what first brought you
to the semiconductor industry.
Yeah, sure.
So I started off as an engineer.
I went to IIT Kanpur.
At that time, I was not really focused
on becoming a semiconductor engineer,
but I ended up in the EE department.
So that's how I got introduced to electrical engineering, chip design, CAD tools.
At that time, I think most of the EE folks used to end up either in chip design or in
CAD companies.
So I decided to get into the chip design space.
Then I did my PhD in networking, specifically in networking hardware design. I went
to Cisco, and at Cisco I was working on network algorithms. Within one and a half years,
fortunately, I got an opportunity to lead the highest-performance network processor project
at a company called Huawei. And I took that opportunity.
And that's how I started to really design chips, high-performance chips.
And that journey is still ongoing in some ways, because during that design process I
realized that the on-chip network is a critical problem.
And that's how the NetSpeed idea started.
And since then, you know, I have been
focused on building high-performance chips, high-performance fabrics, and high-performance
I/O and memory architectures.
Great. Yeah, I remember we met before at NetSpeed, and I guess we were at the Chiplet Summit last week, right?
Yes.
So, in your view, what is the biggest challenge
facing the semiconductor industry right now?
Yeah. So as you know, Daniel, AI is changing the entire semiconductor architecture.
The way compute is done is being totally disrupted.
Historically, for the last 50 years,
compute has been centered on CPU-based architectures, where you take a general-purpose
programming model, compile it to an instruction set, and then run it across high-performance CPUs.
And the way scaling was done was through transistor scaling, of course, but also through multi-core scaling.
But the core concept was a general-purpose programming model.
As you know, over the last three or four years, AI has really taken center stage, and a large number of application workloads are being realized through LLMs. And this trend is only accelerating as LLMs become more powerful.
It's very clear that as compute gets disrupted, AI will take center stage and LLMs will run practically most of the workloads.
And traditional CPU-based compute will become more of a control processor, which will manage all the LLMs and all the AI agents that will be running.
So as this transformation happens, the semiconductor industry has to transform as well, because this new type of compute architecture requires a complete transformation of how chips are designed and how systems are constructed.
As you know, today GPUs have taken center stage: they have a very high degree of parallelism and can therefore run these LLMs at significantly higher speeds and performance levels. But as we make more progress, I think the architecture is going to shift further and become highly heterogeneous, where different types of compute will be used for different types of workloads. For training, you may need one specialized architecture; for inference, you may need a different type of architecture. And when you build a system, you need to mix and match the best type of compute for the various types of workloads.
This transformation is already happening right now. And as we change the entire compute stack and semiconductor architecture, one of the critical bottlenecks is data movement in this architecture.
As you scale the system, as you build more parallel systems, as you grow your LLM models, you have a lot of data in high-performance memories and a lot of parallel compute capability. But how you move data efficiently between memory and compute is where the bottleneck is going to be.
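The memory-versus-compute tradeoff Sailesh describes is often reasoned about with a roofline-style calculation: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) is below the hardware's FLOPs-to-bandwidth ratio. A minimal sketch of that idea; all hardware numbers below are illustrative placeholders, not figures from the episode:

```python
# Roofline-style check: is a kernel limited by compute or by data movement?
# Attainable throughput = min(peak compute, memory bandwidth * intensity).

def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """FLOP/s achievable given peak compute, bytes/s, and FLOPs-per-byte."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

peak = 100e12  # 100 TFLOP/s of compute (illustrative)
bw = 2e12      # 2 TB/s of memory bandwidth (illustrative)

# A kernel doing 4 FLOPs per byte moved is memory-bound on this machine:
low = attainable_flops(peak, bw, 4.0)     # 8 TFLOP/s, far below peak
# A kernel doing 200 FLOPs per byte is compute-bound:
high = attainable_flops(peak, bw, 200.0)  # hits the 100 TFLOP/s peak

print(low, high)
```

The gap between the two cases is exactly why data movement, not raw compute, becomes the scaling bottleneck for large parallel systems.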
And that is actually one of NVIDIA's big strengths.
If you look at the NVIDIA architecture, I think their biggest strength is really twofold.
One is, of course, the CUDA monopoly they have on the software side. But the other advantage is that they have built an amazing system that can scale to a very large number of nodes while maintaining high performance and low latency.
So they have really mastered data movement, and that's their biggest strength.
So as other people catch up and start to build high-performance systems that can exceed NVIDIA's performance levels, they need to solve the data movement problem as well.
So I believe these are some of the big problems.
And, you know, as we solve these problems, there are new technologies coming in to help.
Chiplets are one way to scale the number of transistors you can pack into a single package, so chiplets are a vehicle to scale the system, which is definitely going to help with data movement. There are going to be challenges around packaging as we build these high-performance systems. 3D is another opportunity: when you're building these high-performance, highly scalable systems, you have to think about 3D stacking of transistors, because that gives you a new dimension of scaling. So these are some of the opportunities and challenges I think we will see as we start to transform computing.
And what are the biggest developments you see happening this year?
So this year, I believe most of the innovation is focused on AI and chiplets, and a lot of innovation will continue to be focused there.
And as we scale these systems, there are going to be innovations around how you build packages with a very large number of chiplets and how you connect those chiplets efficiently.
3D is another area where I think there is going to be
quite a bit of innovation and disruption.
At Baya, we believe that as we scale these systems,
the data movement will still be the most central problem.
And I do expect a lot of innovation
in the data movement architectures as well.
As you know, new standards are coming in to assist with building high-performance, scalable data movement.
For example, NVIDIA has historically used NVLink, which is a custom, proprietary protocol.
But now more open protocols are coming in.
UALink has become somewhat of a standard for the scale-up architectures people are using.
Ultra Ethernet is another one, for scale-out architectures.
So I do see a lot of innovation around these types of data movement solutions using standard protocols like UALink and Ultra Ethernet.
So that's another opportunity that I see where there's going to be a lot of innovation. And we are really focused on using all these new standards,
taking on these opportunities and challenges,
and building the best data movement solution
as we are designing these new types of architectures.
I agree completely.
So I forgot to ask, but can you give us a brief overview of Baya Systems
and how you fit in the industry?
Yeah, so as I said earlier, at Baya Systems our mission is to solve the data movement problem in the industry,
because we truly believe that as the systems are scaling, data movement is one of the central problems.
So when we started, that was our mission.
And then we kind of looked at what's the best way
to solve this problem.
So we obviously leaned on the prior experience
and knowledge that we had.
And we are building a platform that allows you to describe the data movement requirements at a high level of abstraction, at the system level. You can specify all the agents you have in the system, you can specify their interfaces, and you can specify what kind of bandwidth and latency they need to operate efficiently. Once you describe that system-level specification, we have built an advanced software stack that can take all of those requirements and translate them into low-level constructs, microarchitecture constructs.
So you can basically start to design the system at a much higher level of abstraction. And then an automated platform translates that abstraction into a low-level design that satisfies all of your performance requirements while making sure that the system is correct. Because it's an automated process, we have added formal methods, various graph algorithms, and various correctness methods to make sure that whatever results we produce are always correct by construction: the design satisfies the requirements and doesn't have any deadlocks or livelocks. So we do all of that. And if the requirements can't be met within the constraints you have provided, then we guide the user to refine the requirements so we can realize a solution that can be built. So that's the software stack we have built.
And the product we create from the software stack today is the data movement IP. You specify your requirements and protocols, and then we build the data movement solution and productize it as an IP produced from our software stack. Our IP can support non-coherent flows, cache-coherent flows, and high-performance data movement flows. So all protocols, different types of flows, all of those are supported.
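The flow described above (specify agents, interfaces, and bandwidth/latency needs, and have the tool flag requirements that cannot be met) can be pictured with a toy feasibility check. The spec schema below is invented for illustration and is not Baya's actual input format:

```python
# Toy system-level spec check: does each shared link have enough capacity
# for the agents routed over it? All names and numbers are illustrative.

spec = {
    "links": {"noc0": 512.0},  # link capacity in GB/s (made up)
    "agents": {
        "cpu": {"link": "noc0", "bw": 128.0},
        "gpu": {"link": "noc0", "bw": 256.0},
        "dma": {"link": "noc0", "bw": 64.0},
    },
}

def check_bandwidth(spec):
    """Return a list of (link, demand, capacity) for over-subscribed links."""
    demand = {}
    for agent in spec["agents"].values():
        demand[agent["link"]] = demand.get(agent["link"], 0.0) + agent["bw"]
    violations = []
    for link, cap in spec["links"].items():
        if demand.get(link, 0.0) > cap:
            violations.append((link, demand[link], cap))
    return violations

# 128 + 256 + 64 = 448 GB/s of demand fits in 512 GB/s of capacity:
print(check_bandwidth(spec))  # []
```

A real tool would of course reason about topology, routing, and latency as well; the point is that an infeasible specification can be detected and reported back to the user before any low-level design is generated.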
And there are a lot of details in the IP that allow it to be physically aware, be PD-friendly, and support different types of design methodologies.
So all of those are baked into the design flow and into the IP.
And those are somewhat table stakes nowadays anyway.
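One concrete example of the "correct by construction" graph checks mentioned above: a standard way to prove an interconnect deadlock-free is to build a channel dependency graph (an edge means "a packet holding channel A may wait for channel B") and verify it is acyclic. A toy sketch of that idea; the graphs and names are hypothetical, not Baya's actual implementation:

```python
# Toy deadlock check: a routing scheme is deadlock-free if its channel
# dependency graph contains no cycle. Graphs below are illustrative only.

def has_cycle(graph):
    """Iterative DFS cycle detection over a dict-of-lists adjacency graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for nxt in neighbors:
                if color[nxt] == GRAY:   # back edge: cycle, so possible deadlock
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:             # all neighbors done, retire this node
                color[node] = BLACK
                stack.pop()
    return False

# A dimension-ordered (e.g. XY) routing restriction yields an acyclic graph:
acyclic = {"c0": ["c1"], "c1": ["c2"], "c2": []}
# An unrestricted ring of channel waits can deadlock:
cyclic = {"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]}

print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```

Production flows combine checks like this with formal methods and performance analysis, but the cycle test captures the core safety argument.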
So that's the solution we have today: we have built a platform that allows you to design the architecture, analyze the architecture, and build the best organization of chiplets and the best topology.
At that abstraction level, you design and optimize the system, which gives you time-to-market advantages and acceleration.
And then from that abstraction, our software stack can go all the way down and generate all the collateral required to design the actual chiplets and packages.
So that's the high-level definition of the product we have built. I think we were talking about the Baya name earlier. The baya is a weaver bird, and our products are inspired by that. So our software is called WeaverPro and our IPs are called WeaveIPs.
And our IP can really scale from a single legacy SoC fabric all the way to an extremely large-scale multi-package network that we call NeuraScale IP.
So we have different tiers of products, and all of these are realized through the same software flow.
So what is your take on UALink and NVLink?
You mentioned them earlier, and other standards.
And more importantly, do you think they'll continue to be adopted?
Yeah, that's a great question.
So protocols have a very interesting history in the compute world.
So today, if you look at the protocol and interface standards, practically most of the IP ecosystem talks ARM standards. ARM has a non-coherent standard called AXI and a coherent standard called CHI, and most of the IP ecosystem talks those standards.
When you go to external interfaces, that's where a lot of change has happened over the years. When Intel was dominant in the data center, there were protocols like UPI and QPI, and those were the scale-up protocols in use. Intel more recently came up with the CXL protocol, which is a combination of three protocols: CXL.cache for coherent flows, CXL.mem for non-coherent memory flows, and CXL.io for PCIe flows. That had some traction a few years ago. And with all the Intel struggles, recently AMD, Google, and a few other large companies came up with new protocol standards called UALink and Ultra Ethernet, new protocols for scale-up and scale-out systems.
So these are protocols that I think have a lot of backing right now, and that's where a lot of new design starts are happening.
In parallel with these open standards, NVIDIA has its own proprietary standard called NVLink.
NVLink is, again, a suite of standards; non-coherent, coherent, all of that is supported.
And it's a proprietary standard that NVIDIA uses.
So that's where we are today: for external interfaces, NVIDIA has a custom standard that is obviously used in NVIDIA systems, and for the rest of the industry, UALink and Ultra Ethernet are new standards that are seeing wide adoption. For on-chip interfaces, ARM remains the most dominant, coherent and non-coherent, AXI and CHI. And for chiplet interfaces, UCIe has become somewhat of a standard today. UCIe basically defines the physical and link layers, and you can run any protocol on top of it.
So you can translate and transport AXI or CHI or even UALink across UCIe.
So that's the protocol landscape today.
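The layering just described, where UCIe provides the physical and link layers and AXI, CHI, or UALink rides on top, can be pictured as plain encapsulation: the upper protocol's transaction is packed into fixed-size link-layer flits and reassembled on the far side. A deliberately simplified sketch; the field names and the 8-byte flit size are made up for illustration and are not from the UCIe specification:

```python
# Toy encapsulation: carry an upper-layer transaction (e.g. an AXI-style
# read request) over a flit-based link layer. Sizes/fields are illustrative.

from dataclasses import dataclass

FLIT_PAYLOAD_BYTES = 8  # made-up flit payload size

@dataclass
class Transaction:
    protocol: str   # e.g. "AXI", "CHI", "UALink"
    payload: bytes

def to_flits(txn: Transaction) -> list[bytes]:
    """Split a transaction payload into fixed-size flits (last one padded)."""
    data = txn.payload
    flits = []
    for i in range(0, len(data), FLIT_PAYLOAD_BYTES):
        chunk = data[i:i + FLIT_PAYLOAD_BYTES]
        flits.append(chunk.ljust(FLIT_PAYLOAD_BYTES, b"\x00"))
    return flits

def from_flits(flits: list[bytes], length: int, protocol: str) -> Transaction:
    """Reassemble the original transaction from its flits."""
    return Transaction(protocol, b"".join(flits)[:length])

req = Transaction("AXI", b"READ 0x1000 len=64")
flits = to_flits(req)
rebuilt = from_flits(flits, len(req.payload), req.protocol)
assert rebuilt == req  # the link layer is agnostic to the upper protocol
```

The key design point this illustrates is that the link layer never interprets the payload, which is why one chiplet interface can carry multiple upper protocols.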
And I think protocols are very important because they allow the ecosystem to be built: an ecosystem of IPs, an ecosystem of chiplets, an ecosystem of verification IPs. So the fact that all these protocols have become somewhat of a standard today is definitely helping us a lot in building our product. We support every protocol that we talked about, all of these except NVLink, which is a proprietary protocol.
Right. Yeah, I think standards are going to be even more important with chiplets, right? I want to ask you about DeepSeek. It's all in the news. It looks like they're taking a less data-heavy, lower-cost approach to LLMs. What can the semiconductor industry do to facilitate the development of more models like that, given the current costs of chips, you know, GPUs like you mentioned, NVIDIA?
Yeah, yeah, definitely.
I mean, DeepSeek has really created quite a stir in the industry.
So I believe it's definitely great progress in improving the efficiency of AI compute and LLMs.
They have shown that you can get the same level of fidelity and quality with 100x fewer resources, which is amazing. Having said that, I believe there is always a need for the most capable AI. We are way behind, for example, the AGI-level capabilities that humans have; AI is still far behind that. So there's always a need to expand the capabilities of AI, which basically means increasing the model size, the model accuracy, and so on. So the fact that we can run the current models with 100x fewer resources is actually amazing, because it gives you an opportunity to build 100x bigger models with the same hardware we have today.
So I believe there was quite a bit of a stir recently, but very quickly people will realize, oh, now I can build 100x bigger models.
So I don't think it is going to change the trajectory.
We are still on the same trajectory: building bigger systems, more scale-up, more scale-out, more compute, more data, more memory. There's always a need for that type of application.
At the same time, there's going to be, I think, growth in edge compute and edge AI, where we need more efficiency and we don't need unlimited, infinite intelligence; we need very focused intelligence. That's where this efficiency will have the most impact, because now you can have intelligence practically everywhere: intelligence in your equipment, intelligence in your car, intelligence in every edge device. So making AI 100x more efficient is amazing for that entire segment.
So I believe this is only going to accelerate AI, because once you make AI 100x more efficient, it's going to accelerate the edge, but also accelerate core data center AI, because we are way behind where we ideally need to be to get to AGI. So I look at this as a great opportunity.
And to be honest, I was quite surprised that when this news came out, the stock market plummeted.
I was surprised because, as you know, as compute becomes more efficient, compute only grows; it doesn't decline.
So that's my view on this whole change.
Yeah, I agree completely. I think it was interesting that they were so open about this and open source, but I think there's a lot more investigation we're going to need to do. You know, the final question is: I was looking at your website, and you have an impressive set of investors and board members. I mean, Jim Keller is on your board. Can you talk a little bit about the importance of relationships as an entrepreneur in the semiconductor industry?
Yeah, I think it's super important. As a founder, as a startup guy, I think in this whole Silicon Valley there's obviously a lot of innovation, a lot of talent, all of that, but there's also a strong network effect. The fact that engineers, innovators, investors, board members, and customers are all connected is really what keeps this machine going.
So it's absolutely critical, I think.
That's the reason it's very difficult to recreate or reproduce Silicon Valley: the network effect we have here. Jim is absolutely an amazing guy. I
first met Jim back at NetSpeed when he was still at Tesla, and he chose to use the NetSpeed fabric in the Tesla Autopilot silicon. And he was so pleased with that decision that when he went to Intel, he immediately put in an offer to acquire NetSpeed. So that's how we ended up there at Intel. I worked very closely with Jim for about four years, and we really got along; I really liked the way he thinks. Jim had a lot of ideas, by the way, to save or fix Intel. Unfortunately, he did not get to execute a lot of those things. But that's where I saw the depth and insight that Jim has. And that's the reason that the moment Jim called and asked me to start this company, I literally jumped that day and started to collect the right set of people and started the whole journey.
So absolutely, the kind of people you have in your network is super important for success. Stan is also a key person in this company; he was the first investor, and he's a GP at Matrix.
So he's absolutely amazing. And as you probably heard, as part of Series B,
we are expanding our board. We are bringing in a new set of investors. And these are not just
investors. These are the people who are very seasoned execs who understand how the system
works. And they are going to help us and guide us in growing the company.
That's great advice. Thank you for your time, Sailesh, and I look forward to talking to you again soon.
Yeah, absolutely. Pleasure talking to you. Thanks, Daniel.
That concludes our podcast. Thank you all for listening and have a great day.