In The Arena by TechArena - Connectivity for the AI Era with Alphawave Semi’s Tony Chan Carusone

Episode Date: October 17, 2023

TechArena host Allyson Klein chats with Alphawave Semi's CTO Tony Chan Carusone regarding the unique opportunity for connectivity innovation to fuel the era of AI and why Alphawave is perfectly poised for IP, chiplet and custom solution delivery.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein. I'm delighted to be joined by Tony Chan Carusone. He's been a researcher in high-performance connectivity and integrated circuit design for the last two decades, so he's pretty knowledgeable in this space. He's been a professor at the University of Toronto, and right now he's the CTO of Alphawave Semi.
Starting point is 00:00:43 Congratulations on that, and welcome to the program, Tony. Thanks. Pleasure to be here. So Tony, Alphawave has been on the program before. We chatted at MemCon earlier this year, and the Tech Arena audience knows a bit about the company. But why don't we just start with a foundation for those who didn't listen to that episode:
Starting point is 00:01:02 on an introduction to AlphaWave and what your role is in the industry. And then you have an incredible wealth of experience in technology. What does it mean to be the CTO of AlphaWave, Samoy? Thanks. Yeah, I mean, being a CTO at AlphaWave is really the culmination and it's about a high point and a long journey for me through the industry that began at the University of Toronto over 20 years ago, where I went through school actually with a lot of the founders of AlphaWave. And then I became a faculty member at the University of Toronto. And for 20 years, I've done R&D on optical connectivity and integrated circuits for high
Starting point is 00:01:38 speed data communication. And during that time, I worked with many companies, large and small startups and some of the biggest semiconductor companies in the world, working on solutions in this space, building up wealth expertise and my own ideas about the industry and challenges it's facing. And then a couple of years ago, around the time that AlphaWave IPO'd, I got back together with my old friends and heard about their exciting plans for turning Offwave into a vertically integrated solutions provider for connectivity. And it was a really exciting opportunity. And so since then, I've been focused in particular on working on strategic technology areas for the company.
Starting point is 00:02:20 That includes AI hardware and connectivity solutions, including optical connectivity. And that's apropos of where we are today, which is the AI Hardware & Edge AI Summit. There's so much interest in AI hardware right now. I was thinking earlier today, it's been less than a year since the moment we first heard the term ChatGPT and saw what that core capability of generative AI could bring. I think it brought AI even more to the forefront of the industry's pursuits, and that's really put a lot of pressure on the semiconductor industry. Tell me how you see that landscape today and why connectivity is so important as we look at that. Yeah, I mean, obviously, it's true.
Starting point is 00:03:08 It's hard to believe it's been less than a year since ChatGPT sort of entered the lexicon. I mean, it captured so many people's imaginations, mine included, I have to admit. It's just amazing to be able to play with that kind of technology. And if you think about it, the progress there feels like it has been so rapid. The reason that's the case is that a lot of the ideas and foundational research were already in place; then, finally, hardware systems based on silicon became possible that could process these massive data sets, and that's really what's made the capability just moonshot.
Starting point is 00:03:46 So you could say that progress, in a sense, has ridden on the back of CMOS technology scaling following Moore's Law for a long time, which is exponential in its own way. Now, of course, there's also been progress on software as well as new hardware architectures. You know, there's been this transition from the use of general-purpose CPUs to GPUs, and now to dedicated hardware accelerators for AI. With all this progress on the hardware side, we're just now finally able to see the amazing things it can do. Most recently, you know, connectivity has become the bottleneck, right? Now that there's this massive compute capability available in the silicon, it's really feeding it enough data
Starting point is 00:04:30 that's limiting further progress in AI. So the same way we saw AI-specific compute architectures arise over time, now we're starting to see AI-specific networking architectures be developed to address that massive demand. That's why
Starting point is 00:04:45 it's such a strategic technology area for Alphawave. Now, I want to get into that a little bit more. When we look at AI supercomputers, the large supercomputers that cloud service providers are building to train AI algorithms, they're borrowing a lot of their design from high-performance computing and high-performance compute clusters. You know, when we think about that, we think about Ethernet and InfiniBand. Obviously, NVIDIA purchased Mellanox to drive InfiniBand as a key technology in this space. Alphawave obviously has an IP portfolio here, and you're an optical expert. So I wanted to ask you: what characterizes an AI connectivity solution, and what are the key things that people are looking for there?
Starting point is 00:05:34 Yeah, so AI does have some unique connectivity requirements. What's interesting is that for a long time, the connectivity technologies used for AI were piggybacking on the significant amount of R&D going on to support networking infrastructure for data centers, which has progressed on a pretty rapid cadence over the last decade or so, with data rates doubling every two to three years. And AI was just riding off that. But what we're seeing is that the demand for connectivity in AI is increasing even faster than that now. So AI is overtaking data center networking as the main driver behind the development of new connectivity technologies. We're going to see AI be the leader in terms of driving investment in new connectivity
Starting point is 00:06:23 technologies, and everything else will have to piggyback on it. And the new features of connectivity required by AI start with an emphasis on low latency. That goes together with the fact that the types of mathematical operations and computations required for AI can be massively parallelized. So we want to spread the training jobs over hundreds, thousands, even tens of thousands of processors and accelerators in parallel, and have them all be able to fetch memory from each other at really low latency. You get this combination of massive bandwidth requirements, really low latency, and longer physical distances, because you've just got thousands of these things. That's driving everything from new types of error-control coding at the physical layer all the way up to different networking topologies: flatter network architectures with higher-radix switches, so that data can get from one processor to another through fewer hops, and therefore with lower latency. It's driving innovations at all layers of the stack.
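To make that flatter-topology point concrete, here's a minimal sketch of how switch radix trades against fabric tiers and worst-case hop count in a folded-Clos (fat-tree) network. The 32,000-endpoint cluster size and the radix values are illustrative assumptions, not figures from the episode:

```python
def tiers_needed(endpoints: int, radix: int) -> int:
    """Tiers of a folded-Clos (fat-tree) fabric needed to connect
    `endpoints` hosts using switches with `radix` ports. A t-tier
    fat-tree scales to roughly radix * (radix/2)**(t-1) hosts."""
    tiers = 1
    while radix * (radix // 2) ** (tiers - 1) < endpoints:
        tiers += 1
    return tiers

def worst_case_switch_hops(tiers: int) -> int:
    # Worst-case path climbs through every tier and back down:
    # 2*tiers - 1 switch traversals.
    return 2 * tiers - 1

# Hypothetical 32,000-accelerator cluster with three candidate radices.
for radix in (32, 64, 256):
    t = tiers_needed(32_000, radix)
    print(f"radix {radix:>3}: {t} tiers, worst case {worst_case_switch_hops(t)} switch hops")
```

Raising the radix from 32 to 256 flattens this hypothetical fabric from four tiers to two, and each tier removed cuts two switch traversals, and their queuing delay, off the worst-case path.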
Starting point is 00:07:38 When you describe that, you know, from my experience, it makes me think of high-performance compute clusters and the parallelism that is required there. For those who work in general data center computing, how would you characterize the difference between data center connectivity for, you know, traditional load-store applications or bursty web applications, and what you're talking about here for AI? Yeah, when you think about training some of the largest neural networks, we're talking about things like these large language models; the number of parameters there is so large,
Starting point is 00:08:19 and the amount of data that has to be processed is so great, that the training job can take months. Over that time, you've got extended periods of just continuous, sort of all-to-all data flow across this massive network of thousands or tens of thousands of processors in parallel, requiring sustained throughput over time. And because there's so much investment in those processors, each unit may cost tens of thousands of dollars, having them sit around
Starting point is 00:08:52 waiting for data to show up just represents a massive cost, a massive investment that's not being efficiently used. You know, Meta came out with a study, I think it was at OCP last year, where they looked at their own internal hardware and showed that for some workloads, 20, 30, 40, even 50% of the time,
Starting point is 00:09:12 the hardware is just sitting there waiting for the networking to do its thing. So when you're talking about the scale of those jobs and the investment in the hardware to run them, there's so much expensive equipment sitting idle that it really justifies massive investment on the connectivity side to prevent it from becoming a bottleneck. Another technology that's talked about a lot in this space is chiplets, and this is an area where Alphawave is a huge player. How can chiplets enable AI, and what are your plans in this direction? There's a number of ways that chiplets are a really key enabler for future AI computing. First of all, for AI compute, we want to pack as many of these cores as possible onto a die. But there are a couple of limitations there. One is just the reticle limit
Starting point is 00:10:06 of CMOS fabrication that limits the maximum monolithic chip size we can make, and having multiple compute tiles interconnected together in a package just lets us go beyond that reticle limit. There's also a practical limit that arises even at smaller die sizes, especially in the most advanced technology nodes, which is just due to yield. It turns out that we can improve yield by taking one of these large dies and chopping it up into four or more smaller dies. So that's really an economics argument: we can build these large processors at lower cost.
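For a rough sense of that yield argument, here's a quick sketch using a first-order Poisson yield model; the die areas and defect density are assumed, illustrative numbers, not foundry data:

```python
import math

def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: P(zero defects on a die) =
    exp(-area * defect_density). A common first-order
    approximation, not any particular foundry's model."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

D0 = 0.1  # assumed defects per cm^2 for an advanced node (illustrative)

y_mono = die_yield(800.0, D0)  # one near-reticle-limit monolithic die
y_chip = die_yield(200.0, D0)  # one of four chiplets covering the same area

print(f"800 mm^2 monolithic die yield: {y_mono:.0%}")  # ~45%
print(f"200 mm^2 chiplet yield:        {y_chip:.0%}")  # ~82%
```

Even before packaging and known-good-die testing costs enter the picture, the smaller dies waste far less silicon per wafer, which is the economics case for chopping one large processor into chiplets.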
Starting point is 00:10:50 of local memory, we're motivated to implement them in the most advanced CMOS technology nodes where any lost yield is a real added cost. So again, extra motivation to make use of a chiplet design paradigm. And then there's other reasons, too, by reusing pre-designed, pre-validated or in the extreme case, off the shelf chiplets. You're lowering time to market. You're lowering design risk. You're allowing yourself to make custom variants of systems and package by mixing and matching these pre-validated chiplets quicker and again with
Starting point is 00:11:26 lower risk lower cost so um you put all these things together it's chiplets really an enabling enabling technology for the future ai and that's why we're we've invested heavily in this area you know we're we're developing io chiplets to perform the connectivity for these systems in package, allowing them to provide their Ethernet, PCI Express, CXL connections in and out of the package, compute accelerate tiles, and memory as well, providing memory interfaces like HBM for in-package memory. You know, I think a lot of people have heard about chiplets from what the big guys have done, Intel, AMD, both building their solutions and chiplet architectures. But we've got kind of a new era of chiplets coming,
Starting point is 00:12:16 and that means a huge opportunity for Alphawave and companies like yours. Can you talk a little bit about the change in chiplets and why there's going to be some open innovation in this space? Yeah, we're very bullish on this, really excited by what we've seen in new developments in the last year or so. We see an ecosystem for chiplets developing really rapidly now. About a year and a half ago, we saw the introduction of the first standard for die-to-die interconnect that was embraced by almost the whole industry, the UCIe standard. We've been a big proponent of that, and we're active in getting
Starting point is 00:12:55 the details of that standard defined. There's still some work to do, but there's just so much momentum behind it now. Having a standard for die-to-die interfaces is what's going to allow multiple players to come to market with chiplets, and you can be confident that these will play nicely with each other, talking to each other right out of the box. We also have every indication that chiplets are going to be a key area of investment under the US CHIPS Act, specifically to try to foster this ecosystem and create an environment
Starting point is 00:13:26 where this technology can proliferate out to a wider set of companies, beyond the big companies that you're hearing about now. But it's going to take a lot of work by the whole industry to make this happen. We're talking about new EDA solutions, and more capacity for advanced packaging, both to prototype these types of systems and to take them all the way to mass production. And finally, another key ingredient is people who go out and develop a library of pre-validated chiplet designs. That's an area where we're investing, just to help enable and seed this ecosystem. Now, what you're describing really opens up an opportunity for custom solutions at the
Starting point is 00:14:08 semiconductor level. And I think that there's been a history of some of the largest companies on the planet doing their own custom solutions, either designing them themselves or working with a silicon provider to deliver a custom solution. Do you think that AI is a technology that will propel more companies towards custom solutions? And how do you partner with them to deliver that? Whenever there's a large enough market to justify all the R&D investment associated with a custom chip design, you're always going to be able to extract some price, power, or performance benefit from a bespoke design. The issue is that the barrier there is high, right? The cost of developing a new chip, especially in a very advanced CMOS technology,
Starting point is 00:15:10 the mask costs, the cost to validate the design, that makes the barrier really high, and that's what drives this trend towards the use of general-purpose computing. So, for example, GPUs are, and are going to continue to be, a workhorse of AI computation. The ability to use that hardware architecture and one software stack for a wide variety of AI compute is very powerful and carries a lot of advantages. And yet we're seeing these investments, and we're going to continue to see them, from hyperscalers developing bespoke solutions for their AI applications, because they've got specific workloads that they know best. At the same time, you've got things like the chiplet ecosystem developing, which lowers that barrier by providing pre-validated chiplets. It lets you come out with these bespoke solutions with lower R&D costs.
Starting point is 00:15:57 You can mix and match different technology nodes, use some previous generation CMOS technologies for parts of the system that don't require the most advanced nodes that lowers mass costs as well. So you're certainly going to see a situation here where there's the rising tide of AI just raises all boats, both the general purpose type architectures as well as the custom silicon ones. Now, what you've described really paints a compelling picture for a broader industry innovation around silicon design. Somebody might think, well, you know, why don't we just keep using GPUs that, you know, everybody knows CUDA. We know how to program to it. Why go
Starting point is 00:16:41 through all this trouble? So why do you think everybody is in Santa Clara today talking about different hardware solutions? Why is this such a focus, do you think? Again, I think there's just so much focus on AI. There's so much demand for solutions that get the training done quicker, and so much money being spent on those jobs, that everyone's looking for solutions to drive down those costs. And when there's that much investment there, then, again, it's going to justify some investment in specific custom solutions to extract some price, performance, or power savings. So I think you're absolutely going to see a combination.
Starting point is 00:17:25 Again, a GPU with a standard software stack that everyone's familiar with may be useful, will always be useful, I think, as a playground for initial development of new innovations in the space. But once you've got a need for running a ton of training on a specific architecture and a specific software workload, there's just going to be too much incentive to develop a custom architecture not to take it. Another consideration that comes up in this space is energy. Can you talk a little bit about how chiplet architectures could be an opportunity to look for more efficient solutions? Yeah, I mean, energy costs are, you know, an operating expense. It's a significant part of the total cost of ownership of these hardware systems. So there's also tremendous motivation to try to get power down and squeeze every last milliwatt out of these systems. One way we can do that is by integrating all the required compute more tightly. So, taking compute that previously was located
Starting point is 00:18:38 on two different boards, or on two different chips on a board, and putting it all in one package. Now, when information flows back and forth between them, instead of traversing a printed circuit board trace or, even worse, a cable, you're just going less than a millimeter between die edges in the same package. And that can be done for a small fraction of a picojoule per bit. So that's one way, an important way, that properly engineered, chiplet-based AI compute can help extract power savings here.
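To put that picojoule-per-bit point in perspective, here's a back-of-envelope comparison; the energy-per-bit figures are assumed orders of magnitude for each link type, not measured or vendor numbers:

```python
def link_power_watts(bandwidth_tbps: float, pj_per_bit: float) -> float:
    # Power = bits/s * joules/bit; the 10^12 in Tb/s cancels the
    # 10^-12 in pJ/bit, so the product is already in watts.
    return bandwidth_tbps * pj_per_bit

# 10 Tb/s of aggregate traffic between two compute tiles (hypothetical).
for label, pj in [("in-package die-to-die", 0.5),
                  ("PCB trace (SerDes)", 5.0),
                  ("cabled / retimed link", 10.0)]:
    print(f"{label:>22}: {pj:4.1f} pJ/bit -> {link_power_watts(10.0, pj):5.1f} W")
```

At tens of terabits per second of tile-to-tile traffic, the roughly order-of-magnitude gap between crossing a package and crossing a board turns directly into tens of watts saved per link.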
Starting point is 00:19:15 So we've been talking about data centers, talking about compute done in high concentration at the nexus of what's going on in AI, but a lot of people are talking about edge AI too. Tell me about edge AI. How do you define it, and what's Alphawave's strategy there? Edge AI is this funny term, because I think it means different things
Starting point is 00:19:36 to different people. I think the commonly understood meaning of edge AI is some basic inference tasks performed by the phone in your hand or by some voice recognition device somewhere in your home. That's the very, very edge, if you like, the very endpoints of our networks. So that's a kind of basic edge AI. From an Alphawave perspective, when I think about edge AI,
Starting point is 00:19:59 what I think about is this trend towards rolling out more and more infrastructure compute, let's call it cloud compute, whatever you want to call it, in regional sites: not the massive, biggest hyperscale data centers, but regional sites that are sprinkled more liberally around countries around the world, so that they're closer to the end users. So when you're performing some AI tasks, you're not always having to go all the way back to those massive hyperscale data centers; some basic tasks can be performed with lower latency, again, and more responsiveness. That's an important trend, because essentially it means that you've
Starting point is 00:20:45 got, instead of just a few big hyperscale data centers, almost like distributed virtual data centers all over the place that have to be interconnected with tremendous aggregate bandwidth, and that's leading to a lot of refinement of new optical connectivity technologies. So, for example, coherent optical connectivity, which was originally developed and rolled out for very long-haul communication, like trans-oceanic-class links, has been refined so that it can be used for this kind of application, connecting within a campus or maybe to a regional data center,
Starting point is 00:21:23 tens or maybe low hundreds of kilometers away. So that's an interesting development that's caused a massive increase in the number of coherent optical links, and that's an area that I believe is really strategic. Long term, I see that kind of technology proliferating even into the data centers and seeing higher and higher volumes over time. You know, I think that one of the things that I've learned is that we have an incredible opportunity for silicon innovation at this moment. I've spent over 20 years in the silicon
Starting point is 00:21:58 arena, and I've never seen a moment like this in terms of opportunity for the industry to come together and innovate. When you think about Alphawave, and you just talked about why you joined the company a few minutes ago, why do you think that you've got the right formula for growth? What do you think sets you apart? What I'm really excited about is that we've been able to create a vertically integrated semiconductor company where we provide industry-leading connectivity silicon IP solutions for our customers, whether it's for Ethernet, PCI Express, CXL, some of these data interfaces.
Starting point is 00:22:35 We also work with customers to provide custom silicon solutions that can leverage that industry-leading connectivity IP. We can go right from the spec all the way to silicon products. That can be in the form of a fully packaged chip, taking advantage of advanced 2.5D and 3D packaging, or a chiplet, right? Just developing a custom chiplet to enable, as I said, this broader ecosystem,
Starting point is 00:22:57 more and more players participating there. And it goes all the way to our own standard products that can serve connectivity demands over both optical and electrical links inside data centers. It's the ability to leverage our industry-leading connectivity solutions across that full spectrum that gives us the flexibility to work with some of the biggest players in AI and meet them wherever they're at: helping them solve problems in whatever way makes the most sense, whether it's, again, IP licensing, standard products, custom silicon,
Starting point is 00:23:31 or anything in between. That's what I think is really exciting about this point in time for us. Tony, it's been a pleasure talking to you today. I loved what you said about where you guys are focused, and I can't wait to hear more. Where can folks go to learn more about Alphawave Semi and what you're delivering to the market, and to engage with your team? You can absolutely visit our website at awavesemi.com or follow us on LinkedIn. Or you can follow me on LinkedIn as well; I'm always trying to post interesting content about what's going on in AI and in connectivity more generally. Well, thanks so much for being on the program today. It was awesome.
Starting point is 00:24:10 Thanks very much for having me. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
