In The Arena by TechArena - Connectivity for the AI Era with Alphawave Semi’s Tony Chan Carusone
Episode Date: October 17, 2023
TechArena host Allyson Klein chats with Alphawave Semi’s CTO Tony Chan Carusone regarding the unique opportunity for connectivity innovation to fuel the era of AI and why Alphawave is perfectly poised for IP, chiplet and custom solution delivery.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Tech Arena.
My name is Allyson Klein. I'm delighted to be joined by Tony Chan Carusone. He's been a researcher in high-performance connectivity and integrated circuit design for the last two decades, so he's pretty knowledgeable in this space. He's been a professor at the University of Toronto, and right now he's the CTO of AlphaWave Semi.
Congratulations on that, and welcome to the program, Tony.
Thanks.
Pleasure to be here.
So Tony, AlphaWave has been on the program before.
We chatted at MemCon earlier this year.
And the Tech Arena audience knows a bit about the company.
But why don't we just start with a foundation for those who didn't listen to that episode: an introduction to AlphaWave and what your role is in the industry. And then, you have an incredible wealth of experience in technology. What does it mean to be the CTO of AlphaWave Semi?
Thanks. Yeah, I mean, being the CTO at AlphaWave is really the culmination, a high point of a long journey for me through the industry that began at the University of Toronto over 20 years ago,
where I went through school actually with a lot of the founders of AlphaWave.
And then I became a faculty member at the University of Toronto.
And for 20 years, I've done R&D on optical connectivity and integrated circuits for high
speed data communication.
And during that time, I worked with many companies, large and small, from startups to some of the biggest semiconductor companies in the world, working on solutions in this space, building up a wealth of expertise and my own ideas about the industry and the challenges it's facing.
And then a couple of years ago, around the time that AlphaWave IPO'd, I got back together with my old friends and heard about their exciting plans for turning AlphaWave into a vertically integrated solutions provider for connectivity.
And it was a really exciting opportunity.
And so since then, I've been focused in particular on working on strategic technology areas for the company. And that includes AI hardware and connectivity solutions, including optical connectivity. And that's apropos of where we are today, which is the AI Hardware & Edge AI Summit.
And there's so much interest in AI hardware right now. I was thinking earlier today, it's less than a year from the moment that we first heard the term ChatGPT and saw what that core capability of generative AI could bring. I think it brought AI even more into the forefront in terms of the
industry's pursuit. And that's really put a lot of pressure on the semiconductor industry. Tell
me how you see that landscape today and why connectivity is so important as we look at that.
Yeah, I mean, obviously, it's true.
It's hard to believe it's been less than a year since ChatGPT sort of entered the lexicon.
I mean, it captured so many people's imaginations.
I have to admit, mine included.
It's just amazing to be able to play with that kind of technology.
And if you think about it, it feels like the progress there has been so rapid. And the reason that's the case is that a lot of the ideas and foundational research were already in place. But then finally, when hardware systems based on silicon became capable of processing these massive data sets, that's really what made the capability just moonshot.
So you could say that progress, in a sense, is riding on the back of CMOS technology
scaling following Moore's Law for a long time, which is exponential in its own way.
Now, of course, there's also been progress on software as well as new hardware architectures. You know, there's been this transition from the use of general-purpose CPUs to GPUs and now dedicated hardware accelerators for AI. So with all this progress on the hardware side, we're finally able to see the amazing things it can do.
Most recently, you know, connectivity has become the bottleneck, right? Now that there's this massive compute capability available in the silicon, it's really feeding it enough data that's limiting further progress in AI. So the same way we saw AI-specific compute architectures arise over time, now we're starting to see AI-specific networking architectures being developed to address that massive demand. That's why it's such a strategic technology area for AlphaWave.
Now, I want to get into that a little bit more.
When we look at AI supercomputers, the large supercomputers that cloud service providers are building to train AI algorithms, they're borrowing a lot of their design from high-performance computing and high-performance compute clusters.
You know, when we think about that, we think about Ethernet and InfiniBand.
Obviously, NVIDIA purchased Mellanox to drive InfiniBand as a key technology in this space.
AlphaWave obviously has an IP portfolio in here and you're an optical expert. So I wanted to ask you, what characterizes an AI connectivity solution and what are the key things that people are looking for there?
Yeah, so AI does have some unique connectivity requirements. What's interesting is that for a long time, the connectivity technologies used for AI were piggybacking on all the significant R&D that was going on to support networking infrastructure for data centers, which has progressed at a pretty rapid cadence over the last decade or so, with data rates doubling every two to three years. And AI was just riding off that.
But what we're seeing is that the demand
for connectivity in AI is increasing even faster than that now. And so AI is overtaking data center networking as the main driver behind the development of new connectivity technologies. So we're going to see AI now be the leader in terms of driving investment in new connectivity technologies, and everything else will have to piggyback on it.
And the new features of connectivity that are required by AI include an emphasis on low latency. That goes together with the fact that the types of mathematical operations and computations required for AI can be massively parallelized. So we want to spread the training jobs over hundreds, thousands, even tens of thousands of processors and accelerators in parallel, and have them all be able to fetch memory from each other with really low latency. So you get this combination of massive bandwidth requirements, really low latency, and longer physical distances, because you've just got thousands of these things. It's driving everything from new types of error control coding at the physical layer all the way up to different networking topologies, flatter network architectures with higher-radix switches, so that data can get from one processor to another by going through fewer hops and therefore lower latency, right? It's driving innovations at all layers of the stack.
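To make the point about radix, hops, and scale concrete, here is a rough folded-Clos sizing sketch; the switch radices and sizing rules below are textbook approximations, not figures from this conversation.

```python
# Rough folded-Clos (leaf-spine / fat-tree) sizing sketch with illustrative radices.
# A higher-radix switch lets the same number of accelerators live in a flatter
# fabric, so worst-case paths cross fewer switches and see lower latency.

def max_endpoints(radix: int, tiers: int) -> int:
    """Approximate endpoint count for a non-blocking folded-Clos fabric."""
    if tiers == 2:          # leaf-spine: worst-case path traverses 3 switches (2 hops)
        return radix ** 2 // 2
    if tiers == 3:          # 3-tier fat-tree: worst-case path traverses 5 switches (4 hops)
        return radix ** 3 // 4
    raise ValueError("only 2- or 3-tier fabrics are modeled here")

for radix in (32, 64, 128):
    print(f"radix {radix:3d}: {max_endpoints(radix, 2):7,d} endpoints in 2 tiers, "
          f"{max_endpoints(radix, 3):9,d} in 3 tiers")
```

With a radix of 64, for example, a two-tier fabric tops out around two thousand endpoints, so reaching the tens of thousands of accelerators described here means either adding a tier (more hops, more latency) or moving to higher-radix switches.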
When you describe that, you know, from my experience, it makes me think of high performance
compute clusters and the parallelism that is required there.
For those who work in general data center computing, how would you characterize the
difference between that data center connectivity for, you know, traditional load store applications
or, you know, bursty web applications, and what you're talking about here for AI?
Yeah, when you think about training some of the largest neural networks, we're talking about things like these large language models, the number of parameters there is so large, and the amount of data that has to be processed means that the training job can take months. And over that time, you've got extended periods of just continuous data flow, point to point, sort of all-to-all bandwidth flowing across this massive network of thousands, tens of thousands of processors in parallel, requiring that sustained throughput over time. And because there's so much investment in those processors, each unit may cost thousands, even tens of thousands of dollars, having them sit around waiting for data to show up just represents a massive cost, a massive investment that's not being efficiently used.
You know, Meta came out with a study, I think it was at OCP last year, where they looked at their own internal hardware and showed that for some workloads, 20, 30, 40, even 50% of the time, the hardware is just sitting there waiting for the networking to do its thing. And so when you're talking about the scale of the jobs, of the investment in the hardware to run those jobs, there's so much wasted equipment sitting around that it really justifies massive investment on the connectivity side to prevent it from becoming a bottleneck.
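To give a rough sense of why that idle time justifies the connectivity investment, here is a back-of-the-envelope sketch; every figure in it is an assumption for illustration, not a number quoted in the episode or the Meta study.

```python
# Back-of-the-envelope cost of accelerators idling while they wait on the network.
# All figures below are illustrative assumptions.

cluster_size = 10_000              # accelerators in a hypothetical training cluster
cost_per_device = 20_000           # USD per accelerator (assumed)
useful_life_hours = 3 * 365 * 24   # amortize the hardware over roughly 3 years
idle_fraction = 0.30               # e.g. ~30% of time blocked on networking
run_hours = 90 * 24                # one ~3-month training run

hourly_capex = cluster_size * cost_per_device / useful_life_hours
idle_cost = hourly_capex * idle_fraction * run_hours

print(f"amortized cluster cost: ${hourly_capex:,.0f} per hour")
print(f"cost of network-idle time over the run: ${idle_cost:,.0f}")
```

Under those assumptions, a single multi-month run burns millions of dollars of amortized hardware just waiting on data, which is the economic argument for spending heavily on the fabric.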
Another technology that's talked a lot about in this space is chiplets.
And this is an area where AlphaWave is a huge player.
How can chiplets enable AI and what are your plans in this direction?
There's a number of ways that chiplets are a really key enabler for future AI computing. First of all, for AI compute, we want to pack as many of these cores as possible onto a die. But there's a couple of limitations there. One is just the reticle limit of CMOS fabrication that caps the maximum monolithic chip size we can make, and having multiple compute tiles interconnected together in a package just lets us go beyond that reticle limit. Then there's a practical limit that arises at even smaller die sizes, especially in the most advanced technology nodes, which is just due to yields. It turns out that we can improve yield by taking some of these large dies and chopping them up into four or more smaller dies. So that's really an economics argument for saying we can build these large processors at lower cost. And remember that AI processors, especially because they rely on a lot of memory, a lot of local memory, we're motivated to implement them in the most advanced CMOS technology nodes, where any lost yield is a real added cost.
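To see that yield argument numerically, a simple Poisson defect model makes the point; the defect density and die sizes below are assumed purely for illustration.

```python
# Classic yield argument for chiplets using a simple Poisson defect model:
#   yield = exp(-defect_density * die_area)
# The defect density and areas are illustrative assumptions, not foundry data.
import math

DEFECT_DENSITY = 0.1  # defects per cm^2 (assumed, advanced-node ballpark)

def die_yield(area_cm2: float, d0: float = DEFECT_DENSITY) -> float:
    """Fraction of dies with zero random defects under a Poisson model."""
    return math.exp(-d0 * area_cm2)

monolithic_area = 8.0                 # cm^2, a die near the reticle limit
chiplet_area = monolithic_area / 4    # the same silicon split into four chiplets

# Bad chiplets are screened out before packaging, so good-silicon cost tracks
# the per-chiplet yield rather than the much lower monolithic yield.
print(f"monolithic yield:  {die_yield(monolithic_area):.1%}")   # ~45%
print(f"per-chiplet yield: {die_yield(chiplet_area):.1%}")      # ~82%
```

Packaging and known-good-die testing add back some cost, but the gap in usable silicon per wafer is the economics argument described here.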
So again, extra motivation to make use of a chiplet design paradigm. And then there are other reasons, too. By reusing pre-designed, pre-validated, or in the extreme case off-the-shelf chiplets, you're lowering time to market, you're lowering design risk, and you're allowing yourself to make custom variants of systems in package by mixing and matching these pre-validated chiplets more quickly and, again, with lower risk and lower cost. So you put all these things together, and chiplets are really an enabling technology for the future of AI, and that's why we've invested heavily in this area. You know, we're developing I/O chiplets to perform the connectivity for these systems in package, allowing them to provide their Ethernet, PCI Express, and CXL connections in and out of the package, compute accelerator tiles, and memory as well, providing memory interfaces like HBM for
in-package memory. You know, I think a lot of people have heard about chiplets from what the big guys have done, Intel, AMD,
both building their solutions with chiplet architectures.
But we've got kind of a new era of chiplets coming,
and that means a huge opportunity for AlphaWave
and companies like yours.
Can you talk a little bit about the change in chiplets
and why there's going to be some open innovation in this space?
Yeah, we're very bullish on this, really excited by what we've seen in new developments in the last year or so.
We see an ecosystem for chiplets developing really rapidly now. About a year and a half ago, we saw the introduction, for the first time, of a standard for die-to-die interconnect that was embraced by almost the whole industry, the UCIe standard. And we've been a big proponent of that, and we're active in getting the details of that standard defined.
There's still some work to do on that.
There's just so much momentum behind it now.
Having a standard for die-to-die interfaces is what's going to allow multiple players to come to market with chiplets, and you can be confident that these will play nice with each other, talk to each other right out of the box. We also have every indication that
chiplets are going to be a key area of investment in the US CHIPS Act, specifically to try to foster this ecosystem and create an environment where this technology can proliferate out to a wider set of companies beyond the big companies that you're hearing about now. But it's going to take a lot of work by the whole industry to make this happen. We're talking about new EDA solutions, more capacity for advanced packaging in order to be able to perform both prototyping of these types of systems, as well as taking them all the way to mass production. And finally, another key ingredient is people who go out and develop a library of pre-validated chiplet designs. And that's an area where we're investing, just to help enable this and seed this ecosystem.
Now, what you're describing really opens up an opportunity for custom solutions at the
semiconductor level.
And I think that there's been a history of some of the largest companies on the planet
doing their own custom solutions, either designing them themselves or working with a silicon
provider to deliver a custom solution. Do you think that AI is a technology that will
propel more companies towards the custom solution? And how do you partner with them
to deliver that? Whenever there's a large enough market to justify all the R&D investment associated with a custom chip design, you're always going to be able
to extract some price, power, performance benefit from a bespoke design. So the issue is that the
barrier there is high, right? The cost of the development of a new chip, especially in a very advanced CMOS technology,
the mask costs, the cost to validate the design, that makes the barrier really high. And that's what drives this trend towards the use of general-purpose computing.
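A minimal break-even sketch shows why that barrier matters; the NRE and per-unit savings figures below are assumptions for illustration only.

```python
# Illustrative break-even volume for a bespoke chip versus general-purpose parts.
# Both figures are assumptions; advanced-node NRE (masks, design, validation) is
# commonly discussed in the tens to hundreds of millions of dollars.

nre_cost = 150e6          # assumed total NRE for an advanced-node custom chip, USD
saving_per_unit = 5_000   # assumed TCO saving per deployed unit vs. general-purpose, USD

break_even_units = nre_cost / saving_per_unit
print(f"break-even volume: {break_even_units:,.0f} units")  # 30,000 units
```

Only buyers deploying at that kind of scale clear the barrier, which is why the bespoke route has mostly been limited to the largest players; lowering NRE, for example through chiplet reuse, pulls that break-even volume down.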
And so, for example, GPUs are and are going to continue to be a workhorse of AI computations.
The ability to use that hardware architecture and one software stack for a wide variety of AI compute is very powerful and carries a lot of advantages. And yet we're seeing these investments, and we're going to continue to see them, from hyperscalers to develop bespoke solutions for their AI applications because they've got specific
workloads that they know best. At the same time, you've got things like the chiplet ecosystem developing
that lowers that barrier by having pre-validated chiplets.
It lets you come out with these bespoke solutions with lower R&D costs.
You can mix and match different technology nodes,
use some previous generation CMOS technologies for parts of the system
that don't require the
most advanced nodes, and that lowers mask costs as well. So you're certainly going to see a situation
here where the rising tide of AI just raises all boats, both the general-purpose type
architectures as well as the custom silicon ones. Now, what you've described really paints a compelling picture for a broader
industry innovation around silicon design. Somebody might think, well, you know, why don't we just keep using GPUs? Everybody knows CUDA. We know how to program to it. Why go
through all this trouble? So why do you think everybody is in Santa Clara today
talking about different hardware solutions? Why is this such a focus, do you think?
Again, I think there's just so much focus on AI. There's so much demand for solutions that
get the training done quicker. So much money is being spent on those jobs. Everyone's just looking for solutions to drive down those costs.
And when there's that much investment there, then again, it's going to justify
some investment in specific custom solutions to extract some price performance
or power savings from it.
So I think you're going to absolutely see a combination.
Again, a GPU with a standard software stack that everyone's familiar with may be useful,
will always be useful, I think, as a playground for initial development of new innovations in
the space. But then once you've got a need for just running a ton of training on a specific architecture, a specific software workload, then there's just going to be too much incentive to develop a custom architecture not to take it.
One other thing that comes up a lot in this space is energy. Can you talk a little bit about how chiplet architectures could be
an opportunity to look for more efficient solutions?
Yeah, I mean, energy costs are, you know, an operating expense. It's a significant part of the total cost of ownership of these
hardware systems. So there's also tremendous motivation to try to get power down and squeeze every last
milliwatt out of these systems. One way we can do that is integrating more tightly
all the compute that's required here. So taking compute that previously was located
on two different boards or two different chips on a board and putting it all in one package.
So that now when information flows back and forth between them, instead of traversing
a printed circuit board trace or even worse, a cable, now you're just going less than a
millimeter between die edges in the same package.
And that can be done for a small fraction of a picojoule per bit.
So that's one way, an important way, that properly engineered, chiplet-based AI compute can help extract power savings here.
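As a rough illustration of the savings from shortening that hop, here is a simple energy-per-bit comparison; the per-link figures are assumed orders of magnitude, not measured numbers from AlphaWave.

```python
# Illustrative interconnect power at a fixed aggregate bandwidth as a link moves
# from a cable or board trace down to a die-to-die hop inside one package.
# The energy-per-bit figures are assumed orders of magnitude only.

AGGREGATE_TBPS = 10.0  # hypothetical 10 Tb/s flowing between two compute tiles

energy_pj_per_bit = {
    "cable / long-reach link":  10.0,   # assumed
    "PCB trace, chip to chip":   5.0,   # assumed
    "die-to-die, in package":    0.5,   # "a small fraction of a picojoule per bit"
}

for link, pj in energy_pj_per_bit.items():
    # power [W] = (bits per second) * (joules per bit)
    watts = (AGGREGATE_TBPS * 1e12) * (pj * 1e-12)
    print(f"{link:26s} ~{watts:5.0f} W at {AGGREGATE_TBPS:.0f} Tb/s")
```

At the same bandwidth, moving a link from the board or a cable down to a die-to-die hop inside the package cuts interconnect power by an order of magnitude or more, which is the power-savings mechanism described above.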
So we've been talking about data centers and talking about compute done in high concentration
at the nexus of what's going on in AI,
but a lot of people are talking about edge AI too.
Tell me about edge AI.
How do you define it?
And what's AlphaWave's strategy there?
Edge AI is this funny term
because I think it means different things
to different people.
I think the commonly understood meaning of edge AI
is some basic inference tasks
that are performed by your phone in your hand
or by some voice recognition device somewhere in your home.
That's the very, very edge, if you like, the very endpoints of our networks.
So that's a kind of a basic edge AI.
From an AlphaWave perspective, when I think about edge AI, what I think about is this trend towards rolling out more and more infrastructure compute, let's call it cloud compute, whatever you want to call it, in regional sites, not right at the massive, biggest hyperscale data centers, but in regional sites that are sprinkled more liberally around countries around the world, so that they're closer to the end users. So when you're performing some AI tasks, you're not always having to go all the way back to these massive hyperscale data centers. Some basic tasks can be performed with lower latency, again, and more responsiveness.
So that's an important trend, because essentially it means that now, instead of just a few big hyperscale data centers, you've got almost like distributed virtual data centers all over the place that have to be interconnected with tremendous aggregate bandwidth, and that's leading to a lot of refinement of new optical connectivity technologies. So, for example, coherent optical connectivity, which was originally developed and rolled out for very long-haul communication, like trans-oceanic class links, that technology has been refined so that it can be used for this kind of application, connecting within a campus or maybe to a regional data center, tens or maybe low hundreds of kilometers away. So that's an interesting development that's caused a massive increase in the number of coherent optical links.
And that's an area that I believe is really strategic.
Long-term, I see that kind of technology proliferating
even into the data centers
and seeing higher and higher volumes over time.
You know, I think that one of the things that I've learned is that we have incredible opportunity for silicon innovation at this moment. I've spent over 20 years in the silicon arena, and I've never seen a moment like this in terms of opportunity for the industry to come
together and innovate. When you think about AlphaWave, and you just talked about why you joined the company
a few minutes ago, why do you think that you've got the right formula for growth?
What do you think sets you apart?
What I'm really excited about is that we've been able to create a vertically integrated semiconductor company where we provide connectivity, like industry-leading connectivity, silicon IP solutions for our customers, whether it's for Ethernet, PCI Express, CXL, some of these data interfaces.
We also work with customers to provide custom silicon solutions that can leverage that industry
leading connectivity IP.
We can go right from the spec all the way to silicon products.
That can be in the form of a fully packaged chip,
taking advantage of advanced 2.5D, 3D packaging,
or a chiplet, right?
Just developing a custom chiplet to enable,
as I said, this broader ecosystem,
more and more players to participate there.
And then even all the way to our own standard products
that can serve connectivity demands
both over optical and electrical links inside data centers. It's just that the ability to leverage our industry-leading connectivity solutions across the full spectrum of offerings allows us the flexibility to work with some of the biggest players in AI and meet them wherever they're at, helping them solve problems in whatever way makes the most sense, whether it's, again, IP licensing, standard products, custom silicon, or anything in between. So that's what I think is really exciting about this point in time for us.
Tony, it's been a pleasure talking to you today. I loved what you said about where you guys are focused. I can't wait to hear more.
Where can folks go to learn more about AlphaWave Semi and what you're delivering to the market
and to engage with your team?
You can absolutely visit our website at awavesemi.com or follow us on LinkedIn. Or you can follow me on LinkedIn as well. I'm always trying to
post interesting content about what's going on in AI and connectivity more generally.
Well, thanks so much for being on the program today.
It was awesome.
Thanks very much for having me.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by The Tech Arena.