In The Arena by TechArena - Ayar Labs' Mark Wade on Optical I/O Boosting AI Performance
Episode Date: September 11, 2024
Mark Wade, CEO of Ayar Labs, explains how optical I/O technology is enhancing AI infrastructure, improving data movement, reducing bottlenecks, and driving efficiency in large-scale AI systems....
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Alison Klein.
Now, let's step into the arena.
Welcome to the Tech Arena.
My name is Alison Klein, and today we're coming to you
from the AI Hardware Summit in the Bay Area. I am delighted to be with Mark Wade, CEO and co-founder
of Ayar Labs. Welcome to the program. Mark, how are you doing? Hey, I'm doing great. Really excited
to be here. Thanks for having us on. Now, Ayar Labs has been making a tremendous splash, and you guys have been around for a while.
Well known in the HPC and AI arenas for optical networking solution delivery.
But this is the first time we've had you on the tech arena.
So can you provide a bit of background on the company and your technology?
Yeah, sure.
For one thing, my co-founders and I have been working on this core technology of what we call optical I/O for a long time.
We predicted that data movement and connectivity were on a path to becoming a major
piece of large-scale compute systems.
And what we're doing at Ayar Labs is taking the technology innovations that we've been
working on and bringing forward the industry's first commercially viable, what we call in-package
optical I/O solutions to drive some of the key performance
aspects of large-scale AI. And the way we do that is we're using light to transfer data between
compute chips. And you could think of that as between GPUs, CPUs, accelerators. Whereas today,
all of those connections are done with electrical communications. So it's really about bringing the data
movement, bandwidth, and energy efficiency capabilities of integrated photonics and optical solutions deep into the compute stack and
replacing electrical I/O with optical I/O. That's awesome. Now, we're at the AI Hardware Summit,
as I mentioned before, and there's a lot of attention being focused on processor innovation,
but some could argue that AI networks are an equal opportunity
for innovation and maybe even more of a bottleneck. Why is that? Yeah, so there's a lot of exciting
work happening in what we think of as the compute domain. So people looking at building more
specialized processors, focusing on the actual compute chip itself. But what's happening in
large-scale AI is that as AI models have gotten larger and more
complex, those models far exceed the amount of memory that can fit on any one single GPU or
accelerator. And when you look into the AI computation, the performance of the overall AI
workload becomes limited by how quickly you can crunch through these really large matrix
multiplications that are happening underneath the hood. And that becomes limited by memory bandwidth and by how large of
a memory capacity you can have. So once you exceed that footprint of a single GPU or accelerator,
you have to move into a domain to where you're building systems of large numbers of GPUs and
accelerators that are connected together. And that keyword there, connected together, is really where the connectivity focus starts to come in. And you're trying to improve and build
networks of connectivity that allow you to solve these key kind of memory bottlenecks.
So what people are realizing, and what I'll be talking a lot about at the AI Hardware Summit,
is that it's not just the computation that's becoming a big focus. Interconnect is now a
major limiter of the unit economics of AI. So that's where interconnect comes in as a big focus.
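To make the memory argument above concrete, here is a back-of-envelope sketch (not from the interview; the parameter count, precision, and HBM capacity are illustrative assumptions) of why a large model's weights alone force a multi-GPU, interconnect-bound system:

```python
import math

# Back-of-envelope sketch: why large models exceed a single accelerator's memory.
# All numbers are illustrative assumptions, not figures quoted in the interview.

def min_gpus_for_weights(num_params: float, bytes_per_param: int, hbm_gb: float) -> int:
    """Minimum number of GPUs needed just to hold the model weights in HBM."""
    weight_bytes = num_params * bytes_per_param
    hbm_bytes = hbm_gb * 1e9
    return math.ceil(weight_bytes / hbm_bytes)

# Hypothetical 1-trillion-parameter model in 16-bit precision on 80 GB accelerators.
gpus = min_gpus_for_weights(num_params=1e12, bytes_per_param=2, hbm_gb=80)
print(f"Weights alone need at least {gpus} GPUs")  # -> 25

# Activations, KV caches, and optimizer state push the real number far higher,
# which is why the bandwidth between those GPUs becomes the limiting factor.
```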
When you think about the scale that these companies are building, and this is something that I think
is really fascinating, they're looking at scaling clusters up to 100,000 GPUs and beyond to reach these incredible performance metrics
to train AI models and get closer to AGI.
So what is the challenge with connecting all of these GPUs together?
And what does that mean for Ayar Labs as you look at the solutions that you're providing in
the market?
Yeah, that's a great question. And one thing I'll be talking about in my hardware summit talk coming up is
you have to look at the anatomy of how these systems are getting built. And people hear about
these big GPU clusters of 10,000 plus heading to 100,000. But the way those are connected up
together, largely they're scaling the number of connections through what we call
the scale-out fabric. And that part of the fabric today is the lower-performing part of the network: lower bandwidth,
higher latency, and higher power consumption. Now, if you drill down
underneath the hood there, there's a portion of the network called the scale-up fabric.
And that's the key piece that really scales the efficient performance of the AI model computations and throughputs.
And that piece is much, much smaller today.
So most people are using eight-GPU or accelerator domain sizes for the scale-up portion,
and maybe a leading-edge solution in the market might use 64 GPUs.
Now, the key thing to realize is that the performance scales with the scale-up portion of the network. So one
way to think about what we're driving is, as we're trying to get to larger and larger overall GPU
clusters in these thousands, tens of thousands heading towards 100,000, we want to bring
connectivity solutions that can expand the domain size of that scale-up portion of the fabric.
And that portion is having challenges scaling, and you're seeing those challenges
show up in many different places. Training wants to drive towards larger models with more complexity,
higher data volumes, and larger input and output sequence lengths. But the connectivity bottlenecks
that exist in these systems today show up in the economics: it's going to cost you more money to train each
model, and it's going to take a longer amount of time for a full iteration of training to happen.
And that really impacts the key rate of innovation
that these model builders can really lean into
and how fast we're able to innovate
and bring new capabilities forward in AI.
And it's putting tremendous pressure
on data center infrastructure as well
in terms of the power per rack,
the power density inside of that rack.
So there's both a challenge
on the economics of training at that scale and also a challenge of data center infrastructures trying to solve
some of the physical constraints and challenges that are coming up, especially around power
consumption and power density. And underneath the hood there, a lot of those bottlenecks are coming from interconnects.
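As a rough illustration of why the bandwidth of the scale-up domain dominates, here is the standard ring all-reduce time model applied to a single collective exchange; the link speeds and tensor size are assumed for illustration and are not figures from the interview:

```python
# Rough sketch: time for a ring all-reduce across N accelerators.
# Link bandwidths and the tensor size below are illustrative assumptions.

def ring_allreduce_seconds(tensor_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Ring all-reduce moves roughly 2*(N-1)/N of the tensor through each GPU's links."""
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * tensor_bytes
    return bytes_moved / (link_gb_per_s * 1e9)

tensor = 1e9  # a hypothetical 1 GB gradient exchange

# Comparing an assumed scale-up (NVLink-class) link against an assumed
# scale-out (NIC-class) link for a 64-GPU domain.
for name, bw_gb_per_s in [("scale-up link, assumed 400 GB/s", 400.0),
                          ("scale-out link, assumed 50 GB/s", 50.0)]:
    t = ring_allreduce_seconds(tensor, n_gpus=64, link_gb_per_s=bw_gb_per_s)
    print(f"{name}: ~{t * 1e3:.1f} ms per exchange")
```

The absolute numbers matter less than the ratio: every step of a tightly synchronized training or inference job pays this exchange cost, so a lower-bandwidth fabric translates directly into idle GPUs.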
Now, you are delivering incredible performance capability in the market,
and I think that you're known for that. But I want you to walk me through how you look at optical network capacity,
because capacity is such a key element of what we're talking about. What differentiates your
solutions in the marketplace? Yeah, so one, I would draw the
delineation here, going back to what I mentioned earlier, between the scale-out part of
the network and the scale-up part of the network. So today, if you look at these AI systems, optical
solutions are already being deployed and delivered into the scale-out portion. And people are using
optical transceivers that come from the more traditional data center networking market's set
of products. But if you go down into the scale-up portion of the fabric, and if we use NVIDIA as a benchmark that everyone is
familiar with, you're talking about the NVLink portion of the fabric. So the NVLink portion of
the fabric is what connects all the NVIDIA GPUs together, either directly or through an NVSwitch.
And today, no one is deploying optics in the scale-up portion of the network. So one thing that differentiates the application we've focused on, as well as driving the foundational focus of our technology in terms of its technical value proposition, is bringing optical I/O and optical connectivity directly into the compute portion of that fabric.
And to do that, we had to deliver a number of key innovations: the devices that are underneath the
hood, way down in the semiconductor stack; the integration into high-volume
CMOS manufacturing; and making sure we're bringing key technology metrics like bandwidth
density, latency, and power efficiency to really solve that compute-to-compute, compute-to-memory
bottleneck happening in the scale-up portion. The biggest thing is that we're really focused on bringing optics into the scale-up computing fabric in these AI systems.
That makes a lot of sense. Another thing that I wanted to ask you about is energy efficiency.
Obviously, all of this computing is consuming a tremendous amount of energy and those GPUs
love to consume energy. I think that the question that I have for
you is how does optical play a role in driving improvement from an energy utilization perspective?
And what does it do from a standpoint of the rack density that your customers can attain?
Yeah, it's a great question. And one thing that we're focused on a lot inside of our labs is really taking the key low-level device improvements and value propositions, but connecting them all the way up to what it can do for the top-level AI workload.
And there's a number of figures of merit that we're starting to focus on and really talk about more broadly, especially as we look at large-scale AI inference.
AI training is a big problem to focus on, which we already touched on.
But whenever you get to deployment, the cost structure of applications built on top of
an AI stack really comes down to how efficiently you can run large-scale AI inference.
And there's a few figures of merit that we focus on.
One is something that we call throughput.
So just how many users can you support on an AI system and how many tokens are being
produced in that AI workload and how long is it taking you to produce those tokens?
And then that feeds into a metric that we look at called profitability.
You could think of this also as cost structure, but really what's a figure of merit that we
can look at that gives us how much throughput are we producing divided by how much is it costing us in dollars and also divided by how much power is it consuming?
And you have to look at that metric as a function of the last figure of merit we look at, which is called
interactivity. So we really become focused on predicting and looking at how we can improve
profitability as a function of interactivity, which is a metric that tells you how quickly
you can interact with these large models. And at the end of the day, why we're so excited about what we're bringing forth in Optical
I.O. is you see dramatic improvements in profitability and in interactivity. So you can
open up larger domains of interactivity to bring forward domains that can support things like
machine-to-machine communications and agentic workflows. But you have to be able to do that
in a way that creates enough room at the top of the stack for application builders and customers
to actually build profitable business models on top of this. We're looking at order of magnitude
type benefits in having much better profit structures and profitability headroom as a
function of this key metric, interactivity. For us, it's about driving the key economics to allow application builders and innovators
at the top of the stack to have enough room
and a cost-efficient enough cost structure
to bring products to the market that people can afford.
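The three figures of merit Mark describes can be written as simple ratios. Below is a minimal sketch with made-up numbers; the formulas follow his description (throughput divided by dollars and divided by watts, viewed as a function of interactivity), but the class names and values are assumptions for illustration only:

```python
from dataclasses import dataclass

# Minimal sketch of the figures of merit described in the interview.
# All numeric values are illustrative assumptions, not Ayar Labs data.

@dataclass
class InferenceSystem:
    tokens_per_second: float      # total tokens produced per second (throughput)
    concurrent_users: int         # users being served at once
    cost_dollars_per_hour: float  # cost of running the system
    power_watts: float            # power the system consumes

    @property
    def interactivity(self) -> float:
        """Tokens per second seen by each individual user."""
        return self.tokens_per_second / self.concurrent_users

    @property
    def profitability(self) -> float:
        """Throughput divided by cost and divided by power, per Mark's description."""
        return self.tokens_per_second / self.cost_dollars_per_hour / self.power_watts

# Hypothetical example: a system serving 1,000 users.
system = InferenceSystem(tokens_per_second=50_000, concurrent_users=1_000,
                         cost_dollars_per_hour=400.0, power_watts=120_000)
print(f"interactivity: {system.interactivity:.1f} tokens/s per user")
print(f"profitability figure of merit: {system.profitability:.6f}")
```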
Now, Mark, the metrics that you just went through,
you've just introduced these to customers
and to the industry.
What has the response been on the focus of profitability,
interactivity, and throughput?
And where do you see this driving implementations?
Yeah, so I think the response has been very powerful so far.
And I think that's coming off of the fact that we're about two years into the post-ChatGPT time period here, where in November, December of 2022, OpenAI unveiled ChatGPT.
And rightfully so, the world became very excited,
investors became excited. But now we're almost two years beyond that. And one thing that's happened
is the market and investors have figured out that while everyone's very excited about this
transition to AI across the board, the cost structure of the unit economics of the AI workload
has a lot of challenges in it right now, and that's creating a
challenging environment for people to build cost-effective applications at the top. So there's
a lot of focus coming in on, okay, in any major technology shift, you often go through this
transition to where you first have to go re-optimize and build a new set of infrastructure,
hardware infrastructure, that can drive efficiency and cost-effective throughput of eventual application workloads coming on top of it.
That is really a profitability challenge: people building AI features into their products
today are challenged on how much those features are costing them.
And it's not yet clear that there are end customers willing to dramatically increase
the amount that they're paying for these AI features.
So one thing is that we're coming down a little bit from the
initial hype of how exciting generative AI is.
And now we're focused on how do we drive efficiency in the infrastructure?
How do we improve the cost of the unit economics of the AI workload?
And how do we get more granular on finding where exactly in the hardware infrastructure
these bottlenecks are coming from?
And then let's accelerate and pull forward technology innovations that can solve these
bottlenecks.
And the good news is there's clear ways that we solve these bottlenecks.
And it's about accelerating those solutions into the hardware stack and building towards
a path of AI application profitability.
Our conversations have been met with a lot of alignment on both the problem statement
and the challenge and the focus on where the solutions are going to come from.
So what is the timing for Ayar and for optical? And what do you think is going to drive that in terms of the hyperscale market and beyond?
Yeah, so it's good to touch on timing here.
And semiconductor hardware cycles operate on a two to three year kind of timing.
And that's usually in the best of cases whenever technologies and products are stable.
But what we're really facing here is everyone recognizing that new innovations need to come
deep in the hardware stack.
And behind the scenes, with the high-volume, tier-one hardware vendors and designers, we're
seeing a lot of traction and a lot of movement on bringing these solutions and optical connectivity
forward.
We tend to focus on a two year time period where we think that these inflection points of optical solutions coming into AI scale-up systems are going to start happening.
And that's to mature the
readiness of these breakthrough optical I/O solutions and products so that by the time we're getting
into the '26, '27, '28 timeframe, there's a high-volume semiconductor ecosystem delivering these
in hundreds of thousands to millions of parts per month, going into customer deployments, getting deployed at hyperscalers. I believe 2025 is going to be a
year that there's a lot of activity happening behind the scenes still. And some of that
may start to come out in public, but really this is aiming at a '26, '27, '28 production deployment
set of intercepts. And so I think for the rest of this year and
through a lot of next year, there's going to be a lot of hard work happening behind the scenes,
maturing the manufacturing ecosystem and the supply chain ecosystem, setting the stage for
high volume ramps. Now, when you take a look at how your solution is embedded into an AI system
today, and you talked about integrated optics, how do you see your solutions being
embedded? How is your solution an advantage in terms of the way machines are being built today?
And how do chiplets relate to your design? Yeah, there's a number of themes that are all
kind of converging at the same time here. And one, as we already talked about, once you have
optical I/O coming straight out of the GPU or accelerator package, you really start to alleviate a lot of the power and power density problems that are happening inside of racks right now because of the electrical bandwidth-distance trade-off.
So while you're using electrical I/O, there's a force that's driving you to cram all of these components closer and closer together.
And it drives up this problem on power
density, which triggers a whole bunch of other issues. And so the drive to optical I/O coming
straight from the package really starts to simplify rack design and rack infrastructure design for
data center operators and people building infrastructure. Now, chiplets, as you touched on,
are a key part of that innovation. And for those of you who have been following innovations happening in the semiconductor chip world, starting almost 10 years ago now, maybe
eight years ago or so, companies like AMD and Intel started to innovate on bringing multiple chips
inside of a package. And one thing that we recognized and a key part of the innovation
that we're bringing forward is we're able to bring an optical chiplet that comes
inside of the GPU or accelerator package. And it largely looks and is integrated just as you would
integrate another electrical CMOS chiplet inside of that same package. So the community, the industry
has gotten good at integrating electrical chiplets inside of these GPU and accelerator packages.
And Ayar Labs is bringing forward what
we call an optical chiplet, which has on the order of 100 million transistors on it, as well as tens
of thousands of optical devices. And it largely looks like a drop-in replacement for electrical
chiplets that are already in that package. That's paved the way and streamlines the path to high
volume CMOS scale integration. And the innovations that have happened in the ecosystem related to chiplet integration
have really helped pave the way for bringing forward an innovation like what we're doing
in the optical world.
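To put the earlier power-density argument in rough numbers: I/O power scales as bandwidth times energy per bit, so the picojoules-per-bit of the link technology sets how much off-package bandwidth a rack can afford. The pJ/bit and bandwidth values below are illustrative assumptions, not Ayar Labs specifications:

```python
# Rough sketch of why energy per bit matters for package-escape bandwidth.
# The pJ/bit and bandwidth values below are illustrative assumptions only.

def io_power_watts(bandwidth_tbps: float, pj_per_bit: float) -> float:
    """I/O power = bandwidth (bits/s) * energy per bit (joules/bit)."""
    bits_per_second = bandwidth_tbps * 1e12
    return bits_per_second * pj_per_bit * 1e-12

escape_bandwidth_tbps = 10.0  # hypothetical off-package bandwidth per accelerator

for name, pj in [("long-reach electrical SerDes (assumed ~10 pJ/bit)", 10.0),
                 ("in-package optical I/O (assumed ~3 pJ/bit)", 3.0)]:
    watts = io_power_watts(escape_bandwidth_tbps, pj)
    print(f"{name}: ~{watts:.0f} W of I/O power per accelerator")
```

Multiplied across tens of accelerators per rack, that difference in I/O power is a large part of the rack power and power-density pressure described above.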
I have loved this conversation, Mark, and it's been delightful to learn a little bit
more about what Ayar Labs is delivering and how it's impacting something that the entire industry
is talking about, the advancement of AI. I know that we've probably piqued interest from our
listeners. Where can folks find out more about the solutions we discussed today and engage your team?
Yeah, well, one is we've got a lot of great information on our website, and that's just
ayarlabs.com. We've got information about the core technology. We've got a number of resources
that people can dig into related to our products and also directionally where the roadmaps are going.
And we're showing up on the kind of conference tour here starting this week at the AI Hardware
Summit, but also throughout the rest of the year at a variety of industry conferences.
So I'd say follow our LinkedIn, check our website, and we'd love to interact with anyone
at upcoming conferences as
well. Fantastic. Thanks so much for your time today. It was a real pleasure. Thank you so much.
Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by The Tech Arena.