In The Arena by TechArena - Ayar Labs' Mark Wade on Optical I/O Boosting AI Performance

Episode Date: September 11, 2024

Mark Wade, CEO of Ayar Labs, explains how optical I/O technology is enhancing AI infrastructure, improving data movement, reducing bottlenecks, and driving efficiency in large-scale AI systems.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and today we're coming to you from the AI Hardware Summit in the Bay Area. I am delighted to be with Mark Wade, CEO and co-founder of Ayar Labs. Welcome to the program. Mark, how are you doing? Hey, I'm doing great. Really excited
Starting point is 00:00:39 to be here. Thanks for having us on. Now, Ayar Labs has been making a tremendous splash, and you guys have been around for a while, well known in the HPC and AI arenas for optical networking solution delivery. But this is the first time we've had you on the Tech Arena. So can you provide a bit of background on the company and your technology? Yeah, sure. For one thing, myself and my co-founders have been working on this core technology of what we call optical I/O for a long time. We predicted that data movement and connectivity were on a path to really becoming a major, important piece of large-scale compute systems.
Starting point is 00:01:13 And what we're doing at IR Labs is bringing these technology innovations that we've been working on and bringing forward the industry's first commercially viable, what we call in-package optical I.O. solutions to drive some of the key performance aspects of large-scale AI. And the way we do that is we're using light to transfer data between compute chips. And you could think of that as between GPUs, CPUs, accelerators. Whereas today, all of those connections are done with electrical communications. So it's really about bringing data movement and bandwidth and energy efficiency capabilities of integrated photonics and optical solutions deep into the compute stack and replacing electrical I.O. with optical I.O. That's awesome. Now, we're at the AI Hardware Summit,
Starting point is 00:01:56 as I mentioned before, and there's a lot of attention being focused on processor innovation, but some could argue that AI networks are an equal opportunity of innovation and maybe even more of a bottleneck. Why is that? Yeah, so there's a lot of exciting work happening in what we think of as the compute domain. So people looking at building more specialized processors, focusing on the actual compute chip itself. But what's happening in large-scale AI is that as AI models have gotten larger and more complex, those models far exceed the amount of memory that can fit on any one single GPU or accelerator. And when you look into the AI computation, the performance of the overall AI
Starting point is 00:02:38 workload becomes limited by how quickly you can crunch through these really large matrix multiplications that are happening underneath the hood. And that becomes limited by memory bandwidth and by how large of a memory capacity you can have. So once you exceed that footprint of a single GPU or accelerator, you have to move into a domain to where you're building systems of large numbers of GPUs and accelerators that are connected together. And that keyword there, connected together, is really where the connectivity focus starts to come in. And you're trying to improve and build networks of connectivity that allow you to solve these key kind of memory bottlenecks. So what people are realizing and what I'll be talking a lot about at the AI Hardware Summit is that it's not just a computation that's becoming a big focus. Interconnect is now a
Starting point is 00:03:25 major limiter of the unit economics of AI. So that's where Interconnect comes in as a big focus. When you think about the scale that they're building, and this is something that I think is really fascinating, they're looking at scaling clusters up to 100,000 GPUs and beyond to reach these incredible performance metrics to train AI algorithms and get to a closer state with AGI. So what is the challenge with connecting all of these GPUs together? And what does that mean for Ayer as you look at the solutions that you're providing in market? Yeah, that's a great question. And one thing I'll be talking about in my hardware summit talk coming up is
Starting point is 00:04:09 you have to look at the anatomy of how these systems are getting built. And people hear about these big GPU clusters of 10,000 plus heading to 100,000. But the way those are connected up together, largely they're scaling the number of connections through what we call the scale-out fabric. And that part of the fabric today is a much lower performing, low bandwidth, high latency, high power consumption performing part of the network. Now, if you drill down underneath the hood there, there's a portion of the network called the scale-up fabric. And that's the key piece that really scales the efficient performance of the AI model computations and throughputs. And that piece is much, much smaller today.
Starting point is 00:04:49 So most people are using eight GPU or accelerator domain sizes for the scale up portion. And maybe a market leading edge solution might use 64 GPUs. Now, the key thing to realize is that the performance scales with the scale up portion of the network. So one way to think about what we're driving is as we're trying to get to larger and larger overall GPU clusters in these thousands, tens of thousands heading towards 100,000, we want to bring connectivity solutions that can expand the domain size of that scale up portion of the fabric. And while that's having challenges to scale, you're seeing these challenges show up in many different places. Training wants to drive towards larger models with more complexity,
Starting point is 00:05:31 higher data volumes, larger input and output sequence links. But the connectivity bottlenecks that exist in these systems today, it shows up as it's going to cost you more money to train each model. It's going to take a longer amount of time for a full iteration of training to happen. And that really impacts the key rate of innovation that these model builders can really lean into and how fast we're able to innovate and bring new capabilities forward in AI. And it's putting tremendous pressure
Starting point is 00:05:57 on data center infrastructure as well in terms of the power per rack, the power density inside of that rack. So there's both a challenge on the economics of training at that scale and also a challenge of data center infrastructures trying to solve some of the physical constraints and challenges that are coming up, especially around power consumption and power density. And underneath the hood there, a lot of those bottlenecks are coming from interconnects. Now, you are delivering incredible performance capability in the market,
Starting point is 00:06:23 and I think that you're known for that. But I want you to walk me through how you look at optical network capacity, because capacity is such a key element of what we're talking about. What differentiates your solutions in the marketplace? Yeah, so one, I would draw the deliation here to be between, going back to what I mentioned earlier, the scale-out part of the network and the scale-up part of the network. So today, if you look at these AI systems, optical solutions are already being deployed and delivered into the scale-out portion. And people are using optical transceivers that come from the more traditional data center networking market instead of products. But if you go down into the scale-up portion of the fabric, and if we use NVIDIA as a benchmark that everyone is
Starting point is 00:07:09 familiar with, you're talking about the NVLink portion of the fabric. So the NVLink portion of the fabric is what connects all the NVIDIA GPUs together, either directly or through an NVSwitch. And today, no one is deploying optics in the scale-up portion of the network. So one thing that differentiates the application we've focused on, as well as driving the foundational focus of our technology in terms of its technical value proposition, is to bring optical I.O. and optical connectivity directly into the compute portion of that fabric. And to do that, we had to solve a number of key innovations in terms of the devices that are underneath the hood at the front end of these way down in the semiconductor stack, the integration in high volume CMOS manufacturing, and making sure we're bringing key technology metrics like bandwidth density, latency, and power efficiency to really solve that compute to compute, compute to memory bottleneck happening in the scale up portion. So that's the biggest thing that we look at is we're really focused on bringing optics into the scale up computing fabric in these AI systems.
Starting point is 00:08:11 That makes a lot of sense. Another thing that I wanted to ask you about is energy efficiency. Obviously, all of this computing is consuming a tremendous amount of energy and those GPUs love to consume energy. I think that the question that I have for you is how does optical play a role in driving improvement from an energy utilization perspective? And what does it do from a standpoint of the rack density that your customers can attain? Yeah, it's a great question. And one thing that we're focused on a lot inside of our labs is really taking the key low-level device improvements and value propositions, but connecting them all the way up to what it can do for the top-level AI workload. And there's a number of figures of merit that we're starting to focus on and really talk about more broadly, especially as we look at large-scale AI inference. AI training is a big problem to focus on, which we already touched on.
Starting point is 00:09:09 But whenever you get to deployment, the call structure of applications built on top of an AI stack really come down to how efficiently you can run large-scale AI inference. And there's a few figures of merit that we focus on. One is something that we call throughput. So just how many users can you support on an AI system and how many tokens are being produced in that AI workload and how long is it taking you to produce those tokens? And then that feeds into a metric that we look at called profitability. You could think of this also as cost structure, but really what's a figure of merit that we
Starting point is 00:09:41 can look at that gives us how much throughput are we producing divided by how much is it costing us in dollars and also divided by how much power is it consuming? And you have to look at that metric as the last figure of merit we look at is called interactivity. So we really become focused on predicting and looking at how we can improve profitability as a function of interactivity, which is a metric that tells you how quickly you can interact with these large models. And at the end of the day, why we're so excited about what we're bringing forth in Optical I.O. is you see dramatic improvements in profitability and in interactivity. So you can open up larger domains of interactivity to bring forward domains that can support things like machine-to-machine communications and agentic workflows. But you have to be able to do that
Starting point is 00:10:24 in a way that creates enough room at the top of the stack for application builders and customers to actually build profitable business models on top of this. We're looking at order of magnitude type benefits in having much better profit structures and profitability headroom as a function of this key metric interactivity. For us, it's about driving the key economics to allow application builders and innovators at the top of the stack to have enough room and a cost-efficient enough cost structure to bring products to the market that people can afford. Now, Mark, the metrics that you just went through,
Starting point is 00:10:56 you've just introduced these to customers and to the industry. What has the response been on the focus of profitability, interactivity, and throughput? And where do you see this driving implementations? Yeah, so I think the response has been very powerful so far. And I think that's coming off of, we're about two years into the post-ChatGPT time period here, where in November, December of 2022, OpenAI unveiled ChatGPT. And rightfully so, the world became very excited,
Starting point is 00:11:26 investors became excited. But now we're almost two years beyond that. And one thing that's happened is the market and investors have figured out that while everyone's very excited about this transition to AI across the board, the cost structure of the unit economics of the AI workload has a lot of challenges in it right now that's creating a challenging environment for people to build cost-effective applications at the top. So there's a lot of focus coming in on, okay, in any major technology shift, you often go through this transition to where you first have to go re-optimize and build a new set of infrastructure, hardware infrastructure, that can drive efficiency and cost-effective throughput of eventual application workloads coming on top of it.
Starting point is 00:12:08 That is really a profitability challenge where people building AI features into their products today, they're challenged on how much those features are costing them. And it's not yet clear that there are end customers willing to dramatically increase the amount that they're paying for these AI features. So one is that the focus on saying, okay, we're a little bit dying down from the initial hype of how exciting generative AI is. And now we're focused on how do we drive efficiency in the infrastructure? How do we improve the cost of the unit economics of the AI workload?
Starting point is 00:12:39 And how do we get more granular on finding where exactly in the hardware infrastructure these bottlenecks are coming from? And then let's accelerate and pull forward technology innovations that can solve these bottlenecks. And the good news is there's clear ways that we solve these bottlenecks. And it's about accelerating those solutions into the hardware stack and building towards a path of AI application profitability. Our conversations have been met with a lot of alignment on both the problem statement
Starting point is 00:13:04 and the challenge and the focus on where the solutions are going to come for a year for optical? And what do you think is going to drive that in terms of hyperscale market and beyond? Yeah, so it's good to touch on timing here. And semiconductor hardware cycles operate on a two to three year kind of timing. And that's usually in the best of cases whenever technologies and products are stable. But what we're really facing here is everyone recognizing that new innovations need to come deep in the hardware stack. And behind the scenes, the high volume tier one hardware vendors and designers, we're
Starting point is 00:13:55 seeing a lot of traction and a lot of movement on bringing these solutions and optical connectivity forward. We tend to focus on a two year time period where we think that these inflection points of optical solutions coming into AI scale-up systems are going to start happening. And that's to mature the readiness of this breakthrough optical IO solutions and products so that by the time we're getting into the 26, 27, 28 timeframe, there's a high volume semiconductor ecosystem delivering these in hundreds of thousands to millions of parts per month, going into customer deployments, getting deployed at hyperscalers. I believe 2025 is going to be a year that there's a lot of activity happening behind the scenes still. And in some of that
Starting point is 00:14:55 may start to come out in the public, but really this is aiming at a 26, 27, 28 production deployment set of intercepts. And so I think for the rest of this year and through a lot of next year, there's going to be a lot of hard work happening behind the scenes, maturing the manufacturing ecosystem in the supply chain ecosystem, setting the stage for high volume ramps. Now, when you take a look at the solution that you're embedded into an AI system today, and you talked about integrated optics. How do you see your solutions being embedded? How is your solution an advantage in terms of the way machines are being built today? And how do chiplets relate to your design? Yeah, there's a number of themes that are all
Starting point is 00:15:37 kind of converging at the same time here. And one, as we already talked about, once you have optical IO coming straight out of the GPU or accelerator package, you really start to alleviate a lot of the power and power density problems that are happening inside of racks right now because of the electrical bandwidth distance trade-off. So while you're using electrical I.O., there's a force that's driving you to cram all of these components closer and closer together. And it drives up this problem on power density, which triggers a whole bunch of other issues. And so the drive to optical IO coming straight from the package really starts to simplify rack design and rack infrastructure design for data center operators and people building infrastructure. Now, chiplets, as you touched on, are a key part of that innovation. And for those of you who have been following innovations happening in the semiconductor chip world, starting about almost 10 years ago now, maybe
Starting point is 00:16:29 eight years ago or so, companies like AMD, Intel started to innovate on bringing multiple chips inside of a package. And one thing that we recognized and a key part of the innovation that we're bringing forward is we're able to bring an optical chiplet that comes inside of the GPU or accelerator package. And it largely looks and is integrated just as you would integrate another electrical CMOS chiplet inside of that same package. So the community, the industry has gotten good at integrating electrical chiplets inside of these GPU and accelerator packages. And IRLabs is bringing forward what we call an optical chiplet, which has on the order of 100 million transistors on it, as well as tens
Starting point is 00:17:11 of thousands of optical devices. And it largely looks like a drop-in replacement for electrical chiplets that are already in that package. That's paved the way and streamlines the path to high volume CMOS scale integration. And the innovations that have happened in the ecosystem related to chiplet integration have really helped pave the way for bringing forward an innovation like what we're doing in the optical world. I have loved this conversation, Mark, and it's been delightful to learn a little bit more about what Ayer is delivering and how it's impacting something that the entire industry is talking about, the advancement of AI. I know that we've probably piqued interest from our
Starting point is 00:17:50 listeners. Where can folks find out more about the solutions we discussed today and engage your team? Yeah, well, one is we've got a lot of great information on our website, and that's just ayerlabs.com. We've got information about the core technology. We've got a number of resources that people can dig into related to our products and also directionally where the roadmaps are going. And we're showing up on the kind of conference tour here starting this week at the AI Hardware Summit, but also throughout the rest of the year at a variety of industry conferences. So I'd say you'll follow our LinkedIn, check our website, and would love to interact with anyone at upcoming conferences as
Starting point is 00:18:25 well. Fantastic. Thanks so much for your time today. It was a real pleasure. Thank you so much. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
