Semiconductor Insiders - Podcast EP348: How Lemurian Labs is Building the Foundation for AI’s Next Era with Jay Dawani

Starting point is 00:00:07 Hello, my name is Daniel Nenni, founder of SemiWiki, the Open Forum for Semiconductor Professionals. Welcome to the Semiconductor Insiders podcast series. My guest today is Jay Dawani, co-founder and chief executive officer of Lemurian Labs, where he leads the company's mission to reinvent AI infrastructure for greater efficiency, accessibility, and performance. With a background spanning AI system architecture, hardware-software co-design, and performance optimization, Dawani has built AI-powered systems for autonomous vehicles,

Starting point is 00:00:37 spaceflight, and large-scale model deployment. Before founding Lemurian Labs in 2022, he advised on NASA's Mars rover program at Geometric Energy, focusing on vision-based navigation and planetary mapping. A frequent speaker and commentator, Jay is passionate about shaping AI infrastructure to broaden access, accelerate innovation, and unlock new frontiers in intelligent computing. Welcome to the podcast, Jay. Thanks for having me, Danny. Yeah, first may I ask what brought you to co-founding Lemurian Labs? It sounds like an interesting story.

Starting point is 00:01:10 A series of fortunate accidents is the best way to describe it. Running up against problems standing in the way, usually. So you think about starting early on, thinking a lot about how to even build an intelligent system, how do you think about training them, the amount of data you need to train them, the amount of computational intensity, like training a very, very large model. At least at the time it was considered very large, but by today's standards, it's fairly small. Thinking about just a 20 billion parameter model, you know, eight years ago, that could essentially do the entire sense-plan-act loop for our autonomous system deployed in the wild would require

Starting point is 00:01:55 a few thousand GPUs and several months of training time and then more months of validation time. And that was just way too expensive to even consider in most cases. So you have to be right. So you take a safer bet usually. And it just didn't make sense to me because the promise of autonomous systems is very, very obvious. And it's not like we technically don't know how to get there. It's just that we don't have the right set of breakthroughs that have allowed us to bring the cost into a certain place to allow us to innovate on them in the way we should.

Starting point is 00:02:29 And that sort of drove the trajectory for most of my career. And then eventually, I started thinking a lot more about hardware, because I was considering joining a certain chip company at the time, and I started getting this feeling that the inference workload for an autonomous system is going to be fairly different from how a GPU works. GPUs are designed for predictable memory access, for high throughput, aggressive batching, and for largely compute-bound workloads. But if you look at a system at the edge, you know, you basically have an SoC and you want the entire workload fitting on this device and you care a lot about latency. You want to be processing like a terabyte per second over multiple interfaces.

Starting point is 00:03:17 And you want to be running this massive model at 60 frames a second and so forth. And that just didn't exist. And I felt like that was the thing that was holding back the industry from getting to where it was supposed to be. Interesting. Yeah, you know, the industry's answer to AI's compute demands has been bigger chips, of course, you know, more chips, more power. Lemurian's bet is that this curve is breaking, which I agree with, by the way.

Starting point is 00:03:45 What signals convinced you that we've hit the ceiling of what hardware-first scaling can deliver? And what made you confident enough to build a company around the alternative? So there's a few things in there that I'd want to see if we can like push aside and sort of disambiguate. There's one premise in there that hardware has hit a ceiling. I think, firstly, the hardware that we have is absolutely incredible. The fact that we are even here is by itself a miracle on top of miracles.

Starting point is 00:04:16 What we've actually hit is sort of more of a comprehension ceiling. So the gap between what hardware can do and what the software understands and how we use it. So the problem is not so much hardware as it is software. And that gap keeps widening every generation. So you look at today's systems, you know, a lot of training workloads, people are talking about anywhere between 45 to 50% utilization. And that's at like some of the frontier labs, where they have the best people working on this for months at a time, and they know how to, like, right-size their workloads and their

Starting point is 00:04:56 runs for the system that they have. And then you talk to AI companies in terms of what their inference looks like. A lot of the time, it's not surprising to hear under 15%. And that's not because the hardware itself is flawed. It's because the software stack was designed for a certain reality, which right now is ensuring that the hardware isn't fed enough in order to get the performance. So GPUs are idle most of the time, waiting. They're waiting for data to arrive from memory and waiting for communication collectives to complete so that the next kernel can be scheduled.

Starting point is 00:05:38 So the time between a kernel finishing and waiting for data for the next kernel to start executing is the thing that drives lower utilization right now. And ALUs are basically free. Data movement is the bottleneck. The industry has largely been optimizing for the wrong thing at the wrong level for too long. Moving data from memory to an ALU costs about 100 times more energy than the computation itself. So almost 70% of total system energy in a chip goes towards moving data around.

Starting point is 00:06:13 The arithmetic piece consumes maybe 10 to 15%. The entire software stack has been built around how do we optimize the 10 to 15% instead of the 70%. The hardware is not getting simpler, it's getting far more complex. It is becoming more heterogeneous in nature. Because chiplets invite a certain kind of heterogeneity as well, different memory hierarchies are a different kind of heterogeneity, different kinds of interconnects are a different kind of heterogeneity, scale-up is its own kind, there's different topologies, all of that impacts

Starting point is 00:06:46 software. You have to take all of those things into account when you are writing good software. And if you talk to some of the hyperscalers right now, they will tell you they need a thousand x more capability at basically the same power. So the binding constraint right now is how well software can exploit the system. So obviously don't go replacing hardware investment. You still need more hardware. We need better hardware. We need it faster.
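
The energy argument above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming the shares quoted in the conversation (about 70% of system energy for data movement, 10 to 15% for arithmetic). The 12.5% midpoint, the "other" bucket, and the function name `relative_energy` are illustrative assumptions, not figures or code from Lemurian Labs.

```python
def relative_energy(shares, component, factor):
    """Total energy, as a fraction of baseline, after one component gets `factor`x cheaper."""
    return sum(
        share / factor if name == component else share
        for name, share in shares.items()
    )


# Shares quoted in the conversation: ~70% data movement, 10-15% arithmetic.
# The 12.5% midpoint and the "other" remainder are assumptions for illustration.
shares = {"data_movement": 0.70, "arithmetic": 0.125, "other": 0.175}

arith_2x = relative_energy(shares, "arithmetic", 2.0)
move_2x = relative_energy(shares, "data_movement", 2.0)

print(f"Arithmetic 2x cheaper:    total energy is {arith_2x:.3f} of baseline")
print(f"Data movement 2x cheaper: total energy is {move_2x:.3f} of baseline")
```

Under these assumed shares, halving the cost of arithmetic trims total energy by only a few percent, while halving the cost of data movement trims it by roughly a third, which is the imbalance being described.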

Starting point is 00:07:15 Our focus at Lemurian, generally, is making sure you get the most out of it with the least amount of effort. In other words, our focus is maximizing useful intelligence per joule. Oh, interesting. You know, it's an open secret that most GPU clusters run at a fraction of their theoretical performance, with much of that gap sitting in the software stack. So where does Lemurian see the most painful bottlenecks today? And why has the industry struggled to close them despite billions in infrastructure spending? Yeah, it's interesting, right?

Starting point is 00:07:49 We've known about these problems for so long, but we still haven't solved them. And that goes back to what I was saying earlier a little bit, right? Hardware has been moving faster than software and the abstractions haven't changed in a very long time. To a large degree, you know, hardware has moved far beyond what we had 25 years ago, but the software stack is still in the same abstraction. So we're still in this world of programming a single GPU at a time, assuming a compute-bound and regular workload, while we're trying to program a supercomputer.

Starting point is 00:08:22 But let's take a look at the atomic unit of the AI economy as it were, which is generating a token from an LLM. So the chip itself was designed for a workload with two orders of magnitude more computation per byte. And then the way to sort of recover that is by batching. That's how you get the most useful work out of this. And if you process, say, 30 sequences simultaneously, you can boost utilization to 35, 45%. But then you think about the production conditions, and they don't usually permit an ideal batch size.
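
The batching point can be illustrated with a simple roofline-style estimate. This is a minimal sketch, not a model of any specific GPU: `PEAK_FLOPS` and `MEM_BANDWIDTH` are hypothetical values chosen so the chip needs about 100 FLOPs per byte to stay busy, and `decode_intensity` assumes each 2-byte weight is read once per decode step and used for one multiply-add per sequence in the batch. It ignores attention and KV-cache traffic entirely.

```python
PEAK_FLOPS = 3.0e14      # hypothetical peak compute, FLOP/s
MEM_BANDWIDTH = 3.0e12   # hypothetical memory bandwidth, bytes/s


def decode_intensity(batch_size, bytes_per_weight=2.0):
    """FLOPs per byte of weights read: one multiply-add (2 FLOPs) per sequence."""
    return 2.0 * batch_size / bytes_per_weight


def roofline_utilization(intensity):
    """Upper bound on compute utilization at a given arithmetic intensity."""
    attainable = min(PEAK_FLOPS, MEM_BANDWIDTH * intensity)
    return attainable / PEAK_FLOPS


for batch in (1, 8, 32, 128):
    ceiling = roofline_utilization(decode_intensity(batch))
    print(f"batch={batch:4d}  utilization ceiling={ceiling:.0%}")
```

With these assumed numbers, a single sequence is capped near 1% of peak and the ceiling rises linearly with batch size until the workload becomes compute-bound, which is why batching is the usual way to recover utilization.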

Starting point is 00:08:59 Because there's a lot of things around latency and user experience and SLAs that you have to go and think about. Then there's the context length. The context length kills the gains from batch size. So there's a trade-off there. If I have a 32K token context, I can fit maybe four sequences on a GPU. And then if I do that, then my model FLOPs utilization, the MFU, will fall to less than 10%. We keep evolving the workload at a faster rate than we're able to optimize things. And the direction in which AI wants to evolve and the way hardware is evolving are pulling apart the kernel model.
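
The trade-off between context length and batch size comes down to KV-cache memory. Here is a minimal sketch of that calculation. The speaker does not name a model or a GPU, so every dimension below (80 layers, 8 KV heads, head dimension 128, fp16, 80 GiB of device memory with 40 GiB taken by weights) is a hypothetical choice, picked so the result lands near the "maybe four sequences" figure mentioned above.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_sequence(context_len, n_layers, n_kv_heads, head_dim,
                                bytes_per_elem=2):
    """Bytes of keys plus values cached for one sequence at full context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


def max_sequences(device_bytes, weight_bytes, per_sequence_bytes):
    """How many full-context sequences fit in the memory left after weights."""
    return max(0, (device_bytes - weight_bytes) // per_sequence_bytes)


# Hypothetical model and device, for illustration only.
per_seq = kv_cache_bytes_per_sequence(
    context_len=32 * 1024, n_layers=80, n_kv_heads=8, head_dim=128
)
fits = max_sequences(device_bytes=80 * GIB, weight_bytes=40 * GIB,
                     per_sequence_bytes=per_seq)

print(f"KV cache per 32K-token sequence: {per_seq / GIB:.1f} GiB")
print(f"Sequences that fit alongside the weights: {fits}")
```

Because the cache grows linearly with context length, doubling the context in this sketch halves the number of sequences that fit, which is how long contexts erase the utilization gained from batching.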

Starting point is 00:09:42 So kernels generally assume static, explicit management with fixed launches, coarse-grained data transfers, and compile-time decisions that can't adapt to runtime surprises like uneven execution time due to data-dependent branches or unpredictability in exascale setups. That leads to like 70, 80 percent, you know, wasted cycles on just coordination and memory overhead, which results in a system only achieving 20 to 30 percent of the theoretical peak. And by theoretical peak in this case I'm referring not to what hardware vendors are putting out there on their sheets, but the theoretical peak for a workload

Starting point is 00:10:27 on a given system under ideal conditions. So all these static approaches that we have require conservative over-allocation, and they can't predict which device will have access to what data at runtime. These are the things that need to be addressed. So the compiler world has been too conservative, and they're sort of hoping things will magically get solved by better kernels, and that's what the industry has been writing, just better kernels, but better kernels only marginally move the needle. We need a radical rethink. That's much harder to do.

Starting point is 00:11:02 Interesting. You know, software-first AI infrastructure sounds abstract, but in concrete terms, what changes for an ML team running production workloads on Lemurian's stack, you know, in how their models get compiled, scheduled, executed, that they can't get from today's tools? Yeah. So if you think about teams today, there are many different functions of what you call an AI engineer. You have people who are collecting data, they're labeling data, they're cleaning it. There's people that are doing research. And if you go to a frontier lab, like, let's just assume that they have

Starting point is 00:11:40 a split of around 30% compute for R&D, 30% for large pre-training, and 30% for inference. You have to choose your bets. And once the research part is done, there's a separate team that is doing larger-scale pre-training, thinking about that infrastructure, that environment, to optimize it, to give you the best bang for your buck on that. And you have to complete model training in a certain timeline, of course, in order to get a model out in the market, because there's implications if you don't. And then you think about the inference side.

Starting point is 00:12:11 There's a separate team that takes these frozen weights from this model's checkpoint, and then they are optimizing it for the hardware that they're going to be running this model on. There's a massive delta of months between when the model is done training and when it's validated to when it's deployed on this hardware. And sometimes it's different vendors, different stacks, different kernels, different conditions, different clouds, a whole bunch of problems. That dramatically slows you down. These different teams, they can't cross-talk, they can't share insights, they have to restart every single time. That's meaningfully slowing things down in a way that is not really great. Because now your ideas and what you can even test is based on the inductive bias of the hardware and what that hardware wants you to do.

Starting point is 00:13:01 So most ideas are actually killed based on whether or not I have kernels and infrastructure to support my ideas to begin with, and whether or not it'll run well on this hardware. Now let's talk about what we are going to eliminate, and then we can talk about where we're going. The problem, as I said, is that this optimization is bespoke. And our design philosophy starts from one question: what if the optimized path was the default path? As in, your PyTorch model goes in, you've expressed an idea of something you want, and what happens next is different from the current stack and how we do things,

Starting point is 00:13:40 in that we don't compile it at the kernel level. We compile for the system, in that we aren't building a parallelizing compiler. We are building a compiler that is natively parallel and understands the system. And that insight makes certain things possible. You stop thinking about a cluster as a collection of individual chips and think of it now as a single distributed computer, essentially. When you write a kernel, you're writing for a specific part of a chip.

Starting point is 00:14:12 When you write for our stack, you're just expressing computation. The system figures out how to execute it across whatever hardware is available and then figures out the data movement, the routing to minimize it, the scheduling, how do you overlap communication with computation, how do you adapt when things change, as they inevitably do. So now you can essentially get the maximum performance out of the system as you take an idea from a single GPU, scale it up to a large pre-training run, freeze that, and deploy onto the hardware you want the next day instead of months later. That velocity that you gain, that is what we're giving you, while giving you portable performance.
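
Overlapping communication with computation, mentioned above, can be reasoned about with a simple timing model. The sketch below is a generic two-stage pipeline estimate, not a description of Lemurian's scheduler. It assumes each step has a fixed compute time and a fixed communication time, and that the communication for one step can run while the next step computes. The function names and the millisecond values are illustrative assumptions.

```python
def serial_time(n_steps, compute_ms, comm_ms):
    """Each step computes, then waits for its communication to finish."""
    return n_steps * (compute_ms + comm_ms)


def overlapped_time(n_steps, compute_ms, comm_ms):
    """Communication for step i runs concurrently with compute for step i+1."""
    if n_steps == 0:
        return 0.0
    # First compute and last communication cannot be hidden; the middle
    # is paced by whichever of the two stages is slower.
    return compute_ms + (n_steps - 1) * max(compute_ms, comm_ms) + comm_ms


# Hypothetical per-step timings, in milliseconds.
steps, compute_ms, comm_ms = 10, 1.0, 0.8

t_serial = serial_time(steps, compute_ms, comm_ms)
t_overlap = overlapped_time(steps, compute_ms, comm_ms)

print(f"serial:     {t_serial:.1f} ms")
print(f"overlapped: {t_overlap:.1f} ms")
```

In this model the overlapped schedule approaches the cost of the slower stage alone as the number of steps grows, so the payoff depends on how well compute and communication times are balanced.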

Starting point is 00:14:54 Okay. So, you know, for years, writing to a specific accelerator was the path to peak performance. Why is hardware specificity becoming a liability for AI teams? And how can a hardware-agnostic software layer reverse this trend without sacrificing performance? A different workload is going to have a different characteristic. What you would do for a monolithic LLM versus a mixture-of-experts LLM versus a reasoning LLM versus a tool-calling and agentic LLM, they have different balances. In some cases you can find, like, yeah, if I have an agent, the right balance could end up actually being one CPU, one GPU and four GPUs or two CPUs and eight GPUs.

Starting point is 00:15:40 And then you'd have to write for them differently. And then you have all the chips like the MTIA and the Maia and the Trainium and the TPUs, which are also fragmented into training versus inference. And then there's the AMD chips, and then if Intel decides to release one again, and then the Qualcomms, Cerebrases, Tenstorrents, and so on. I think we're ending up in a world where every single piece of hardware that is stamped out and produced will be plugged in and will have to contribute to the economy. And we will end up in a world where we are right-sizing workloads to hardware in order to maximize this intelligence per joule or intelligence per dollar.

Starting point is 00:16:24 So now if your software stack embeds assumptions about one vendor's hardware, which, you know, every single vendor stack obviously does because it's designed for that one chip architecture, you've made a strategic commitment as an organization to something with a shorter half-life than your production system. That's a very dangerous place to be. So your multi-year inference pipeline is optimized for hardware that could end up being obsolete in 18 months because you have to jump to the newest hardware in order to be competitive. And in our case, we're not trading performance for portability. That trade-off exists in a certain world, under certain constraints. We've changed the problem itself to allow us to solve those problems in a very unique way. You know, write once, run slow everywhere. That used to be the case.

Starting point is 00:17:11 OpenCL tried this, many others have tried this. They produce mediocre code on every platform. That's not what we do. Our compiler is generating different optimized code for each hardware target, for each system. All right, so assuming the next decade of AI progress comes from software rather than new silicon, how should CTOs and infrastructure leaders rethink

Starting point is 00:17:31 how they evaluate cost and performance and long-term scalability? What should they stop optimizing for? A useful metric that I think people should adopt and I haven't seen enough of this, is useful output per dollar. As in how much useful work are you getting out of each dollar you spend on your compute? Any metric that doesn't include engineering costs

Starting point is 00:17:53 should probably be retired. As in, if your infrastructure team spends three, four, six, seven months tuning a new model for a specific hardware, that is a real cost that should be factored in. If switching to a new accelerator requires rewriting your kernel library, you can't ignore that. Time to production matters more today.

Starting point is 00:18:18 If your answer is measured in months, your infrastructure is optimized for the past and not the future. I think that's what I'm ultimately trying to get at. In a world where new model architectures are dropping on like a monthly cadence and new hardware ships every six to 12 months, time to production becomes your most competitive variable. So the team that deploys a new model on the newest hardware the fastest, with the best economics, is the one that wins. And if you can do that faster and get more out of it, you're obviously in a very different league.
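
The "useful output per dollar" metric, with engineering cost included, is simple to write down. What follows is a minimal sketch under stated assumptions: the function names, the token counts, and every dollar figure are hypothetical and chosen only to show how tuning effort changes the comparison. None of them come from the conversation.

```python
def tuning_cost(engineers, months, monthly_cost_per_engineer):
    """Engineering spend on bringing a model up on a given hardware target."""
    return engineers * months * monthly_cost_per_engineer


def useful_output_per_dollar(useful_tokens, compute_cost, engineering_cost):
    """Useful work delivered per dollar of compute plus engineering."""
    return useful_tokens / (compute_cost + engineering_cost)


# Hypothetical scenario A: cheaper compute, but six months of hand tuning.
hand_tuned = useful_output_per_dollar(
    useful_tokens=1.0e12,
    compute_cost=1_000_000,
    engineering_cost=tuning_cost(engineers=5, months=6,
                                 monthly_cost_per_engineer=30_000),
)

# Hypothetical scenario B: 20% more compute spend, one month of bring-up.
portable = useful_output_per_dollar(
    useful_tokens=1.0e12,
    compute_cost=1_200_000,
    engineering_cost=tuning_cost(engineers=1, months=1,
                                 monthly_cost_per_engineer=30_000),
)

print(f"hand-tuned: {hand_tuned:,.0f} tokens per dollar")
print(f"portable:   {portable:,.0f} tokens per dollar")
```

In this made-up comparison the option with the higher compute bill still delivers more useful output per dollar once the tuning months are counted, which is the point about retiring metrics that omit engineering cost.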

Starting point is 00:18:47 So don't think about infrastructure as a cost center you minimize; think about it as a capability that you maximize. So a lot of AI teams are still chasing wins at the kernel or model level while system-level performance is on the table. Where do you most often see organizations misjudge their real performance? And what does Lemurian's approach reveal that component-level optimization misses? I think the deepest misjudgment is probably one of category error: confusing component-level efficiency with system efficiency. An organization that profiles their matrix

Starting point is 00:19:24 multiply kernels and they say they get like 80, 85% of peak throughput, then they have the attention kernels at like 65, 70%. They conclude that their workload and the stack are well optimized. But they've sort of only measured, you know, a part of the system, a small part. They haven't measured the end-to-end execution time of a workload. What we're sort of trying to do is say, instead of thinking about new kernels and having all these hardware-specific insights, have this full data flow graph of the computation across the entire system, and not just what each operation is and what it does, but like how data moves between all of them. And how do you optimize that?
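
The gap between kernel-level and end-to-end efficiency can be made concrete with a small calculation. This is a minimal sketch, assuming the kernel efficiencies quoted above (about 85% for matrix multiply and 65 to 70% for attention). The time split between the kernels and the amount of exposed waiting between them are hypothetical values chosen for illustration.

```python
def system_utilization(kernels, exposed_wait_s):
    """End-to-end utilization given (busy_seconds, efficiency) per kernel
    plus wall-clock time spent waiting on data or communication."""
    busy = sum(seconds for seconds, _ in kernels)
    useful = sum(seconds * efficiency for seconds, efficiency in kernels)
    return useful / (busy + exposed_wait_s)


# Efficiencies from the conversation; durations are hypothetical.
kernels = [
    (0.6, 0.85),    # matrix multiply kernels
    (0.4, 0.675),   # attention kernels, midpoint of 65-70%
]

kernel_only = system_utilization(kernels, exposed_wait_s=0.0)
end_to_end = system_utilization(kernels, exposed_wait_s=2.0)

print(f"kernel-level view: {kernel_only:.0%}")
print(f"end-to-end view:   {end_to_end:.0%}")
```

With two seconds of exposed waiting for every second of kernel time, the same kernels that look 78% efficient in a profiler deliver about a quarter of peak end to end, in line with the 20 to 30 percent range mentioned earlier in the conversation.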

Starting point is 00:20:12 Where does communication overlap with computation? How do we do that well? How do we tune things for the hardware topology? Because we have our own versions of things that are NCCL-like, but they can be tuned to different infrastructures and different topologies. The waste that no one's measuring, I think, is the one that's worth the most today. That's what we're sort of focusing on. Right.

Starting point is 00:20:39 So final question, Jay. New accelerators are landing every few months, each with its own programming model. What role does Lemurian's software stack play in insulating developers from that churn, letting them target whatever hardware makes sense without rewriting their workloads, while still extracting performance that rivals,

Starting point is 00:20:57 you know, hand-tuned code? Yeah, it's a tricky problem. And this is where I think we're different. I don't think hand-tuned code is the problem. I actually think for most hardware, a better kernel makes your hardware look worse, right? Because you're exposing the latency of the system. So the question you need to ask yourself if you're writing kernels is, at what point does better stop being better? At what point is it just vanity?

Starting point is 00:21:23 And I don't see hardware diversity as a threat that needs to be managed. I think it is an advantage that we need to be exploiting. In our case, we don't have to do the things people would normally have to do. Because with each new system, with every single change, there are changes in the compiler and new bring-up times; if there's a new memory hierarchy, you have to go and capture that, rewrite kernels, update the libraries, and put that in the customer's hands. In our case, we just add a hardware model into our compiler, so we can parameterize it so that

Starting point is 00:22:04 it can do hardware-specific optimizations. A lot of the hardware ourselves, we're very specific in terms of the problem we're solving. We understand how to build a hardware-independent compiler that is parallel, but it can do a lot of these highly optimizing things specific to systems with hardware information. It's a very, very different problem. Portable performance has always been a point of contention. And the contention has been that you're always solving for the lowest common denominator. If you have portability, it comes at the expense of performance. And if you want performance, you have to exploit the hardware to its fullest.

Starting point is 00:22:43 If you are network bound and memory bound, your performance is not where you think it is. What you need to do is optimize data movement and maximize it from the point of view of the memory hierarchy of the system that you are targeting, and right-size the work and place it ahead of time where it's supposed to be to intercept the memory to keep the ALUs maximally busy. It is a very, very different way of thinking about the problem. And that is how you can do this and address different kinds of systems, different scales, different workloads as things are changing and insulate yourself from all this.
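
One generic way to picture a compiler parameterized by a hardware model is tile-size selection driven by on-chip memory capacity. The sketch below is purely illustrative and is not Lemurian's compiler or API: `HardwareModel`, `largest_square_tile`, the target names, and the memory sizes are all hypothetical. It picks the largest square matrix-multiply tile for which three tiles (two inputs and one output) fit in on-chip memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareModel:
    """Hypothetical description of a target; a real model would carry far more."""
    name: str
    on_chip_bytes: int


def largest_square_tile(hw, bytes_per_elem=2, multiple=16):
    """Largest T (a multiple of `multiple`) such that three T x T tiles fit on chip."""
    best = 0
    tile = multiple
    while 3 * tile * tile * bytes_per_elem <= hw.on_chip_bytes:
        best = tile
        tile += multiple
    return best


# Two made-up targets with different on-chip memory sizes.
small = HardwareModel(name="target-a", on_chip_bytes=64 * 1024)
large = HardwareModel(name="target-b", on_chip_bytes=192 * 1024)

for hw in (small, large):
    print(f"{hw.name}: tile size {largest_square_tile(hw)}")
```

The same source computation gets a different tiling on each target because only the hardware description changed, which is the general idea behind sizing work to the memory hierarchy instead of rewriting kernels per device.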

Starting point is 00:23:20 Great conversation, Jay. Thank you for your time. It's a pleasure to meet you and hopefully we can catch up later on in the year and get a status. Absolutely, that would be lovely. That concludes our podcast. Thank you all for listening and have a great day.
