In The Arena by TechArena - A Mathematical Revolution in AI with Lemurian Labs CEO Jay Dawani

Episode Date: October 17, 2023

TechArena host Allyson Klein chats with Lemurian Labs co-founder and CEO Jay Dawani about his vision for a mathematical revolution in AI and how his company plans to change the game on AI performance....

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and I am really delighted to have Jay Dawani, co-founder and CEO of Lemurian Labs, with me. Welcome, Jay. Hi, thank you for having me, Allyson. Delighted to be here. So Jay, I've had a lot of people on the show who I would say
Starting point is 00:00:38 have written a book on things, but you've actually written the book on AI and ML. So tell me a little bit about your background and how you came to be the CEO of Lemurian Labs. Yeah. So I can't go all the way into the background, because I don't think we have enough time for all the bizarre things that I've attempted. But a few years ago, I was working as director of AI at a company called Geometric Energy. And I was trying to build a team to allow us to solve a lot of the problems we were facing. And I interviewed a few hundred people for the roles. And out of those few hundred, maybe 10 could actually answer the questions that I had.
Starting point is 00:01:13 And only two could actually write any of the code that I was hoping they would. And it was really disappointing to see that, because we need more ML engineers who actually understand the fundamentals, who understand them well enough to manipulate and create new workloads and new algorithms. Because I don't believe we just stop here. You can't just keep pulling the open-source workloads we have and working with the same models and the same data sets. And it was sort of out of that frustration that I decided, you know what? I understand the math well enough.
Starting point is 00:01:43 Maybe I'll write a book that is accessible and builds on high school mathematics foundations so that more people can get into the field. It's not as daunting as people think. And I wanted to fix that misconception, because you don't need a PhD to be an ML engineer. I don't have a PhD in ML. I dropped out of university.
Starting point is 00:02:02 And if I can do it, they can too. Now, you and I have talked off the air about the importance of this moment in the industry for the development of AI technology, and we're at the AI Hardware Summit. This journey that you've had into ML, looking at various fields like robotics and autonomous transport and how they're using machine learning, has led you to a company that is actually developing hardware. How does this challenge inform underlying hardware requirements? And how do you see the interplay between hardware and software today? Well, that's a loaded question. So when I was working in a lot of these areas,
Starting point is 00:02:46 I was really excited that computer vision seemed to be more or less solved. And that's something that had been holding us back for a very long time. You no longer have to hand-engineer features and try to make a classifier. You can learn all those features. And the larger your model gets, the less of a prior you need,
Starting point is 00:03:06 so your model can figure it out. And it can learn features that you may not have thought of. And then you also had reinforcement learning that was showing some really, really great promise so that you could couple the two and have a vision-based decision-making system. And that is essentially what brought people to research a lot of the robotics ideas.
Starting point is 00:03:25 Autonomous vehicles were born out of the observation that computer vision was almost solved. With enough compute, we could get there. With enough data, we could get there. With the data we collected, we made a lot of progress as an industry, but we still have a massive problem ahead of us, because the workload isn't done evolving yet. It needs a lot more. And more importantly, we don't have enough compute in those environments at the edge to make autonomous robots work. The model has to grow, and the more it grows, the more memory it needs and the more FLOPs it requires, and hardware is not progressing at the rate it needs to to enable that, especially safely. Because if you start quantizing models too aggressively to try and make them fit, you're creating more errors inside the model that you can't account for.
Starting point is 00:04:13 So I had a case at a self-driving company where I had a one-in-a-million probability that my model would misclassify and react wrongly to a woman pushing a baby in a stroller across the street. That's not a scenario I want to go wrong. But because of compute constraints, I had to quantize this model and deploy it. And that probability ended up being 1 in 10,000. And I had no idea how to fix this with my training data set, or what was changing inside the model. And that's where I started to think about the importance of numerics and hardware, and how they need to be evolving for these workloads.
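To see the failure mode Jay describes in miniature, here is a small NumPy sketch of naive symmetric int8 post-training quantization. The weights, sizes, and scaling scheme are invented for illustration and have nothing to do with the actual self-driving model:

```python
# A purely illustrative sketch of how naive symmetric int8 post-training
# quantization perturbs a layer's output (all values invented for the demo).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.05, size=(256, 256))  # stand-in for trained weights
x = rng.normal(0.0, 1.0, size=256)          # one input activation vector

# Symmetric quantization: map [-max|W|, +max|W|] onto the int8 range [-127, 127].
scale = np.abs(W).max() / 127.0
W_int8 = np.round(W / scale).astype(np.int8)
W_deq = W_int8.astype(np.float64) * scale   # what the quantized model computes with

y_ref = W @ x                               # full-precision output
y_q = W_deq @ x                             # quantized-weight output
print("max output drift:", np.abs(y_ref - y_q).max())
print("relative error:  ", np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref))
```

Small per-weight rounding errors accumulate across the matmul, which is why a quantized model's decision boundaries, and its failure rates, can shift in ways the training data never explains.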
Starting point is 00:04:51 It was really, again, I guess most things for me come down to being frustrated at some point, so I started doing something. I started thinking more about the hardware, and looking at why hardware wasn't evolving in the direction that we needed. And I learned that we made a lot of choices a few decades ago that we've been continuing to ride along with. And it wasn't anyone's fault. They were the best architectures at the time for the workloads we were looking at. But when the workload changes, hardware has to evolve with it to make it work. Hardware normally leads: when we made the switch from single-core to multi-core, software took a decade to catch up. And it's still somewhat catching up, especially as hardware continues to evolve.
Starting point is 00:05:29 But now, software has gone off in its own direction, because we had that switch from software 1.0 to 2.0, and hardware never really fully evolved with it. We continued to use the same platform. And now that we're ripping out bits, trying to go for smaller number formats, there's a diminishing return from going for more efficient number formats, because the architecture can't benefit. It spends more energy on the instruction decode and data movement side than it does on the math. So if you save a couple of percent of energy just by changing the math,
Starting point is 00:06:01 it doesn't matter. You're still dominated by movement. And it was really those observations together that got us thinking that maybe there's an opportunity for introducing a new kind of computer architecture. We made the transition from scalar to vector; let's make the geometry matrix now, because that's what this workload is asking for. So instead of going for SIMD, we went for spatial. And we've been working very, very diligently on our software stack, because to me, accelerated computing needs to be about enabling a workload and the engineers writing those workloads. We can't make their
Starting point is 00:06:35 lives harder. So let's look at how the workload wants to run and how the workload may evolve. Jeff Bezos said a long time ago that when times are changing, it's more important to look at what stays the same, because only when things are somewhat static can you actually start creating value. If the workload is changing, you don't know where it's going to end up. Everyone is kind of guessing. So that's how we approached this. What are the kernels that are going to stay the same? What is going to change? Data flows are constantly changing. The underlying kernels aren't really changing all that much. Right.
Starting point is 00:07:08 Numerical requirements are not changing that much. And that gave us a sense of how we should start thinking about the system. And then, of course, thinking about the edge, power is important. And I was a little bit optimistic about the timing around the edge. So we brought ourselves up to the data center, because a lot of folks had come to us and said, what you're doing seems really great, but we want to try and use this for this application. So we rethought a lot of the architecture side.
Starting point is 00:07:35 But to answer your question, we need to be more software-centric right now. We need to build around engineers, their needs, and their workloads. We can't go chasing really cool TOPS-per-watt numbers, because they don't translate to real-life benefit. Right. You know, when you're talking, it makes me think about so many things
Starting point is 00:07:53 that I've experienced in my career in the silicon arena, knowing that software developers have always been constrained by the architecture they're writing toward. That's still the case. And I think that we are hitting a peak with AI moving so quickly; the architectures we are utilizing are imposing constraints. You've introduced a new concept at Lemurian around a new number system for AI that I want to talk to you about, and why that's so important.
Starting point is 00:08:23 I know your team is very hard at work developing silicon. When you look at introducing a new architecture, you've talked a little bit about it, but why now, and why a new number system? Yeah. So I mentioned earlier that I had a problem with quantization of these models. Now, the entire industry a few years ago got really, really excited. And they're like, hey, let's just jump to Int8 because it'll work. And I don't think any AI engineer was like, yeah, that's a great idea. It was mostly hardware engineers being like, that's how we get performance and efficiency. Because we save, like, what, 20x on the matmul, 4x on data movement, and so on. And that's great.
Starting point is 00:09:05 But you still need to run the workload. So convolutions could get quantized fairly well, but that's a small subset of workloads. Fully connected networks could get quantized fairly well. And then you have transformers, which are largely matmuls, but they have other components too that are not very quantizable. And these workloads are still going to evolve. They're going to get more complicated.
Starting point is 00:09:26 They're very sensitive to perturbations, which is exactly what changing numerics introduces. And you can think of it as a dynamical system: if something is constantly flowing, slight perturbations here and there will lead you astray, so you want the system to be as stable as possible. And the DSP world has been doing this for decades. They've looked at a workload, exploited the numerics very, very smartly, and then built the right architecture that would solve
Starting point is 00:09:52 and be the right tool for the job. And it was really that thinking that we brought in. We've never done that in general purpose computing. We're starting to see that a little bit. GPUs, they give 1,000x in 10 years, they claim. Most of it was from number formats and specialization. Very little came from node scaling. Interesting.
Starting point is 00:10:13 Yeah. So now that we have this workload that is demanding so much, and it's growing at 5x the rate of hardware: training FLOPs demand is growing at 10x a year. Assume hardware grows 2x a year. We've gone from a 5x gap in year one to more than a 15,000x gap in year six.
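As a quick back-of-envelope check on that gap, using only the growth rates quoted above:

```python
# If demand for training FLOPs grows ~10x per year while hardware improves
# ~2x per year, the shortfall compounds at 5x per year.
for year in range(1, 7):
    gap = (10 / 2) ** year
    print(f"year {year}: demand outpaces hardware by {gap:,.0f}x")
# year 1 -> 5x ... year 6 -> 15,625x, the "15,000 plus" figure above
```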
Starting point is 00:10:33 Right. Right? So that's an alarming rate of growth. We've never experienced something like that as an industry. And we can't use the same bag of tricks to keep going. It's not sustainable. Yeah. There's a saying that the status quo is Latin for a challenge: it's something you should be throwing down, something to be challenged. Oh, interesting.
Starting point is 00:10:58 Yeah. Why are we continuously stuck in certain ideologies because they're comfortable? The world is changing. We should too. Change isn't comfortable, which is probably why people push back. But we need to rethink computer architecture for these workloads. And a big part of that is number systems.
Starting point is 00:11:21 At the end of the day, think of a car, a motorcycle, or an airplane. Each is a transportation vehicle. They get you from point A to point B. But the engine is different. The body is different. You're not going to take the engine from a motorcycle
Starting point is 00:11:38 and put that in an airplane. That just doesn't work. Or the other way around. You have to right-size it for the task that you're trying to solve. Because right now, to the architectures we have, everything looks like a nail, and they're a hammer. But that's not what you want all the time. There are a lot of problems where you just want a good tool for the job.
Starting point is 00:11:59 So number systems, they're an engine. They need to get fed. At the end of the day, compute is computation. It's math. So we thought about how much precision and dynamic range these workloads actually need. And when thinking about that, and the trend people were heading toward,
Starting point is 00:12:16 it kind of became obvious that we were going to run out of bits very soon. And that was going to make software engineers' jobs really hard, because we can't deploy these systems safely. So we went back to a really old idea: log number systems. Log number systems are great because multiplications become additions and
Starting point is 00:12:33 divisions become subtractions, and those are very cheap operations. But there's a caveat: addition and subtraction in a log format are very hard. So what people did early on was use a lookup table to convert from the log domain to the linear domain and do the addition or subtraction in the linear domain,
Starting point is 00:12:50 where it was easier to do. But those lookup tables end up growing very, very big. So any return that you get from that benefit of turning multiplication to addition gets kind of mitigated away. And there's sort of the holy grail of number systems that if you can make surely logarithmic format work. It's like a 350-year-old math problem.
Starting point is 00:13:12 Right. And it's been 50 years in the VLSI world. But me and my co-founder, my co-founder is one of the experts in arithmetic and logarithmic systems, and we came up with a way of actually making addition in a log format work. Wow. And we've developed multiple types, all the way from 2-bit to 64-bits to stack up against floating point. And where floating point scales linearly and gets worse and worse as you add more bits, we kind of taper off something in that way. So pound for pound or bit for bit this thing has better precision dynamic range and the signal to
Starting point is 00:13:47 noise ratio and so from instead of saying a ppa perspective we call it ppp perspective performance power precision this thing wins all day long and the interesting thing was number systems can have a morsel-like effect whereas they shrink your architecture opening up space in your silicon so you can attempt more right that's what brings right gave rise to multi-core and that's always been holding back a lot of different architectures that could have been very promising right because if you just change the architecture alone and keep everything else the same like if you have the same IP as every other of your competitors, you have the same workload and constraints on software,
Starting point is 00:14:30 and all you change is the architecture, you're bound to, at best, a 30% improvement. And if that's the case, you'll just stay with whoever the market leader is because they have the best software stack. You know it's going to work. But if you really want to break free from that, you want to deliver something that's going to create value,
Starting point is 00:14:46 you have to rethink number formats and software for what the application's asking for and design and architecture around that. And that's what we did. What are the performance you threw out some numbers there, but what are the performance deltas that you're
Starting point is 00:15:01 looking at? We're targeting a 300 watt PCIe form factor. It's got a crap ton of HBM on it. And that's probably because the workload is very, very memory-centric and it needs a lot of data movement, especially with transformers. And this thing can do three and a half petaops of dense compute. With sparsity, obviously, we can get a 2x. But that is massive.
Starting point is 00:15:28 And this isn't a five nanometer process. And we'll be releasing a lot more down the line about the architecture, more specifics, how our software stack works, as well the number system, showing how to do quantization in it, as well as training in RCA
Starting point is 00:15:43 number form and then quantize down or from floats to ours. And we'll also be sharing MLPerf numbers on our architecture in our simulation in the name of openness. We want to be open. We don't want to hide it. We want to give engineers more freedom, more control of the software stack and the architecture so that they can, in the event the workload changes, you have this function plus one
Starting point is 00:16:07 thing where you have a new kernel that you need to run and you don't want to have to wait for the company to decide it's worthy enough to pack into their software stack and make work. You as an engineer want to have control enough to be able to create your own kernels and fuse them and then compose them to do whatever you want.
Starting point is 00:16:24 And that's how we've thought about this. And we have results that we're going to be sharing again very soon. In fact, I'll probably be sharing that on Thursday. So stay tuned for that. And I think we'll surprise a few people. I'm looking forward to it. The theme at this conference is all about this incredible demand for choice. I think that everyone is looking at this and saying,
Starting point is 00:16:49 we can't have a single supplier fueling something as important as AI. Who are you aiming for the introduction of these platforms and why? So I mentioned earlier that I was really frustrated that I couldn't get to train and deploy the workloads that I was looking at. Around this time, I was venting to a lot of my friends who were in the industry, and a lot of them worked at one of the bigger companies, a lot of startups, and they weirdly felt similar pains to the ones I did,
Starting point is 00:17:18 despite them being much more well-capitalized, having very large compute clusters available. And that's because there's too many people, too many ideas, and there's not enough compute to go around. Very recently, a very large company deployed a very popular workload
Starting point is 00:17:35 and it took up enough compute to the point where 90% of the AI projects in the company were killed. Because they ran out. That's bizarre to think about. A single workload being deployed, taking up all the compute in many, many large-scale data centers. And this is the beginning.
Starting point is 00:17:57 You think about that chasm that is getting created with the growth of model size, if you want to stay at the frontiers, and the rate of compute and the cost of compute and the power consumption, very few companies are going to be able to train these models. I was with a company a few days ago, and they just trained their latest model. And this thing took a little over a few, took almost 10 Yottaflops to train. Wow. That is ridiculous. We just built the first Exascale computer last year. You were talking about Yottaflops to train. Wow. That is ridiculous. We just built the first exascale computer last year.
Starting point is 00:18:26 You were talking about Yottaflops now, not even Zettaflops. And if you build a Zettascale computer today, that's going to take like 20-ish nuclear reactors. Right. That's not sustainable. And even with the current trend of scaling, it doesn't get that much better. 10 years from now, you'll still need a few nuclear reactors to power this. So we need to rethink and reimagine accelerated computing for this workload. And we need to make computing accessible so that more people can come in, more architectures can get trained.
Starting point is 00:18:59 And it's not just five companies in the world that have the computer to train this. Right. And it's not just five companies in the world that have the computer to train this. Because when you have that, the incentive to progress the workload and deliver something as great isn't there. A lot of the learnings, the failures of these models don't surface. And that is partly what I want to fix. I want to make all the startups out there, all the midsize companies, all the tier two clouds that are all starved for compute. I want them to be able to have a voice in this and have enough compute to train AI models and deploy them at scale without, you know, breaking the planet. When you think about that, it almost takes me back to that line in Jurassic Park where, you know, they were so focused on if they could build it, they didn't think about if they should.
Starting point is 00:19:45 And I feel like we're at that moment with AI. You know, I think that you talk about the power, you know, and I read a statistic of, you know, forecasts of AI growing data center energy drop to 20% of what the world consumes. Yeah, that is entirely possible. I think it's going to happen within the next 10 years, if not sooner. There's a lot of talk about AI doomerism. That guy's going to
Starting point is 00:20:14 do a lot of bad in the world. And I've been saying pretty jokingly that it's unlikely that's going to be the case because our pursuit of these models and the rate these models are growing and the carbon emissions resulting from that are probably going to get us first. And that's something we need to fix. And then we can figure out, you know, the killer robots scenario later on.
Starting point is 00:20:34 Right, exactly. But let's paint a rosier picture. Sure. Let's say that you and other innovators in the AI space are successful and delivering some breakthroughs in terms of performance efficiency and core capability and democratizing access, what does the future look like?
Starting point is 00:20:52 What do you want to see from this industry? And what do you want to foment for the broader developer community? So when I was a kid, I got into technology because I grew up in a third world country. And I didn't have access to a lot of what everybody else did. I'd say some Canada and other places.
Starting point is 00:21:13 And I remember the first time I saw a computer and what it could do, it blew my mind. I've always had this view of any technology, once it's built and once it gets to a price point where it gets proliferated, is going to be an amplifier. And every single platform shift, a technological shift, has brought along parts of the old and enabled it much more. So to me, it's the internet democratized access to information. AI turns that information into knowledge and enables you to have access to that. That's a force multiplier. A person that didn't know anything or couldn't read all of those books now can go to a large language model, interact with it, ask it questions, and it can learn better.
Starting point is 00:22:06 Imagine the impact that could have. It's profound. And there's other applications that are also stifling this. When compute gets down and the ag is progressed, we can have autonomous robots, safer roads, better drug discovery, better diagnostics for people. We can, instead of addressing things after that, we can start being preventative, understanding more about our bodies.
Starting point is 00:22:30 That excites me. That's what I want to enable. But most of the people working on those problems are compute staff. We can't attempt it. Right now, the best we can do is build models that interact with us, give some decent looking answers, create some images, which are all really cool because that impacts creators
Starting point is 00:22:50 and a lot of other industries a lot. But there's more important work that needs to get done. We're not there yet. So I can't wait to hear more about what you and the Lemurian team deliver. I can't wait to see your MLPerf delivery later this week.
Starting point is 00:23:04 That's exciting. I'm't wait to see your MLPerf delivery later this week. That's exciting. I'm sure that folks online are intrigued and want to engage with you and your team. Where would you find the, excuse me, where will you send them for more information and to engage? Simplest way is go to our website. There is a contact us page and you will get a hold of us pretty quick. If not, you can always reach out to us on our LinkedIn page as well. And we're pretty responsive. So if you have genuine requests or you want to learn more, we're more than happy to share more. Well, Jay, thank you so much for the time today.
Starting point is 00:23:39 It's been a real pleasure getting to know you and have a fantastic time. Thank you. Thank you as well. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
