Signals and Threads - Why ML Needs a New Programming Language with Chris Lattner

Episode Date: September 3, 2025

Chris Lattner is the creator of LLVM and led the development of the Swift language at Apple. With Mojo, he's taking another big swing: How do you make the process of getting the full power out of modern GPUs productive and fun? In this episode, Ron and Chris discuss how to design a language that's easy to use while still providing the level of control required to write state-of-the-art kernels. A key idea is to ask programmers to fully reckon with the details of the hardware, while making that work manageable and shareable via a form of type-safe metaprogramming. The aim is to support specialization both to the computation in question and to the hardware platform. "Somebody has to do this work," Chris says, "if we ever want to get to an ecosystem where one vendor doesn't control everything."

You can find the transcript for this episode on our website. Some links to topics that came up in the discussion:

- Democratizing AI compute (an 11-part series)
- Modular AI
- Mojo
- MLIR
- Swift

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I'm Ron Minsky. It is my great pleasure to have Chris Lattner on the show. Typically, on Signals and Threads, we end up talking to engineers who work here at Jane Street. But sometimes we like to grab outside folk, and Chris is an amazing figure to bring on, because he's been so involved in a bunch of really foundational pieces of computing that we all use, LLVM and Clang and MLIR and OpenCL and Swift and now Mojo. And this has happened at a bunch of different storied institutions, Apple and Tesla and Google and SiFive and now Modular.
Starting point is 00:00:41 So anyway, it's a pleasure to have you joining us, Chris. Thank you, Ron. I'm so happy to be here. I guess I want to start by just hearing a little bit more about your origin story. How did you get into computing and how did you get into this world of both compiler engineering and programming language design? So I grew up in the 80s, back before computers were really a thing. I mean, we had PCs, but they weren't considered cool. And so I fell in love with understanding how the computer worked. And back then, things were way simpler.
Starting point is 00:01:07 I started with a BASIC interpreter, for example, and got a book from the store. Remember when we had books? You learned things from books. Did you do the thing where you'd get the hobbyist magazine and copy out the listing of the program from it? That's exactly right. And so we didn't have vibe coding, but we did have books. And so just by typing things in, you could understand how things work. And then when you broke it, because inevitably you're typing something in.
Starting point is 00:01:28 and you don't really know what you're doing. You have to figure out what went wrong. And so it encouraged a certain amount of debugging. I really loved computer games. Again, back then, things were a little bit simpler. Computer games drove graphics and performance and things like this. And so I spent some time on these things called bulletin board systems and the early Internet reading about how game programmers were trying to push the limits of the hardware.
Starting point is 00:01:47 And so that's where I got interested in performance and computers and systems. I went on to college and had an amazing professor at my school. Shout out to University of Portland in Portland, Oregon. And he was a compiler nerd. And so I think that his love for compilers was infectious. His name was Stephen Vegdahl. And that caused me to go on to pursue compilers at University of Illinois. And there again, I continued to fall down this rabbit hole of compilers and systems and built LLVM.
Starting point is 00:02:10 And ever since I got into the compiler world, I loved it. I love compilers because they're large-scale systems. There's multiple different components that all work together. And in the university setting, it was really cool in the compiler class just because unlike most of the assignments where you do an assignment, turn it in, forget about it. In compilers, you would do an assignment, turn it in, get graded, and then build on it. And it felt much more realistic, like software engineering, rather than just doing a project to get graded. Yeah, I think for a lot of people, the OS class is their first real experience of doing a thing where you really are building layer on top of layer. I think it's an incredibly important experience for people as they start engineering.
Starting point is 00:02:44 It's also one where you get to use some of those data structures. I took this almost academic, here's what a binary tree is, and here's what a graph is. And particularly when I went through it was taught from a very math-forward perspective, but it really made it useful. And so that was actually really cool. I'm like, oh, this is why I learned this stuff. So one thing that strikes me about your career is that you've ended up going back and forth between compiler engineering and language design space,
Starting point is 00:03:06 whereas I feel like a lot of people are on one side or the other. You know, they're mostly compilers people and they don't care that much about the language and just how do we make this thing go fast? And there are some people who are really focusing on language design and the work on the compiler is a secondary thing towards that design. And you've both popped back and forth and then also a lot of your compiler engineering work
Starting point is 00:03:25 really starting with LLVM, in some sense, it's itself very language forward. LLVM, there's a language in there that's this intermediate language that you're surfacing as a tool for people to use. So I'm just curious to hear more about how you think about the back and forth between compiler engineering and language design.
Starting point is 00:03:39 The reason I do this is that effectively my career is following my own interests, and so my interests are not static. I want to work on different kinds of problems and solve useful problems and build into things. And so the more technology and capability you have, the higher you can reach. And so with LLVM, for example, I built and learned a whole bunch of cool stuff about deep code generation
Starting point is 00:03:59 for an x86 chip, that category of technology with register allocation and stuff like this. But then it made it possible to go say, let's go tackle C++ and let's go use this to build the world's best implementation of something that lots more people use and understand than deep back-end code generation technology. And then with Swift, it was build even higher and say, okay, well, C++, maybe some people like it, but I think we can do better and let's reach higher. I've also been involved in AI systems, been involved in building an iPad app to help teach kids how to code, and so lots of different things over time.
Starting point is 00:04:29 And so for me, the place I think I'm most useful and where a lot of my experience is valuable ends up being at this hardware software boundary. I'm curious how you ended up making the leap to working on Swift. From my perspective, Swift looks from the outside like one of these points of arrival in mainstream programming contexts of a bunch of ideas that I have long thought are really great ideas in other programming languages. And it's, in some ways, a step away from, like, oh, I'm going to work on really low-level stuff and compiler optimization.
Starting point is 00:04:54 And then, you know, we'll go much higher level and do a C++ implementation, which is still pretty low-level. How did the whole Swift thing happen? Great question. I mean, the time frame for people that aren't familiar is that LLVM started in 2000. So by 2005, I had exited university and I joined Apple. And so LLVM was kind of an advanced research project at that point. By the 2010 timeframe, LLVM was much more mature, and we had just shipped C++ support in Clang. And so it could bootstrap itself, which means the compiler could compile itself.
Starting point is 00:05:23 It's all written in C++, and it was able to build advanced libraries, like the Boost template library, which is super crazy advanced template stuff. And so the C++ implementation that I and the team had built was real. Now, C++, in my opinion, is not a beautiful programming language. And so implementing it is a very interesting technical challenge. For me, a lot of problem solving ends up being how do you factor the system the right way?
Starting point is 00:05:45 And so Clang has some really cool stuff that allowed it to scale and things like that. But I was also burned out. We had just shipped it. It was amazing. I'm like, there has to be something better. And so Swift really came starting in 2010. It was a nights and weekends project. It wasn't like top-down management said, let's go build a new programming language.
Starting point is 00:06:01 It was Chris being burned out. I was running a 20 to 40-person team at the time, being an engineer during the day and being a technical leader, but then needing an escape hatch. And so I said, okay, well, I think we can have something better. I have a lot of good ideas. Turns out programming languages are a mature space. It's not like you need to invent pattern matching at this point. It's embarrassing that C++ doesn't have good pattern matching. We should just pause for a second. I think this is a small but really essential thing.
Starting point is 00:06:26 I think the single best feature coming out of languages like ML in the mid-70s is, first of all, this notion of an algebraic data type, meaning every programming language on Earth has a way of saying this and that and the other, a record or a class or a tuple. An early programming language, I think it was Barbara Liskov. Yeah, and she did a lot of the early theorizing about what are abstract data types. But the ability to do this or that or the other, to have data types that are a union of different possible shapes of the data, and then having this pattern matching facility that lets you basically in a reliable way do the case analysis so you can break down what the possibilities are
Starting point is 00:06:59 is just incredibly useful and very few mainstream languages have picked it up. I mean, Swift, again, is an example, but languages like ML, SML, and Haskell and OCaml. Standard. That's right, SML, Standard ML. It's been there for a long time. I mean, pattern matching, it's not an exotic feature.
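A minimal sketch of the idea, using the Variant type from Mojo's standard library (the language discussed later in this episode; Mojo doesn't have ML-style match syntax, so the case analysis here is manual, and names like IntOrString are just illustrative):

```mojo
from utils import Variant

# An algebraic "this or that" data type: a value is an Int or a String,
# never both at once.
alias IntOrString = Variant[Int, String]

fn describe(v: IntOrString) -> String:
    # Manual case analysis; an ML-style `match` would let the compiler
    # check that every alternative is covered.
    if v.isa[Int]():
        return "an integer"
    return "a string"

fn main():
    print(describe(IntOrString(42)))            # an integer
    print(describe(IntOrString(String("hi"))))  # a string
```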
Starting point is 00:07:15 Here we're talking about 2010. C# didn't have it. C++ didn't have it. Obviously, Java didn't have it. I don't think JavaScript had it. None of these mainstream languages had it, but it's obvious. And so part of my opinion about that, and so by the way, I'm an engineer. I'm not actually a mathematician.
Starting point is 00:07:32 And so type theory goes way over my head. I don't really understand this. The thing that gets me frustrated about the academic approach to programming languages is that people approach it by saying there's some types and there's intersection types and there's these types and they don't start from utility forward. And so pattern matching, when I learned OCaml, it's so beautiful. It makes it so easy and expressive to build very simple things. And so to me, I always identify it with the utility. And then, yes, there's amazing formal type theory behind it. And that's great.
Starting point is 00:07:58 And that's why it actually works and composes. But bringing that stuff forward and focusing on utility and the problems it solves and how it makes people happy ends up being the thing that I think moves the needle in terms of adoption, at least in the mainstream. Yeah, I mean, I think that's right. My approach and my interest in languages is also very much not from the mathematical perspective. Although, you know, my undergraduate degree is in math.
Starting point is 00:08:17 I like math a lot. But I mostly approach these things as a practitioner. But the thing I've been struck by over the years is the value of having these features have a really strong mathematical foundation is they generalize and, as you were saying, compose much better. If they are in the end mathematically simple, you're way more likely to have a feature that actually pans out as it gets used way beyond your initial view as to what the thing was for. That's right. Well, and see, this is actually a personal defect because I don't understand the math in the way that maybe theoretically would be ideal. I end up having to rediscover certain truths that are obvious.
Starting point is 00:08:48 The cliche of the Russian mathematician who invented it 50 years ago, right? And so a lot of what I find is that I can find truth and beauty when things compose and things fit together. And often I'll find out, you know, it's already been discovered
Starting point is 00:09:01 because everything in programming languages has been done. There's almost nothing novel. But still that design process of saying, let's pull things together. Let's reason about why it doesn't quite fit together. Let's go figure out how to better factor this.
Starting point is 00:09:12 Let's figure out how to make it simpler. That process, to me, I think is kind of like people working on physics, I hear. The simpler the outcome becomes, the closer to truth it feels like it is. And so I share that. Maybe it's more a design gene or engineer-design combination, but it's probably what you mathematicians actually know inherently, and I just haven't figured it out yet. Do you find yourself doing things after you come to it from an engineering perspective trying to figure out whether there are useful mathematical insights? Do you go back and read the papers? Do you have other PL people who are more mathematically
Starting point is 00:09:43 oriented who you talk to. How do you extend your thinking to cover some of that other stuff? See, the problem is math is scary to me. So I see Greek letters and I run away. I do follow arXiv and things like this and there's a programming language section on that. And so I get into some of it. But what I get attracted to in that is the examples and the results section and the future looking parts of it. And so it's not necessarily the how. It's the what it means. And so I think a lot of that really speaks to me. The other thing that really speaks to me when you talk about language design and things like this is blog posts from some obscure academic programming language that I've never heard of.
Starting point is 00:10:16 You just have somebody talking about algebraic effects systems for this, that, and the other thing, or something really fancy, but they figure out how to explain it in a way that's useful. And so when it's not just let me explain to you the type system, but it's let me explain the problem this fancy feature enables you to solve, that's where I get excited. And that's where it speaks to me because, again, I'm problem oriented, and I appreciate having a beautiful way to express and solve problems. I think there's a lot of value in the work that's done in papers of really, like, working out in detail the theory and the math and how it all fits together.
Starting point is 00:10:45 But yeah, I think the fact that the world has been filled with a lot of interesting blog posts from the same people has been great because I think it's another modality, one that often encourages you to pull out the simpler and easier to consume versions of those ideas. And I think that's just a different kind of insight and it's valuable to surface that too. And also when I look at those blog posts, sometimes I see design smells, particularly in the C++ community. There's a lot of really good work to fix C++. They're adding a lot of stuff to it, and C++ will never get simpler. You can't really remove things, right? And so a lot of the challenge there is it's constrained problem solving. And so when I look at that, often what I'll see when I'm reading one of those posts,
Starting point is 00:11:20 and again, these are brilliant people, and they're doing God's work trying to solve problems with C++. Best of luck with that. But you look at that and you realize there's a grain of sand in the system that didn't need to be there. And so to me, it's like if you remove that grain of sand, then the entire system gets relaxed. And suddenly all these constraints fall away, and you can get to something much simpler. Swift, for example. It's a wonderful language and it's grown really well, and the community is amazing. But it has a few grains of sand in it that cause it to get a lot more complicated. And so this is where I'm not just happy with things that got built. LLVM's amazing. It's very practical, but it has
Starting point is 00:11:49 lots of problems. That's why, when I get a chance to build a next-generation system, I want to learn from that and actually try to solve these problems. So this is the great privilege of getting to work on a new language, which is a thing you're doing now, right? There's this new language called Mojo, and it's being done by this company that you co-founded called Modular. Maybe just so we understand the context a little bit. Can you tell me a little bit about what is Modular, what's the basic offering, what's the business model? Before I even get there, I'll share more of how I got here. If you oversimplify my background, I did this LLVM thing, and it's foundational compiler technology for CPUs. It helped unite a lot of CPU era infrastructure, and it provided
Starting point is 00:12:24 a platform for languages like, yes, Swift, but also Rust and Julia, and many different systems that all got built on top of it. And I think it really catalyzed and enabled a lot of really cool applications of accelerated compiler technology. People use LLVM in databases and for query engine optimization, lots of cool stuff. Maybe you use it for trading or something. I mean, there can be tons of different applications for this kind of technology. And then I did programming language stuff with Swift. But in the meantime, AI happened. And so AI brought this entirely new generation of compute, GPUs, tensor processing units, large scale AI training systems, FPGAs and ASICs and all this complexity for compute. And LLVM never really worked in that system.
Starting point is 00:13:02 And so one of the things that I built when I was at Google was a bunch of foundational compiler technology for that category of systems. And there's this compiler technology called MLIR. MLIR is basically LLVM 2.0. And so take everything you learned from building LLVM and helping solve this, but then bring it forward into this next generation of compiler technology so that you can go hopefully unify the world's compute for this GPU and AI and ASIC kind of world. MLIR has been amazingly successful and I think it's used in roughly every one of these AI systems and GPUs. It's used by NVIDIA. It's used by Google. It's used by roughly everybody in the space. But one of the
Starting point is 00:13:37 challenges is that there hasn't been unification. And so you have these very large-scale AI software platforms. You have CUDA from NVIDIA. You have XLA from Google. You have ROCm from AMD. It's countless. Every company has their own software stack. And one of the things that I discovered and encountered, and I think the entire world
Starting point is 00:13:53 sees, is that there's this incredible fragmentation driven by the fact that each of these software stacks built by a hardware maker are just all completely different. And some of them work better than others. But regardless, it's a gigantic mess. And there's these really cool high-level technologies like PyTorch that we all love and we want to use. But if PyTorch is built on completely different stacks, gluing together these monolithic worlds from different vendors, it's very difficult to get something that works. Right. There are both complicated trade-offs around
Starting point is 00:14:20 the performance that you get out of different tools, and then also a different set of complicated trade-offs around how hard they are to use, how complicated it is to write something in them, and then what hardware you can target from each individual one. And each of these ecosystems is churning just incredibly fast. There's always new hardware coming out and new vendors in new places. And there's also new little languages popping up into existence. And it makes the whole thing pretty hard to wrangle. Exactly.
Starting point is 00:14:42 And AI is moving so fast. There's a new model every week. It's crazy. And new applications, new research, the amount of money being dumped into this by everybody is just incredible. And so how does anybody keep up? It's a structural problem in the industry. And so the structural problem is that the people doing this kind of work,
Starting point is 00:14:57 the people doing code generation for advanced GPUs and things like this. They're all at hardware companies. And the hardware companies, every single one of them is building their own stack because they have to. There's nothing to plug into. There's nothing like LLVM but for AI. That doesn't exist. And so as they go and build their own vertical software stack,
Starting point is 00:15:14 of course they're focused on their hardware. They got advanced roadmaps. They have a new chip coming out next year, right? They're plowing their energy and time into solving for their hardware. But we out in the industry, we actually want something else. We want to be able to have software that runs across multiple pieces of hardware. And so if everybody, doing the work is a hardware company, it's very natural that you get this fragmentation across
Starting point is 00:15:35 vendors because nobody's incentivized to go work together. And even if they're incentivized, they don't have time to go work on somebody else's chip. AMD is not going to pay to work on NVIDIA GPUs or something like this. That's true when you think about this kind of a split between low-level and high-level languages. NVIDIA has CUDA and AMD has ROCm, which is mostly a clone of CUDA, and then the XLA tools from Google work incredibly well on TPUs and so on and so forth. Different vendors have different things. Then there's, like, the high-level tools, PyTorch and JAX and Triton and various things like that. And those are typically actually not made by the hardware vendors. Those are made by different kinds of users. I guess Google is
Starting point is 00:16:12 responsible for some of these, and they are also sometimes a hardware vendor. But a lot of the time, it's a step more removed. Although even there, the cross-platform support is complicated and messy and incomplete. Because they're built on top of fundamentally incompatible things. So that's the fundamental nature. And so, again, you go back to Chris's dysfunction. And my weird career choices. I always end up back at the hardware software boundary. And there's a lot of other folks that are really good at adding very high-level abstractions. If you go back a few years ago, MLOps was a cool thing. And it was, let's build a layer of Python on top of TensorFlow and PyTorch and build a unified AI platform. But the problem with that is that building abstractions on
Starting point is 00:16:45 top of two things that don't work very well can't solve performance or reliability or manageability or these other problems. You can only add a layer of duct tape. But as soon as something goes wrong, you end up having to debug this entire crazy stack of stuff that you really didn't want to have to know about. And so it's a leaky abstraction. And so the genesis of Modular, bringing it back to this, was realizing there are structural problems in the industry.
Starting point is 00:17:08 There is nobody that's incentivized to go build a unifying software platform and do that work at the bottom level. And so what we set off to do is we said, okay, let's go build. And there's different ways of explaining this. You could say a replacement for CUDA. That's a flamboyant way to say this.
Starting point is 00:17:21 But let's go build a successor to all of this technology that is better than what the hardware makers are building and is portable. And so what does this take? This takes doing the work that these hardware companies are doing. And I set the goal for the team of saying, let's do it better than, for example,
Starting point is 00:17:36 NVIDIA is doing it for their own hardware. Which is no easy feat. They've got a lot of very strong engineers, and they understand their hardware better than anyone does. Beating them on their own hardware is tough. That is really hard. And they've got a 20-year head start, because CUDA is about 20 years old.
Starting point is 00:17:50 They've got all the momentum. They're a pretty big company. As you say, lots of smart people. And so that was a ridiculous goal. Why did I do that? Well, I mean, a certain amount of confidence in understanding how the technology worked, having a bet on what I thought we could build and the approach and some insight and intuition, but also realizing that it's actually destiny, somebody has to do this work. If we ever want to get to an ecosystem where one vendor doesn't control everything, if we want to get the best out of the hardware, if we want to get new programming language technologies, if we want pattern matching on a GPU, I mean, come on, this isn't rocket science, then we need at some point to do this. And if nobody else is going to do it, I'll step up and do that.
Starting point is 00:18:27 And so that's where Modular came from is saying, let's go crack this thing open. I don't know how long it will take, but sometimes it's worthwhile doing really hard things if they're valuable to the world. And the belief was it could be profoundly impactful and hopefully get more people into even just being able to use this new form of compute with GPUs and accelerators and all the stuff and just really re-democratize AI compute. So you pointed out that there's a real structural problem here. And I'm actually wondering how at a business model level you want to solve the structural problem, which is: the history of computing is these days littered with the bodies of companies that tried to sell a programming language. It's a really hard business. How is Modular set up so that it's incentivized to build this platform in a way that can be a shared platform that isn't subject to just one other vendor's lock-in?
Starting point is 00:19:10 First answer is don't sell a programming language. As you say, that's very difficult. So we're not doing that. Go take Mojo, go use it for free. We're not selling a programming language. What we're doing is we're investing in this foundational technology to unify hardware. Our view is, as we've seen in many other domains, once you fix the foundation, now you can build high value services for enterprises. And so at our enterprise layer, often when we talk to people, you end up with these groups where you have hundreds or thousands of GPUs.
Starting point is 00:19:36 Often it's rented from a cloud on a three-year commit. You have a platform team that's carrying pagers and they need to keep all the stuff running and all the production workloads running. And then you have these product teams that are inventing new stuff all the time. And there's new research. There's a new model that comes out and they want to get it on the production infrastructure. But none of this stuff actually works. And so the software ecosystem we have with all these brilliant but crazy open source tools that are thrashing around, all these different versions of CUDA and libraries, all this different hardware happening.
Starting point is 00:20:02 It's just a gigantic mess. And so helping solve this for the platform engineering team that actually needs to have stuff work and want to be able to reason about it and want good observability and manageability and scalability and things like this is actually, we think, very interesting. We've gotten a lot of good response from people on that. The cost of doing this is we have to actually make it work. That's where we do fundamental language, compiler, and underlying systems technology and help bring together these accelerators so that we can get, for example, the best performance on an AMD GPU
Starting point is 00:20:29 and get it so that the software comes out in the same release train as support for an NVIDIA GPU. And being able to pull that together, again, just multiplicatively reduces complexity, which then leads to a product that actually works, which is really cool and very novel in AI. So the way that Mojo plays in here is it basically lets you provide the best possible performance, and I guess the best possible performance across multiple different hardware platforms. Are you primarily thinking about this as an inference platform
Starting point is 00:20:55 or how does the training world fit in? So let me zoom out and I'll explain our technology components. I have a blog post series I encourage you and any viewers or listeners to check out called Democratizing AI Compute. It goes through the history of all the systems
Starting point is 00:21:09 and problems and challenges that they've run into and it gets to what is Modular doing about it. So part 11 talks about architecture, and the inside is Mojo, which is a programming language. I'll explain Mojo in a second. The next level out is called MAX.
Starting point is 00:21:22 And so you can think of MAX as being a PyTorch replacement or a vLLM replacement, something that you can run on a single node and then get high performance LLM serving, that kind of use case. And then the next level out is called Mammoth, and this is the cluster management, Kubernetes kind of layer. And so if you zoom in all the way back to Mojo, you say, in your experience, you know what programming languages are? They're incredibly difficult and expensive to build.
Starting point is 00:21:42 Why would you do that in the first place? And the answer is, we had to. In fact, when we started Modular, I was like, I'm not going to invent a programming language. I know that's a bad idea. It takes too long. It's too much work. You can't convince people to adopt a new language.
Starting point is 00:21:54 I know all the reasons why creating a language is actually a really bad idea. But it turns out we are forced to do this because there is no good way to solve the problem. And the problem is how do you write code that is portable across accelerators? So that problem, I want portability across, for example, make it simple, AMD and NVIDIA GPUs. But then you layer on the fact that you're using a GPU because you want performance. And so I don't want a simplified, watered-down Java that runs on a GPU. I want the full power of the GPU.
Starting point is 00:22:25 I want to be able to deliver performance that meets and beats NVIDIA on their own hardware. I want to have portability and unify this crazy compute where you have these really fancy heterogeneous systems and you have tensor cores and you have this explosion of complexity and innovation happening in this hardware platform layer. Most programming languages don't even know
Starting point is 00:22:42 that there's an 8-bit floating point that exists. And so we looked around, and I really did not want to have to do this, but it turns out that there really is no good answer. And again, we decided that, hey, the stakes are high, we want to do something impactful, we're willing to invest. I know what it takes to build a programming language. It's not rocket science. It's just a lot of really hard work and you need to set the team up to be incentivized the right way. But we decided that, yeah, let's do that. So I want to talk more about Mojo and its design. But before we do,
Starting point is 00:23:07 maybe let's talk a little bit more about the pre-existing environment. I did actually read that blog post series. I recommend it to everyone. I think it's really great. And I want to talk a little bit about what the existing ecosystem of languages looks like. But even before then, can we talk more about the hardware? What does that space of hardware look like that people want to run these ML models on? Yeah. So the one that most people are on is a GPU. And so GPUs are, I think, getting better understood now. And so if you go back before that, though, you have CPUs. So modern CPUs in a data center. Often you'll have, I mean, today, you guys are probably running quite big iron, but you've got 100 cores in a CPU, and you've got a server with two to four CPUs on a motherboard,
Starting point is 00:23:46 and then you go and you scale that. And so you've got traditional threaded workloads that have to run on CPUs, and we know how to scale that for internet servers and things like this. If you get to a GPU, the architecture shifts. And so they have basically these things called SMs. And now the programming model is that you have effectively much more medium-sized compute that's now put together on much higher performance memory fabrics. And the programming model shifts.
Starting point is 00:24:10 And one of the things that really broke CUDA, for example, was when GPUs got this thing called a tensor core. And the way to think about a tensor core is it's a dedicated piece of hardware for matrix multiplication. And so why do we get that? Well, a lot of AI is matrix multiplication. And so if you design the hardware to be good at a specific workload, you can have dedicated silicon for that, and you can make things go really fast. There are really these two quite different models sitting inside of the GPU space. Of course, the name itself is where GPU is graphics processing unit, which is what they were originally for.
Starting point is 00:24:40 And then this SM model is really interesting. They have this notion of a warp, right? A warp is a collection of typically 32 threads that are operating together, kind of in lockstep, always doing the same thing. A slight variation on what's called the SIMD model, same instruction, multiple data. It's like a little more gentle than that. More or less, you can think of it as the same thing. And you just have to run a lot of them.
Starting point is 00:25:00 And then there's a ton of hardware inside of these systems, basically to make switching between threads incredibly cheap. So you pay a lot of silicon to add extra registers. So the context switch is super cheap. So you can do a ton of stuff in parallel. Each thing you're doing is itself like 32-wise parallel. And then because you can do all this very fast context switching, you can hide a lot of latency.
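To make the SIMT model concrete, here is a hedged sketch of an elementwise kernel in Mojo, whose gpu module exposes CUDA-style thread and block indices (the exact API has moved around between releases, so treat the imports as approximate):

```mojo
from gpu import block_dim, block_idx, thread_idx
from memory import UnsafePointer

# SIMT-style kernel: each thread computes one element, and the hardware
# advances the threads of a warp (typically 32 of them) in lockstep.
fn scale_kernel(dst: UnsafePointer[Float32], src: UnsafePointer[Float32], n: Int):
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)
    if i < n:  # threads that land past the end of the buffer do nothing
        dst[i] = 2.0 * src[i]
```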
Starting point is 00:25:19 And that worked for a while. And then we're like, actually, we need way more of this matrix multiplication stuff. And you can sort of do reasonably efficient matrix multiplication through this warp model, but not really that well. And then there's a bunch of quite idiosyncratic hardware, which changes its performance characteristics from generation to generation, just for doing these matrix multiplications.
Starting point is 00:25:38 Right? So that's sort of the NVIDIA GPU story. There's Volta, with the V100, and then the A100 and H100. They just keep on going and changing pretty materially from generation to generation in terms of the performance characteristics and then also the memory model, which keeps on changing. You go back to the intuition: CUDA was never designed for this world. CUDA was not designed for modern GPUs.
Starting point is 00:25:57 It was designed for a much simpler world. And CUDA, being 20 years old, hasn't really caught up. And it's very difficult because, as you say, the hardware keeps changing. And so CUDA came from a world where, almost like C, it's designed for a very simple programming model that it expected to scale, but then as the hardware changed, it couldn't adapt. Now, if you get beyond GPUs, you get to Google TPUs and many other dedicated AI systems, they blow this way out, and they say, okay, well, let's get rid of the threads that you have on a GPU, and let's just have matrix multiplication units and have really big matrix multiplication units and build the entire chip around that, and you get much more specialization, but you get a much higher throughput for those AI workloads. Going back to why Mojo, well, Mojo was designed from first principles to support this kind of system. Each of these chips, as you're saying, even within NVIDIA's family, the Volta to Ampere to Hopper to Blackwell, these things are not compatible with each other.
Starting point is 00:26:51 Actually, Blackwell just broke compatibility with Hopper, so it can't always run Hopper kernels on Blackwell. Oops. Well, why are they doing that? Well, the software is moving so fast, they decided that was the right tradeoff to make. And meanwhile, we, all the software people, need the ability to target this. When you look at other existing systems, with Triton, for example, their goal was, let's make it easier to program a GPU, which I love. That's awesome. But then they said, we'll just give up 20% of the performance of the silicon to do it.
Starting point is 00:27:16 Wait a second. I want all the performance. And so if I'm using a GPU, GPUs are quite expensive, by the way. I want all the performance. And if it's not going to be able to deliver the same quality of results you would get by writing CUDA, well, then you're always going to run into this headroom problem where you get going quickly, but then you run into a ceiling and then have to switch to a different system to get full performance. And so this is where Mojo is really trying to solve this problem where we can get more usability, more portability, and full performance of the silicon, because it's designed for these wacky architectures like tensor cores. And if we look at the other languages that are out there, there's languages like CUDA and OpenCL, which are low-level, typically look like variations on C++ in that tradition, are unsafe languages, which means that there's a lot of rules you have to follow.
Starting point is 00:28:01 And if you don't exactly follow the rules, you're in undefined behavior land. It's very hard to reason about your program. And just let me make fun of my C++ heritage, because I've spent so many years there. Like, you just have a variable that you forget to initialize. It just shoots your foot off. Like, it's just unnecessary violence to programmers.
Starting point is 00:28:26 a mistake. And they want to have as much space as they can to optimize the programs they get. So the stance is just, if you do anything that's not allowed, we have no obligation to maintain any kind of reasonable semantics or debugability around that behavior. And we're just going to try really, really hard to optimize correct programs, which is a super weird stance to take because nobody's programs are correct. There are bugs and undefined behavior in almost any C++ program of any size. And so you're in a very strange position in terms of the guarantees that you get from the compiler system you're using.
Starting point is 00:28:58 Well, so, I mean, I can be dissatisfied, but I can also be sympathetic with people that work on C++. So again, I've spent decades in this language and around this ecosystem and building compilers for it. I know quite a lot about it. The challenge is that C++ is established. And so there's tons of code out there. By far, the code that's already written is the code that's the most valuable. And so if you're building a compiler or you have a new chip or you have an optimizer, your goal is to get value out of the existing software. And so you can't invent a new programming paradigm that's a better way of doing things and defines away the problem; instead, you have to work with what you've got. You have a SPEC benchmark you're trying to make go fast. And so you
Starting point is 00:29:35 invent some crazy hero hack that makes some important benchmark work because you can't go change the code. In my experience, particularly for AI, but also I'm sure within Jane Street, if something's going slow, go change the code. You have control over the architecture of the system. And so what I think the world really benefits from, unlike benchmark hacking, is languages that give control and power and expressivity to the programmer. And this is something where I think that if you, again, take a step back, you realize history is the way it is for lots of structural and very valid reasons, but those reasons don't apply to this new age of compute. Nobody has a workload that they can pull forward to next year's GPU. It doesn't exist. Nobody solved this problem. I don't know
Starting point is 00:30:15 the time frame, but once we solve that problem, once we solve portability, you can start this new era of software that can actually go forward. And so now, to me, the burden is make sure it's actually good. And so, to your point about memory safety, don't make it so forgetting to initialize a variable is just going to shoot your foot off. Produce a good compiler error saying, hey, you forgot to initialize a variable, right? These basic things are actually really profound and important, and the tooling and all this usability.
Starting point is 00:30:42 And this DNA, these feelings and thoughts are what flow into Mojo. And GPU programming is just a very different world from traditional CPU programming. Just in terms of the basic economics and how humans are involved, you end up dealing with much smaller programs. You have these very small,
Starting point is 00:30:56 but very high value programs, whose performance is super critical, and in the end, a relatively small coterie of experts who end up programming in it. And so it pushes you ever in the direction you're saying of performance engineering, right? You want to give people the control they need to make the thing behave as it should. And you want to do it in a way that allows people to be highly productive. And the idea that you have an enormous amount of legacy code that you need to bring over, it's like, actually you kind of don't.
Starting point is 00:31:20 The entire universe of software is actually shockingly small. And it's really about how to write these small programs as well as possible. And also there's another huge change. And so this is something that I don't think that the programming language community has recognized yet, but AI coding has massively changed the game. Because now you can take a CUDA kernel and say, hey, Claude, go make that into Mojo. And actually, how good have you guys found the experience of doing that translation? Well, we do hackathons, and people do amazing things, having never touched Mojo, having never done GPU programming.
Starting point is 00:31:50 And within a day, they can make things happen that are just shocking. And so now, AI coding tools are not magic. You cannot just vibe code DeepSeek-R1 or something, right? But it's amazing what that can do in terms of learning new languages, learning new tools, and getting into and catalyzing ecosystems. And so this is one of the things where, again, you go back five or ten years, everybody knows nobody can learn a new language and nobody's willing to adopt new things, but the entire system has changed.
Starting point is 00:32:16 So let's talk a little bit more in detail about the architecture of Mojo. What kind of language is Mojo and what are the design elements that you chose in order to make it be able to address this set of problems? Yeah. Again, just to relate how different the situation is, back when I was working on Swift, one of the major problems to solve was that Objective-C was very difficult for people to use. And you had pointers and you had square brackets and it was very weird. And so the goal and the game of the day was to invent new syntax and bring together modern programming language features to build a new language. Fast forward to today, actually, some of that is true. AI people don't like C++. C++ has pointers and it's ugly and it's a 40-plus-year-old language. It has actually the same problem that Swift had to solve back in the day. But today there's something different, which is that AI people do actually love a thing. It's called Python.
Starting point is 00:33:04 And so one of the really important things about Mojo is it's a member of the Python family. And so this is polarizing to some because, yes, I get it that some people love curly braces. But it's hugely powerful because so much of the AI community is Pythonic already. And so we start out by saying, let's keep the syntax like Python and only diverge from that if there's a really good reason. But then what are the good reasons? Well, the good reasons are we want, as we're talking about, performance, power, full control over the system. And for GPUs, there's these very important things
Starting point is 00:33:31 you want to do that require metaprogramming. And so Mojo has a very fancy metaprogramming system, kind of inspired by this language called Zig that brings runtime and compile time together to enable really powerful library designs. And the way you crack open this problem with tensor cores and things like this is you enable really powerful libraries
Starting point is 00:33:49 to be built in the language as libraries instead of hard-coding them into the compiler. Let's talk a little bit about the metaprogramming idea. What is metaprogramming and why does it matter for performance in particular? Yeah, it's a great question. And I think you know the answer to this too. And I know what you're going to say. Secretly, we are also working on metaprogramming features in our own world.
Starting point is 00:34:07 Exactly. And so the observation here is when you're writing a for loop in a programming language, for example, typically that for loop executes at runtime. So you're writing code that, when you execute the program, is the instructions that the computer will follow to execute the algorithm within your code. But when you get into designing higher level type systems,
Starting point is 00:34:29 suddenly you want to be able to run code at compile time as well. And so there's many languages out there. Some of them have macro systems. C++ has templates. What you end up getting in many languages is this duality between what happens at runtime and then a different language, almost, that happens at compile time. And C++ is the most egregious because with templates, you have a for loop at runtime, but then you have unrolled recursive templates
Starting point is 00:34:46 or something like that at compile time. Well, so the insight is, hey, these two problems are actually the same; they just run at different times. And so what Mojo does is say, let's allow the use of effectively any code that you would use at runtime to also work at compile time. And so you can have a list or string
Starting point is 00:35:03 or whatever you want, and the algorithms can go do memory allocation, deallocation, and you can run those at compile time, enabling you to build really powerful, high-level abstractions and put them into libraries. So why is this cool? Well, the reason it's cool is that on a GPU, for example, you'll have a tensor core.
Starting point is 00:35:19 Tensor cores are weird. We probably don't need a deep dive into all the reasons why. But the indexing and the layout that tensor cores use is very specific and very vendor-different. And so the tensor core you have on AMD or the tensor core you have on different versions of NVIDIA GPUs are all very different. And so what you want is to build, as a GPU programmer, a set of abstractions so you can reason about all of these things in one common ecosystem and have the layouts at a much higher level. And so what this enables is very powerful libraries, and very powerful libraries where a lot of logic is actually done at compile time, but you can debug it because it's the same language that you use at runtime. And it makes the language much simpler, much more powerful, and able to scale into these complexities in a way that's possible with C++, but in C++, you get some crazy template stack trace that is maddening and impossible to understand. In Mojo, you can get a very simple error message. You can actually debug your code in a debugger, and things like this.
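For flavor, here is a small sketch (illustrative names, not code from the episode) of Mojo's approach: compile-time parameters go in square brackets, and the @parameter decorator runs ordinary-looking code during compilation:

```mojo
# `count` is a compile-time parameter: each distinct value stamps out a
# specialized copy of the function, with the loop unrolled before runtime.
fn sum_first[count: Int](values: List[Float64]) -> Float64:
    var total: Float64 = 0.0

    @parameter  # this loop runs at compile time and is fully unrolled
    for i in range(count):
        total += values[i]
    return total

fn main():
    var xs = List[Float64](1.0, 2.0, 3.0, 4.0)
    print(sum_first[3](xs))  # 6.0, from a version specialized for count == 3
```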
Starting point is 00:36:10 So maybe an important point here is that metaprogramming is really an old solution to this performance problem. Maybe a good way of thinking about this is imagine you have some piece of data that represents a little embedded domain-specific language that you've written that you want to execute via a program that you wrote. You can, in a nice high-level way, write a little interpreter for that language that just, you know, I have maybe a Boolean expression language, or who knows what else? Maybe it's a language for computing on tensors in a GPU. And you could write a program that just executes that mini domain-specific language and does the thing that you want. And you can do it, but it's really
Starting point is 00:36:45 slow. Writing an interpreter is just inherently so, because of all this interpretation overhead, where you are dynamically making decisions about what the behavior of the program is. And sometimes what you want is you just want to actually emit exactly the code that you want and boil away the control structure and just get the direct lines of machine code
Starting point is 00:37:03 that you want to do the thing that's necessary. And various forms of code generation let you, in a simpler way, get past all of this control structure that you have to execute at runtime and instead be able to execute it at compile time and get this minified program that just does exactly the thing that you want.
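A tiny before-and-after, sketched in Mojo with hypothetical names: the first function inspects op on every call at runtime; in the second, op is a compile-time parameter, so the branch is resolved during compilation and only the chosen operation survives in the generated code:

```mojo
# Interpreter-style: `op` is runtime data, so every call pays for the branch.
fn apply(op: Int, x: Float64, y: Float64) -> Float64:
    if op == 0:
        return x + y
    return x * y

# Specialized: `op` is a compile-time parameter, and `@parameter if`
# folds the branch away during compilation, leaving straight-line code.
fn apply_static[op: Int](x: Float64, y: Float64) -> Float64:
    @parameter
    if op == 0:
        return x + y
    return x * y

fn main():
    print(apply(1, 3.0, 4.0))         # 12.0, dispatched at runtime
    print(apply_static[1](3.0, 4.0))  # 12.0, dispatch resolved at compile time
```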
Starting point is 00:37:19 So that's a really old idea. It goes back to all sorts of programming languages, a lot of Lisps that did a lot of this metaprogramming stuff. But then the problem is this stuff is super hard to think about and reason about and debug. And that's certainly true in C, with its macro language; if you use the C preprocessor to do this kind of stuff, it's pretty painful to reason about. And then C++ made it richer and more expressive, but still really hard to reason about. And you write a C++ template and you don't really know what it's going to do or if it's going
Starting point is 00:37:47 to compile until you give it all the inputs and let it go. And it feels good in the simple case, but then when you get to more advanced cases, suddenly the complexity compounds and it gets out of hand. And it sounds like the thing that you're going for in Mojo is it feels like one language. It has one type system that covers both the stuff you're generating statically and the stuff that you're doing at runtime. It sounds like debugging works in the same way across both of these layers. But you still get the actual runtime behavior you want from a language that you could more explicitly
Starting point is 00:38:17 just be like, here's exactly the code that I want to generate. And so I'd zero in on metaprogramming as one of the fancy features. One of the cool features is it feels and looks like Python, but with actual types. Right. And let's not forget the basics. Having something that looks and feels like Python, but is a thousand times faster or something,
Starting point is 00:38:34 is actually pretty cool. For example, if you're on a CPU, you have access to SIMD, the SIMD registers that allow you to do multiple operations at a time. Being able to get the full power of your hardware, even without using the fancy features, is also really cool. And so the challenge with any of these systems,
Starting point is 00:38:49 is how do you make something that's powerful, but it's also easy to use? I think your team's been playing with Mojo and doing some cool stuff. I mean, what have you seen? And what's your experience been? We're all still pretty new to it, but I think it's got a lot of exciting things going for it.
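To ground the SIMD remark just above, a sketch of what that looks like in Mojo (simdwidthof lives in the sys module in current releases, though import paths have shifted over time):

```mojo
from sys import simdwidthof

fn main():
    # Ask at compile time how many Float32 lanes this CPU's vector
    # registers hold (e.g. 8 with AVX2), and size the SIMD values to match.
    alias width = simdwidthof[DType.float32]()
    var a = SIMD[DType.float32, width](1.5)  # splat 1.5 into every lane
    var b = SIMD[DType.float32, width](2.0)
    var c = a * b              # a single vector multiply across all lanes
    print(c.reduce_add())      # horizontal sum of the lanes
```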
Starting point is 00:39:01 We're all still pretty new to it, but I think it's got a lot of exciting things going for it. I mean, the first thing is, yeah, it gives you the kind of programming model you want to get the performance you need. And actually, in many ways, the same kind of programming model that you get out of something like CUTLASS or CuTe DSL, which are these NVIDIA-specific things, some at the C++ level,
Starting point is 00:39:15 some at the Python DSL level. And by the way, every tool you can imagine nowadays is done once in C++ and once in Python. We don't need to implement programming languages any other way anymore. They're all either skins on C++ or skins on Python. But depending on which path you go down, whether you go the C++ path or the Python path, you get all sorts of complicated tradeoffs. In the C++ path, in particular, you get very painful compilation times. The thing you said about template metaprogramming is absolutely true.
Starting point is 00:39:40 The error messages are super bad. If you look at these more Python-embedded DSLs, the compile times tend to be better. It still can be hard to reason about, though. One nice thing about Mojo is the overall discipline seems very explicit. When you want to understand, is this a value that's happening at execution time at the end, or is it a value that you know is going to be dealt with at compile time? It's very explicit in the syntax; you can look and understand, whereas in some of these DSLs, you have to actively go and poke the value and ask it what kind of value it is.
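In Mojo terms (our gloss, not a quote from the episode), that split is visible right in the declarations: alias and square-bracket parameters are compile-time, while var and ordinary arguments are runtime:

```mojo
alias TILE = 64  # compile-time constant, folded into the generated code

fn process[tile: Int](data: List[Int]) -> Int:
    # `tile` (square brackets) is fixed per specialization at compile time;
    # `data` and the loop below are ordinary runtime values and code.
    var acc = 0
    for i in range(min(tile, len(data))):
        acc += data[i]
    return acc
```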
Starting point is 00:40:08 And I think that kind of explicitness is actually really important for performance engineering, making it easy to understand just what precisely you're doing. You actually see this a ton, not even with these very low-level things, but if you look at PyTorch, which is a much higher-level tool, PyTorch does this thing where you get to write a thing that looks like an ordinary Python program, but really it's got a much trickier execution model.
Starting point is 00:40:35 None. What can you do? Anything. You have an enormous amount of freedom. The PyTorch people in particular have leveraged this freedom in a bunch of very clever ways where you can write a Python program that looks like it's doing something very simple and straightforward
Starting point is 00:40:46 that would be really slow. But no, it's very carefully delaying and making some operations lazy so it can overlap compute on the GPU and CPU and make stuff go really fast. And that's really nice, except sometimes it just doesn't work. This is the trap.
Starting point is 00:41:00 Again, this is my decades of battle scars now. So as a compiler guy, I can make fun of other compiler people. There's this trap, and it's an attractive trap, which is called the sufficiently smart compiler. And so what you can do is you can take something and you can make it look good on a demo, and you can say, look, I make it super easy, and I'm going to make my compiler super smart, and it's going to take care of all
Starting point is 00:41:20 this and make it easy through magic. But magic doesn't exist. And so anytime you have one of those sufficiently smart compilers, if you go back in the day, it was auto-parallelization: just write C code, the sequential logic, and then we're going to automatically map it into running on 100 cores on a supercomputer or something like that. They often actually do work. They work in very simple cases, and they work in the demos. But the problem is that you go and you're using them, and then you change one thing and suddenly everything breaks. Maybe the compiler crashes. It just doesn't work.
Starting point is 00:41:48 Or you go and fix a bug, and now, instead of a hundred times speedup, you get a hundred times slowdown because it foiled the compiler. A lot of AI tools, a lot of these systems, particularly these DSLs, have this design point of, let me pretend like it's easy, and then I will take care of it behind the scenes, but then when something breaks, you have to end up looking at compiler dumps. Right?
Starting point is 00:42:08 And this is because magic doesn't exist. And so this is where predictability and control is really, I think, the name of the game, particularly if you want to get the most out of a piece of hardware, which is how we ended up here. It's funny, the same issue of how clever is the underlying system you're using comes up when you look at the difference between CPUs and GPUs. CPUs themselves are trying to do a weird thing where a chip is a fundamentally parallel substrate. It's got all of these circuits that in principle could be running in parallel. And then it is yoked to running this extremely sequential programming language, which is just trying to do one thing after another. And then how does that actually work with any reasonable efficiency?
Starting point is 00:42:43 Well, there's all sorts of clever, dirty tricks happening under the covers where it's trying to predict what you're going to do, the speculation that allows it to dispatch multiple instructions in a row by guessing what you're going to do in the future. There's things like memory prefetching, where it has heuristics to estimate what memory you're going to ask for in the future so it can dispatch multiple memory requests at the same time. And then if you look at things like GPUs and I think even more TPUs, and then also totally other things like FPGAs, the field-programmable gate array where you put basically a circuit design on it, it's a very different kind of software system. But all of them are, in some sense,
Starting point is 00:43:18 simpler and more deterministic and more explicitly parallel. When you write down your program, you have to write an explicitly parallel program. That's actually harder to write. I don't want to complain too much about CPUs. The great thing about CPUs is they're extremely flexible and incredibly easy to use. And all of that
Starting point is 00:43:34 dark magic actually works a pretty large fraction of the time. Yeah, remarkably well. But your point here, I think, is really great. What you're saying is CPUs are the magic box that makes sequential code go in parallel pretty fast. And then we have new, more explicit machines, somewhat harder to program because they're not a magic box. But you get something from it. You get performance and power, because that magic box doesn't come without a cost. It comes with a very significant cost.
Starting point is 00:43:59 Often in the amount of power that your machine dissipates, and so it's not efficient. And so a lot of the reason we're getting these new accelerators is because people really do care about it being 100 times faster or using way less power or things like this. I'd never thought about it, but your analogy of Triton to Mojo kind of follows a similar pattern, right? As Triton is trying to be the magic box, and it doesn't give you the full performance, and it burns more power and all that kind of stuff. And so Mojo's saying, look, let's go back to being simple. Let's give the programmer more control. And that more explicit approach, I think, is a good fit for people that are building crazy advanced hardware like you're talking about, but also people that want to get the best performance out of the existing hardware we have.
Starting point is 00:44:37 So we talked about how metaprogramming lets you write faster programs by boiling away this control structure that you don't really need. So that part's good. How does it give you portable performance? How does it help you on the portability front? Yeah, so this is another great question. So in this category of sufficiently smart compilers, and particularly for AI compilers, there's been years of work. And MLIR has catalyzed a lot of this work, building these magic AI compilers that take TensorFlow or even the new PyTorch stuff and try to generate optimal code for some chip. So take some PyTorch model and put it through a compiler and magically get out
Starting point is 00:45:12 high performance. And so there's tons of these things and there's a lot of great work done here. And a lot of people have shown that you can take kernels and accelerate them with compilers. The challenge with this is that people don't ever measure what is the full performance of the chip. And so people always measure from a somewhat unfortunate baseline and then try to climb higher instead of saying, what is the speed of light? And so if you measure from speed of light, suddenly you say, okay, how do I achieve several different things? Even if you zero in on one piece of silicon, how do I achieve the best performance for one use case?
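To make the speed-of-light framing concrete with deliberately round, hypothetical numbers: if a chip's tensor cores can theoretically sustain 1,000 TFLOPS at some precision and your kernel achieves 400 TFLOPS, you are at 40% of speed of light; measuring against a weak baseline kernel that hits 100 TFLOPS would instead report a flattering 4x speedup while hiding the missing 60%.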
Starting point is 00:45:42 And then how do I make it so the software I write can generalize even within the domain? And so, for example, take a matrix multiplication. Well, you want to work on maybe float32, but then you want to generalize it to float16. Okay, well, templates and things like this are an easy way to do this. And then metaprogramming allows you to say, okay, I will tackle that. And then the next thing that happens is because you went from float32 to float16, your effective cache size has doubled because twice as many elements fit into cache
Starting point is 00:46:09 if they're 16 bits than if they're 32 bits. Well, if that's the case, now suddenly the access pattern needs to change. And so you get a whole bunch of this conditional logic that now changes in a very parametric way as a result of one simple change that happened with float32 to float16. Now, you play that forward and you say,
Starting point is 00:46:26 okay, well, actually matrix multiplication is a recursive hierarchical problem. There's specializations for tall and skinny matrices where a dimension is one or something; there's all these special cases. Just one algorithm for one chip becomes this very complicated subsystem that you end up wanting to do a lot of transformations to
Starting point is 00:46:44 so you can go specialize it for different use cases. And so Mojo with the metaprogramming allows you to tackle that. Now you bring in other hardware. And so think of matrix multiplication these days as being almost an operating system. There's like so many different subsystems and special cases and different dtypes
Starting point is 00:47:00 and crazy float4 and float6 and other stuff going on. At some point, they're going to come out with a floating point number so small that it will be a joke. But every time I think that you're just kidding, it turns out it's real. Seriously, I heard somebody talking about
Starting point is 00:47:11 1.2-bit floating point. It's exactly like you're saying. Is that a joke? You can't be serious. And so now when you bring in other hardware, other hardware brings in more complexity, because suddenly a tensor has a different layout on AMD than it does on NVIDIA. Or maybe, to your point about warps, you have 64 threads in a warp on one and 32 threads in a warp on
Starting point is 00:47:24 the other. But what you realize is, wait a second, this really has nothing to do with hardware vendors. This is actually true even within, for example, the NVIDIA line, because across these different data types, the tensor cores are changing. The way the tensor core works for float32 is different than the way it works for float4 or something. And so you already
Starting point is 00:47:52 within one vendor have to have this very powerful metaprogramming to be able to handle the complexity and do so in the scaffolding of a single algorithm like matrix multiplication. And so now as you bring in other vendors, well, it turns out, hey, they all have things that look roughly like tensor cores. And so we're coming at this from a software engineering perspective. And so we're forced to build abstractions. We have this powerful metaprogramming system so we can actually achieve this. And so even for one vendor, we get this thing called LayoutTensor. LayoutTensor is saying, okay, well, I have the ability to reason about not just an array of numbers or a multidimensional array of numbers,
Starting point is 00:48:27 but also how it's laid out in memory and how it gets accessed. And so now we can declaratively map these things onto the hardware that you have in this abstraction stack. And so it's this really amazing triumvirate between having a type system that works well, and that's a very important basis. I know you're a fan of type systems also. You then bring in metaprogramming,
Starting point is 00:48:46 and so you can build powerful abstractions that run at compile time, so you get no runtime overhead. And then you bring in the most important part of this entire equation, which is programmers who understand the domain. I am not going to write a fast matrix multiplication. I'm sorry, that's not my experience. But there are people in that space that are just freaking brilliant. They understand exactly how the hardware works. They understand the use cases and the latest research and the new crazy quantized format of the day. But they're not compiler people. And so the magic of Mojo is it says, hey, you have a type system, you have
Starting point is 00:49:18 metaprogramming, you have effectively the full power of a compiler when you're building libraries. And so now these people that are brilliant at unlocking the power of the hardware can actually do this. And now they can write software that scales both across the complexity of the domain, but also across hardware. And to me, that's what I find so exciting and so powerful about this: it's like unlocking the power of the Mojo programmer instead of trying to put it into the compiler, which
Starting point is 00:49:43 is what a lot of earlier systems have tried to do. So maybe the key point here is that you get to build these abstractions that allow you to represent different kinds of hardware, and then you can conditionally have your code execute based on the kind of hardware that it's on. It's not like an #ifdef where you're picking between different hardware platforms. There are complicated data structures like these layout values that tell you how you can traverse data. Which is kind of a tree. This isn't just a simple int that you're passing around. This is like a recursive hierarchical tree that you need at compile time. The critical thing is you get to write a thing that feels like one synthetic
Starting point is 00:50:12 program with one understandable behavior. But then parts of it are actually going to execute at compile time so that the thing that you generate is in fact specialized for the particular platform that you're going to run it on. So one concern I have over this is it sounds like the configuration space of your programs is going to be massive. And I feel like there are two directions where this seems potentially hard to do from an engineering perspective. One is, can you really create abstractions that, within the context of the program, hide the relevant complexity so it's possible for people to think in a modular way about the program they're building? Their brains don't explode with the 70 different kinds of hardware that they might be
Starting point is 00:50:48 running it on. And then the other question is how do you think about testing, right? Because there's just so many configurations, how do you know whether it's working in all the places? Because it sounds like it has an enormous amount of freedom to do different things, including wrong things in some cases. How do you deal with those two problems, both controlling the complexity of the abstractions and then having a testing story that works out? Okay, Ron, I'm going to blow your mind. I know you're going to be resistant to this, but let me convince you that types are cool. Okay. I know you're going to fight me on this. Well, so this is, again, you go back to the challenges and opportunities working with either Python or C++. Python doesn't have
Starting point is 00:51:21 types, really. I mean, it has some stuff, but it doesn't really have a type system. C++ has a type system, but it's just incredibly painful to work with. And so what Mojo does is it says, again, it's not rocket science. We see it all around us. Let's bring in traits. Let's bring in a reasonable way to write code so that we can build abstractions that are domain-specific, and they can be checked modularly. And so one of the big problems with C++ is that you get error messages when you instantiate layers and layers and layers of templates. And so if you get some magical number wrong, it explodes spectacularly in a way that you can't reason about. And so what Mojo does is it says, cool, let's bring in traits that feel very much like protocols in Swift or traits in Rust or
Starting point is 00:52:01 type classes in Haskell. Like, this isn't novel. This is like a mechanism for what's called ad hoc polymorphism, meaning I want to have some operation or function that has some meaning, but actually it's going to get implemented in different ways for different types. And these are basically all mechanisms for, given the thing that you're doing and the types involved, looking up the right implementation that's going to do the thing that you want. Yeah, I mean, a very simple case is an iterator. So Mojo has an iterator trait and you can say, hey, well, what is an iterator over a collection? Well, you can either check to see if there's an element or you can get the value at the current element. And then as you keep pulling things out of an iterator,
Starting point is 00:52:36 it will eventually decide to stop. And so this concept can be applied to things like a linked list or an array or a dictionary or an unbounded sequence of packets coming off a network. And so you can write code that's generic across these different collections that implement this trait. And what the compiler will do for you is it will check to make sure when you're writing that generic code, you're not using something that won't work. And so what that does
Starting point is 00:53:01 is it means that you can check the generic code without having to instantiate it, which is good for compile time. It's good for user experience because if you get something wrong as a programmer, that's important. It's good for reasoning about the modularity of these different subsystems because now you have an interface that connects the two
Starting point is 00:53:17 components. I think it's an underappreciated problem with, like, the C++ templates approach to the world, where C++ templates, they seem like a deep language feature, but really they're just a code generation feature. They're like C macros. That's right. It both means they're hard to think about and reason about, because it sort of seems at first glance not to be so bad, this property that you don't really know when your template expands if it's actually going to compile. But as you start composing things more deeply, it gets worse and worse, because something somewhere is going to fail and it's just going to be hard to reason about and understand.
Starting point is 00:53:41 Whereas when you have type-level notions of genericity that are guaranteed to compose correctly and won't just blow up, you just drive that error right down. So that's one thing that's nice about getting past templates as a language feature. And then the other thing is it's just crushingly slow. You're generating almost exactly the same code over and over and over again. And so that just means you can't save any of the compilation work. You just have to redo the whole thing from scratch. That's exactly right. And so this is where, again, we're talking about
Starting point is 00:54:19 sand in the system. These little things that if you get wrong, they play forward and they cause huge problems. The metaprogramming approach in Mojo is cool both for usability and compile time and correctness. Coming back to your point about portability, it's also valuable for portability, because what it means is that the compiler parses your code and it parses it generically and has no idea what the target is. And so when Mojo generates the first level of intermediate representation, the compiler representation for the code, it's not hard coding in that pointers are 32-bit or 64-bit or that you're on x86 or whatever. And what this means is that you can take generic code in Mojo and you can put it on a CPU
Starting point is 00:54:57 and you can put it on a GPU. Same code, same function. And again, these crazy compilery things that Chris gets obsessed about, it means that you can slice out the chunk of code that you want to put onto your GPU in a way that it looks like a distributed system, but it's a distributed system where the GPU is actually a crazy embedded device that wants this tiny snippet of code and it wants it fully self-contained. These worlds are things that normal programming languages haven't even thought about. So does that mean when I compile a Mojo program, I get a shippable executable that contains
Starting point is 00:55:29 within it another little compiler that can take the Mojo code and specialize it to get the actual machine code for the final destination that you need? Do I bundle together all the compilers for all the possible platforms in every Mojo executable? The answer is no. The world's not ready for that. And there are use cases for JIT compilers and things like this, and that's cool. But the default way of building, if you just run mojo build, then it will give you just an a.out executable, a normal thing. But if you build a Mojo package, the Mojo package retains portability.
Starting point is 00:55:57 This is a big difference. This is what Java does. Java, in a completely different way and for different reasons, in a different ecosystem universe, parses all of your source code without knowing what the target is, and it generates Java bytecode. And so it's not 1995 anymore. The way we do this is completely different. We're not Java, obviously, and we have a type system that's very different. But this concept is something that's been well known but that, at least in the world of compiled languages like Swift and C++ and Rust, has kind of been forgotten. So the Mojo package is kind of shipped with the compiler technology required to specialize to the different domains.
Starting point is 00:56:30 And so, again, by default, if you're a user, you're sitting on your laptop and you say compile a Mojo program, you just want an executable. But the compiler technology has all these powerful features and they can be used in different ways. And this is similar to LLVM, where LLVM had a just-in-time compiler. And that's really important if you're Sony Pictures and you're rendering shaders for some fancy movie, but that's not what you'd want to use if you're just writing C++ code that needs to be ahead-of-time compiled. I mean, there's some echoes here also of the PTX story with NVIDIA. NVIDIA has this thing that they sort of hide: it's an intermediate representation, this thing called PTX, which is a portable bytecode essentially.
Starting point is 00:57:04 And they, for many years, maintained compatibility across many, many different generations of GPUs. They have a thing they call the assembler that's part of the driver stack for loading code on. And it's really not an assembler. It's like a real compiler that takes the PTX and compiles it down to SASS, the accelerator-specific machine code, which they very carefully do not fully document because they don't want to give away all of their secrets. And so there's a built-in portability story there where it's meant to actually be portable in the future across new generations. Although, as you were pointing out before, it in fact doesn't always succeed. And there are now some programs that will not actually make the transition to Blackwell.
Starting point is 00:57:38 So that's in the category of what I'd consider to be like a virtual machine, a very low-level virtual machine, by the way. And so when you're looking at these systems, the thing I'd ask is, what is the type system? And so if you look at PTX, because as you're saying, you're totally right, it's an abstraction between a whole bunch of source code on the top end and then the specific SASS hardware thing on the back end. But the type system isn't very interesting. It's pointers and registers and memory, right? And so Java, what is the type system? Well, Java achieves portability by making the type system in its bytecode expose objects. And so it's a much higher-level abstraction, dynamic virtual dispatch,
Starting point is 00:58:08 that's all part of the Java ecosystem. Mojo's isn't a bytecode, but the representation that's portable maintains the full generic system. And so this is what makes it possible to say, okay, well, I'm going to take this code, compile once to a package, and now go specialize and instantiate this for a device. And so the way that works is a little bit different, but it enables, coming back to your original question of safety and correctness, all the checking to happen the right way.
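To ground the trait mechanism from a few minutes back, here is a minimal sketch (illustrative names of our own, not the standard library's, and hedged against syntax drift between Mojo releases). The generic function is checked once against the trait, rather than re-checked at every instantiation the way a C++ template would be:

```mojo
# A trait declares the operations that generic code may rely on.
trait HasValue:
    fn value(self) -> Int: ...

# A concrete type conforming to the trait. `@value` synthesizes the
# boilerplate initializers in current Mojo.
@value
struct Score(HasValue):
    var points: Int

    fn value(self) -> Int:
        return self.points

# Type-checked modularly against HasValue: if this body used an
# operation the trait doesn't declare, it would fail to compile here,
# not at some distant instantiation site.
fn larger[T: HasValue](a: T, b: T) -> Int:
    if a.value() >= b.value():
        return a.value()
    return b.value()

fn main():
    print(larger(Score(3), Score(5)))  # prints 5
```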
Starting point is 00:58:24 Right, there's also a huge shift in control. With PTX, the machine-specific details of how it's compiled are totally out of the programmer's control.
Starting point is 00:58:44 You can generate the best PTX you can, and then it's going to get compiled. How? Somehow, don't ask too many questions. It's going to do what it's going to do. Whereas here you're preserving in the portable object the programmer-driven instructions about how the specialization is going to work. You've just partially executed your compilation.
Starting point is 00:59:00 You've got part way down, and then there's some more that's going to be done at the end when you pick actually where you're going to run it. Exactly. And so these are all very nerdy pieces that go into the stack, but the thing that I like is if you bubble out of that, it's easy to use. It works. It gives good error messages, right? I don't understand the Greek letters, but I do understand a lot of the engineering that goes into this. The way this technology stack builds up, the whole purpose is to unlock compute. And we want new programmers to be able to get into the system. And if they know Python, if they understand some of the basics of the hardware, they can be effective. And then they don't get limited to 80% of the performance. They can keep driving and keep growing in sophistication. And maybe not everybody wants to do that. They can stop at 80%. But if you do want to go all the way, then you can get there.
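As a flavor of what that partially executed compilation looks like in source, here is a hedged sketch (with made-up tile numbers, not code from the episode): Mojo's @parameter if is resolved at compile time, so each specialization carries only the branch for its own configuration.

```mojo
# `dtype` is a compile-time parameter; the branch below is folded away
# during compilation -- that folding is the partial execution.
fn tile_elems[dtype: DType]() -> Int:
    @parameter
    if dtype == DType.float16:
        # Hypothetical tile size: 16-bit elements mean twice as many
        # fit in cache, so the blocking strategy changes.
        return 128
    else:
        return 64

fn main():
    print(tile_elems[DType.float16]())  # prints 128
    print(tile_elems[DType.float32]())  # prints 64
```

Real kernels condition on far more than the element type, but the mechanism is the same: the configuration is ordinary typed code rather than an external build flag.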
Starting point is 00:59:39 So one thing I'm curious about is how do you actually manage to keep it simple? You said that Mojo is meant to be Python and you talked a bunch about the syntax. But actually, one of the nice things about Python is it's simple in some ways in a deeper sense. The fact that there isn't by default a complicated type system with complicated type errors to think about. There's a lot of problems with that,
Starting point is 00:59:57 but it's also a real source of simplicity for users who are trying to learn the system. Dynamic errors at runtime are in some ways easier to understand: I wrote a program and it tried to do a thing and it tripped over this particular thing, and you can see it tripping over it. And in some ways, that's easier to understand. When you're going to a language which, for both safety and performance reasons, needs much more precise type-level control, how do you do that in a way that still feels Pythonic in terms of the base simplicity that you're exposing to users?
Starting point is 01:00:24 I can't give you the perfect answer, but I can tell you my current thoughts. So again, learn from history. Swift had a lot of really cool features, but it spiraled and got a lot of complexity that got layered in over time. And also one of the challenges with Swift is it had a team that was paid to add features to Swift. It's never a good thing. Well, you have a C++ committee.
Starting point is 01:00:45 What is the C++ committee going to do? They're going to keep adding features to C++. Don't expect C++ to get smaller. It's common sense. And so with Mojo, there's a couple of different things. So one of which is start from Python. So Python being the surface level syntax enables me as management to be able to push back and say,
Starting point is 01:01:02 look, let's make sure we're implementing the full power of the Python ecosystem, and let's have lists and comprehensions and like all this stuff before just inventing random stuff because it might be useful. But there's also, for me, personally, a significant back pressure on complexity. How can we factor these things? How can we get, for example, the metaprogramming system to subsume a lot of complexity that would otherwise exist?
Starting point is 01:01:24 And there are fundamental things that I want us to add, for example, checked generics, things like this, because they have a better user experience. They're part of the metaprogramming system. They're part of the core additions that we're making. But I don't want Mojo to turn into a language that adds every feature that every other language has just because it's useful to somebody. I was actually inspired by and learned a lot from Go. And it's a language that people are probably surprised to hear me talk about.
Starting point is 01:01:51 Go, I think, did a really good job of intentionally constraining the language with Go 1. And they took a lot of heat for that. They didn't add a generic system. And everybody, myself included, was like, ha, ha, ha, why doesn't this language even have a generic system? You're not even a modern language. But they held the line. They understood how far people could get. And then they did a really good job of adding generics to Go, too.
Starting point is 01:02:12 And I thought they did a great job. There was a recent blog post I was reading talking about Go. And apparently they have an 80-20 rule. And they say they want to have 80% of the features with 20% of the complexity, something like that. And the observation is that that's a point in the space that annoys everybody. Because everybody wants 81% of the features, but 81% of the features maybe gives you 35% of the complexity. And so figuring out where to draw that line
Starting point is 01:02:41 and figuring out where to say no. For example, we have people in the community that are asking for very reasonable things that exist in Rust. And Rust is a wonderful language. I love it. There's a lot of great ideas, and we shamelessly pull good ideas from everywhere,
Starting point is 01:02:55 but I don't want the complexity. I often like to say that one of the most critical things about a language design is maintaining the power-to-weight ratio. You want to get an enormous amount of good functionality and power and good user experience while minimizing that complexity. I think it is a very challenging thing to manage. And I think it's actually a thing that we are seeing a lot as well. We are also doing a lot to extend OCaml in all sorts of ways, pulling from all sorts
Starting point is 01:03:18 of languages, including Rust. And again, doing it in a way where the language maintains its basic character and maintains its simplicity is a real challenge. And it's kind of hard to know if you're hitting the actual right point on that. And it's easier to do in a world where you can take things back, try things out and decide that maybe they don't work and then adjust your behavior. And we're trying to iterate a lot in that mode, which is a thing you can do under certain circumstances; it gets harder as you have a big open source language
Starting point is 01:03:42 that lots of people are using. That's a really great point. And so one of the other lessons I learned from Swift is that with Swift I pushed very early to have an open design process where anybody could come in, write a proposal, and then it would be evaluated by the language committee. And then if it was good, it would be implemented and put into Swift. Again, be careful what you wish for. That enabled a lot of people with really good ideas to add a bunch of features to Swift.
Starting point is 01:04:02 And so with Mojo as a counterbalance, I really want the core team to be small. I want the core team not to just be able to add a whole bunch of stuff because it might be useful someday, but to be really deliberate about how we add things, how we evolve things. How are you thinking about maintaining backwards compatibility guarantees as you evolve it forward? We're actively debating and discussing what Mojo 1.0 looks like. So I'm not going to give a time frame, but it will hopefully not be very far away. And what I am fond of is this notion of semantic versioning. And so saying we're going to have a 1.0, and then we're going to have a 2.0,
Starting point is 01:04:36 and we're going to have a 3.0, and we're going to have a 4.0, et cetera. And each of these will be able to be incompatible, but they can link together. And so one of the big challenges and a lot of the damage in the Python ecosystem was from the Python 2 to 3 conversion. It took 15 years, and it was a heroic mess for many different reasons. The reason it took so long is because you have to convert the entire package ecosystem before you can be on 3.0. And so if you contrast that to something like C++, let me say good things about C++. They got the ABI right. And so once the ABI was set, then you could have one package built in C++98 and one package
Starting point is 01:05:13 built in C++23. And these things would interoperate and be compatible even if you took new keywords or other things in a future language version. And so what I see for Mojo is much more similar to maybe the C++ ecosystem or something like this. But that allows us to be a little bit more aggressive in terms of migrating code and in terms of fixing bugs and moving the language forward. But I want to make sure that Mojo 2.0 and Mojo 1.0 packages work together
Starting point is 01:05:37 and that there's good tooling, probably AI driven, but good tooling that moves you from 1.0 to 2.0 and lets us manage the ecosystem that way. I think the type system also helps an enormous amount. I think one of the reasons the Python migration was so hard is you couldn't be like, and then let me try and build this with Python 3 and see what's broken. You could only see what's broken by actually walking all of the execution paths of your program. And if you didn't have enough testing, that would be very hard. And even if you did, it wasn't that easy.
Starting point is 01:06:02 Whereas with a strong type system, you can get an enormous amount of very precise guidance. And actually, the combination of a strong type system and an agentic coding system is awesome. We actually have a bunch of experience of just trying these things out now where you make some small change to the type of something. And then you're like, hey, AI system,
Starting point is 01:06:17 please run down all the type errors, fix them all. And it does surprisingly well. I absolutely agree. There's other components to this. So Rust has done a very good job with the stabilization approach with crates and APIs. And so I think that's a really good thing. And so I think we'll take good ideas
Starting point is 01:06:30 from many of these different ecosystems and hopefully do something that works well and works well for the ecosystem, and allows us to scale without being completely constrained by never being able to fix something once it gets, you know, past 1.0. I'm actually curious just to go to the agentic programming thing for a second, which is having AI agents
Starting point is 01:06:46 that write good kernels is actually pretty hard. And I'm curious what your experience is of how things work with Mojo. Mojo is obviously not a language deeply embedded in the training set that these models were built on. But on the other hand, you have this very strong type structure that can guide the process of the AI agent trying to write and modify code. I'm curious how that pans out in practice as you try and use these tools.
Starting point is 01:07:08 So this is why Mojo being open source matters. And so we have hundreds of thousands of lines of Mojo code that are public with all these GPU kernels and like all this other cool stuff. And we have a community of people writing more code. Having hundreds of thousands of lines of Mojo code is fantastic. You can point your coding tool, Cursor, or whatever it is at that repo and say, go learn about this repo and index it. So it's not that you have to train the model to know the language; just having access to it enables it to do good work, and these tools are phenomenal.
Starting point is 01:07:36 And so that's been very, very, very important. And so we have instructions on our webpage for how to set up these tools. And there's a huge difference if you set it up right so that it can index that or if you don't. And make sure to follow that markdown file that explains how to set up the agent tooling. So I want to talk a little bit about the future of Mojo. I think that the current way that Modular and you have been talking about Mojo, these days at least,
Starting point is 01:07:58 is as an alternative to CUDA: a full top-to-bottom stack for writing programs that execute on GPUs. But that's not like the only way you've ever talked about Mojo. You've also talked about it in other ways; especially earlier on, I think there was more discussion of Mojo as an extension and maybe evolution of and maybe eventually replacement of Python. And I'm curious, how do you think about that now? To what degree do you think of Mojo as its own new language that takes inspiration and syntax from Python?
Starting point is 01:08:24 And to what degree do you want something that's more deeply integrated over time? So today, to pull it back to what is Mojo useful for today and how we explain it: Mojo is useful if you want code to go fast. If you have code on a CPU or a GPU and you want it to go fast, Mojo is a great thing. One of the really cool things that is available now, but it's in preview and it will solidify in the next month or something, is it's also the best way to extend Python. And so if you have a large-scale Python code base, again, tell me if this sounds familiar: you are coding away and you're doing cool stuff in Python. And then it starts to get slow. Typically what people do is they have to either go rewrite the whole thing in Rust or C,
Starting point is 01:08:58 or they carve out some chunk of it and move some chunk of that package to C++ or Rust. This is what NumPy or PyTorch or like all modern large-scale Python code bases end up doing. If you look at the package mirrors and look at the percentage of packages that have C extensions in them, it's shockingly high. A really large fraction of Python stuff is actually part Python and part some other language, almost always C and C++, a little bit of Rust. That's right. And so today, and this isn't a distant future, today you can take your Python package and you can create a Mojo file and you can say, okay, well, these for loops are
Starting point is 01:09:33 slow. Move it over to Mojo. We have people, for example, doing bioinformatics and other crazy stuff I know nothing about saying, okay, well, I'm just taking my Python code. I move it over to Mojo. Wow, now I get types. I get these benefits, but there's no bindings. The pip experience is beautiful. It's super simple. You don't have to have FFIs and nanobind and like all this complexity to be able to do this. You also are not moving from Python with its syntax to curly braces and borrow checkers and other craziness; you now get a very simple and seamless way to extend your Python package. And we have people that say, okay, well, I did that. I got it first 10x, then 100x, then 1,000x faster on CPU. But then because it was easy, I just put it on a GPU. And so to me,
Starting point is 01:10:13 this is amazing, because these are people that didn't even think about it and would never have gotten on a GPU if they switched to Rust or something like that. Again, the way I explain it is Mojo is good for performance. It's good if you want to go fast on a GPU, if you want to make Python go fast, or if you want to, I mean, some people are crazy and go whole hog and just write entirely-from-scratch Mojo programs, and that's super cool. If you fast forward six, nine months, something, I think that Mojo will be a very credible
Starting point is 01:10:37 top-to-bottom replacement for Rust. And so we need a few more extensions to the generic system. And there's a few things I want to let bake a little bit. Some of the dynamic features that Rust has, for example existentials, the ability to use a trait at runtime, are missing in Mojo. And so we'll add a few of those kinds of features. And as we do that, I think that will be really interesting as an applications-level programming language
Starting point is 01:10:56 for people that care about this kind of stuff. You fast forward, and I'm not even projecting a time frame, maybe a year, 18 months from now, it depends on how we prioritize things, and we'll add classes. And so as we add classes, suddenly it will look and feel to a Python programmer much more familiar.
Starting point is 01:11:11 And so the classes in Mojo will be intentionally designed to be very similar to Python. And at that point, we'll have something that looks and feels kind of like a Python 4. It's very much cut from the same mold as Python. It integrates really well with Python. It's really easy to extend Python. And so it's very much a member of the Python family,
Starting point is 01:11:28 but it's not compatible with Python. And so what we'll do over the course of N years, and I can't predict exactly how long that is, is continue to run down the line of, okay, well, how much compatibility do we want to add to this thing? And then I think that at some point, people will consider it to be a Python superset. And effectively, it will feel just like the best way to do Python in general.
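For a flavor of the hot-loop workflow described above, here is a hedged sketch showing only the Mojo side (the Python-bindings layer is in preview, so the export mechanism is deliberately left out). In Python this might have been sum(i * i for i in range(n)):

```mojo
# A loop moved out of Python into Mojo: the same look and feel,
# but it compiles to a tight native loop instead of interpreted bytecode.
fn sum_of_squares(n: Int) -> Int:
    var acc = 0
    for i in range(n):
        acc += i * i
    return acc
```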
Starting point is 01:11:47 And I think that that will come in time. But to bring it all the way back, I want us to be very focused on what is Mojo useful for today. And so great claims require great proof. We have no proof that we can do this. I have a vision and a future in my brain, and I've built a few languages and scaled some things before. And so I have quite high confidence that we can do this,
Starting point is 01:12:08 but I want people to zero back in to, okay, if you're writing performance code, if you're writing GPU kernels or AI, if you have Python code that's going too slow. If any of you have that problem, then Mojo can be very useful. And hopefully it'll be even more useful to more people in the future. Right.
Starting point is 01:12:22 And I think already the practical short-term thing is already plenty ambitious and exciting on its own. Seems like a great thing to focus on. Yeah, let's solve heterogeneous compute in AI. That's actually a pretty useful thing, right? All right, that seems like a great place to stop. Thank you so much for joining me. Yeah, well, thank you for having me.
Starting point is 01:12:38 I love nerding out with you, and I hope it's useful and interesting to other people, too. But even if not, I had a lot of fun with you. You'll find a complete transcript of the episode, along with show notes and links, at signalsandthreads.com. Thanks for joining us. See you next time.
