Latent Space: The AI Engineer Podcast - LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Episode Date: August 10, 2023

We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email sponsors@ai.engineer if you’d like to support.We are facing a massive GPU crunch. As both ...startups and VC’s hoard Nvidia GPUs like countries count nuclear stockpiles, tweets about GPU shortages have become increasingly common. But what if we could run LLMs with AMD cards, or without a GPU at all? There’s just one weird trick: compilation. And there’s one person uniquely qualified to do it.We had the pleasure to sit down with Tianqi Chen, who’s an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, as well as the co-founder of OctoML. The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.* MLC LLM: a framework that allows any language models to be deployed natively on different hardware and software stacks.The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to the NVIDIA’s counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD’s 7900s vs some of top NVIDIA consumer cards.If you just can’t get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While speed performance isn’t comparable, it allows for non-time-sensitive inference to be run on commodity hardware.We also enjoyed getting a peek into TQ’s process, which involves a lot of sketching:With all the other work going on in this space with projects like ggml and Ollama, we’re excited to see GPUs becoming less and less of an issue to get models in the hands of more people, and innovative software solutions to hardware problems!Show Notes* TQ’s Projects:* XGBoost* Apache TVM* MXNet* MLC* OctoML* CMU Catalyst* ONNX* GGML* Mojo* WebLLM* RWKV* HiPPO* Tri Dao’s Episode* George Hotz EpisodePeople:* Carlos Guestrin* Albert GuTimestamps* [00:00:00] Intros* [00:03:41] The creation of XGBoost and its surprising popularity* [00:06:01] Comparing tree-based models vs deep learning* [00:10:33] Overview of TVM and how it works with ONNX* [00:17:18] MLC deep dive* [00:28:10] Using int4 quantization for inference of language models* [00:30:32] Comparison of MLC to other model optimization projects* [00:35:02] Running large language models in the browser with WebLLM* [00:37:47] Integrating browser models into applications* [00:41:15] OctoAI and self-optimizing compute* [00:45:45] Lightning RoundTranscriptAlessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: Okay, and we are here with Tianqi Chen, or TQ as people call him, who is assistant professor in ML computer science at CMU, Carnegie Mellon University, also helping to run Catalyst Group, also chief technologist of OctoML. You wear many hats. Are those, you know, your primary identities these days? Of course, of course. [00:00:42]Tianqi: I'm also, you know, very enthusiastic open source. So I'm also a VP and PRC member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far. [00:00:53]Swyx: Yeah. So you did Apache TVM, XGBoost, and MXNet, and we can cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side? [00:01:08]Tianqi: Let me say, yeah, so normally when I do, I really love coding, even though like I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks to draw down the design diagrams and the sketchbooks I keep sketching over the years, and now I have like three or four of them. And it's kind of a usually a fun experience of thinking the design through and also seeing how open source project evolves and also looking back at the sketches that we had in the past to say, you know, all these ideas really turn into code nowadays. [00:01:43]Alessio: How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, he'll be a very accomplished engineer. Like you built like three of these. What's that process like for you? Like it's the sketchbook, like the start, and then you think about the code or like. [00:01:59]Swyx: Yeah. [00:02:00]Tianqi: So, so usually I start sketching on high level architectures and also in a project that works for over years, we also start to think about, you know, new directions, like of course generative AI language model comes in, how it's going to evolve. So normally I would say it takes like one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out and then gives a more like a high level architectural guide for some of the future items. Yeah. [00:02:28]Swyx: Have you ever published this sketchbooks? Cause I think people would be very interested on, at least on a historical basis. Like this is the time where XGBoost was born, you know? Yeah, not really. [00:02:37]Tianqi: I started sketching like after XGBoost. So that's a kind of missing piece, but a lot of design details in TVM are actually part of the books that I try to keep a record of. [00:02:48]Swyx: Yeah, we'll try to publish them and publish something in the journals. Maybe you can grab a little snapshot for visual aid. Sounds good. [00:02:57]Alessio: Yeah. And yeah, talking about XGBoost, so a lot of people in the audience might know it's a gradient boosting library, probably the most popular out there. And it became super popular because many people started using them in like a machine learning competitions. And I think there's like a whole Wikipedia page of like all state-of-the-art models. They use XGBoost and like, it's a really long list. When you were working on it, so we just had Tri Dao, who's the creator of FlashAttention on the podcast. And I asked him this question, it's like, when you were building FlashAttention, did you know that like almost any transform race model will use it? And so I asked the same question to you when you were coming up with XGBoost, like, could you predict it would be so popular or like, what was the creation process? And when you published it, what did you expect? We have no idea. [00:03:41]Tianqi: Like, actually, the original reason that we built that library is that at that time, deep learning just came out. Like that was the time where AlexNet just came out. And one of the ambitious mission that myself and my advisor, Carlos Guestrin, then is we want to think about, you know, try to test the hypothesis. Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, support vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually like one of the key characteristics of deep learning is that it's taking a lot [00:04:22]Swyx: of data, right? [00:04:23]Tianqi: So we will be able to get the same amount of performance. That's a hypothesis we're setting out to test. Of course, if you look at now, right, that's a wrong hypothesis, but as a byproduct, what we find out is that, you know, most of the gradient boosting library out there is not efficient enough for us to test that hypothesis. So I happen to have quite a bit of experience in the past of building gradient boosting trees and their variants. So Effective Action Boost was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges, like I worked on KDDCup and then Kaggle kind of become bigger, right? So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding of it. That tends to be like a very good decision, right, to be effective. Usually when I build it, we feel like maybe a command line interface is okay. And now we have a Python binding, we have R bindings. And then it realized, you know, it started getting interesting. People started contributing different perspectives, like visualization and so on. So we started to push a bit more on to building distributive support to make sure it works on any platform and so on. And even at that time point, when I talked to Carlos, my advisor, later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interestingly, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then. And it turns out, you know, actually, we are both wrong in some sense, and Deep Neural Network was the king in the hill. But at least the gradient boosting direction got into something fruitful. [00:06:01]Swyx: Interesting. [00:06:02]Alessio: I'm always curious when it comes to these improvements, like, what's the design process in terms of like coming up with it? And how much of it is a collaborative with like other people that you're working with versus like trying to be, you know, obviously, in academia, it's like very paper-driven kind of research driven. [00:06:19]Tianqi: I would say the extra boost improvement at that time point was more on like, you know, I'm trying to figure out, right. But it's combining lessons. Before that, I did work on some of the other libraries on matrix factorization. That was like my first open source experience. Nobody knew about it, because you'll find, likely, if you go and try to search for the package SVD feature, you'll find some SVN repo somewhere. But it's actually being used for some of the recommender system packages. So I'm trying to apply some of the previous lessons there and trying to combine them. The later projects like MXNet and then TVM is much, much more collaborative in a sense that... But, of course, extra boost has become bigger, right? So when we started that project myself, and then we have, it's really amazing to see people come in. Michael, who was a lawyer, and now he works on the AI space as well, on contributing visualizations. Now we have people from our community contributing different things. So extra boost even today, right, it's a community of committers driving the project. So it's definitely something collaborative and moving forward on getting some of the things continuously improved for our community. [00:07:37]Alessio: Let's talk a bit about TVM too, because we got a lot of things to run through in this episode. [00:07:42]Swyx: I would say that at some point, I'd love to talk about this comparison between extra boost or tree-based type AI or machine learning compared to deep learning, because I think there is a lot of interest around, I guess, merging the two disciplines, right? And we can talk more about that. I don't know where to insert that, by the way, so we can come back to it later. Yeah. [00:08:04]Tianqi: Actually, what I said, when we test the hypothesis, the hypothesis is kind of, I would say it's partially wrong, because the hypothesis we want to test now is, can you run tree-based models on image classification tasks, where deep learning is certainly a no-brainer right [00:08:17]Swyx: now today, right? [00:08:18]Tianqi: But if you try to run it on tabular data, still, you'll find that most people opt for tree-based models. And there's a reason for that, in the sense that when you are looking at tree-based models, the decision boundaries are naturally rules that you're looking at, right? And they also have nice properties, like being able to be agnostic to scale of input and be able to automatically compose features together. And I know there are attempts on building neural network models that work for tabular data, and I also sometimes follow them. I do feel like it's good to have a bit of diversity in the modeling space. Actually, when we're building TVM, we build cost models for the programs, and actually we are using XGBoost for that as well. I still think tree-based models are going to be quite relevant, because first of all, it's really to get it to work out of the box. And also, you will be able to get a bit of interoperability and control monotonicity [00:09:18]Swyx: and so on. [00:09:19]Tianqi: So yes, it's still going to be relevant. I also sometimes keep coming back to think about, are there possible improvements that we can build on top of these models? And definitely, I feel like it's a space that can have some potential in the future. [00:09:34]Swyx: Are there any current projects that you would call out as promising in terms of merging the two directions? [00:09:41]Tianqi: I think there are projects that try to bring a transformer-type model for tabular data. I don't remember specifics of them, but I think even nowadays, if you look at what people are using, tree-based models are still one of their toolkits. So I think maybe eventually it's not even a replacement, it will be just an ensemble of models that you can call. Perfect. [00:10:07]Alessio: Next up, about three years after XGBoost, you built this thing called TVM, which is now a very popular compiler framework for models. Let's talk about, so this came out about at the same time as ONNX. So I think it would be great if you could maybe give a little bit of an overview of how the two things work together. Because it's kind of like the model, then goes to ONNX, then goes to the TVM. But I think a lot of people don't understand the nuances. I can get a bit of a backstory on that. [00:10:33]Tianqi: So actually, that's kind of an ancient history. Before XGBoost, I worked on deep learning for two years or three years. I got a master's before I started my PhD. And during my master's, my thesis focused on applying convolutional restricted Boltzmann machine for ImageNet classification. That is the thing I'm working on. And that was before AlexNet moment. So effectively, I had to handcraft NVIDIA CUDA kernels on, I think, a GTX 2070 card. I have a 22070 card. It took me about six months to get one model working. And eventually, that model is not so good, and we should have picked a better model. But that was like an ancient history that really got me into this deep learning field. And of course, eventually, we find it didn't work out. So in my master's, I ended up working on recommender system, which got me a paper, and I applied and got a PhD. But I always want to come back to work on the deep learning field. So after XGBoost, I think I started to work with some folks on this particular MXNet. At that time, it was like the frameworks of CAFE, Ciano, PyTorch haven't yet come out. And we're really working hard to optimize for performance on GPUs. At that time, I found it's really hard, even for NVIDIA GPU. It took me six months. And then it's amazing to see on different hardwares how hard it is to go and optimize code for the platforms that are interesting. So that gets me thinking, can we build something more generic and automatic? So that I don't need an entire team of so many people to go and build those frameworks. So that's the motivation of starting working on TVM. There is really too little about machine learning engineering needed to support deep learning models on the platforms that we're interested in. I think it started a bit earlier than ONNX, but once it got announced, I think it's in a similar time period at that time. So overall, how it works is that TVM, you will be able to take a subset of machine learning programs that are represented in what we call a computational graph. Nowadays, we can also represent a loop-level program ingest from your machine learning models. Usually, you have model formats ONNX, or in PyTorch, they have FX Tracer that allows you to trace the FX graph. And then it goes through TVM. We also realized that, well, yes, it needs to be more customizable, so it will be able to perform some of the compilation optimizations like fusion operator together, doing smart memory planning, and more importantly, generate low-level code. So that works for NVIDIA and also is portable to other GPU backends, even non-GPU backends [00:13:36]Swyx: out there. [00:13:37]Tianqi: So that's a project that actually has been my primary focus over the past few years. And it's great to see how it started from where I think we are the very early initiator of machine learning compilation. I remember there was a visit one day, one of the students asked me, are you still working on deep learning frameworks? I tell them that I'm working on ML compilation. And they said, okay, compilation, that sounds very ancient. It sounds like a very old field. And why are you working on this? And now it's starting to get more traction, like if you say Torch Compile and other things. I'm really glad to see this field starting to pick up. And also we have to continue innovating here. [00:14:17]Alessio: I think the other thing that I noticed is, it's kind of like a big jump in terms of area of focus to go from XGBoost to TVM, it's kind of like a different part of the stack. Why did you decide to do that? And I think the other thing about compiling to different GPUs and eventually CPUs too, did you already see some of the strain that models could have just being focused on one runtime, only being on CUDA and that, and how much of that went into it? [00:14:50]Tianqi: I think it's less about trying to get impact, more about wanting to have fun. I like to hack code, I had great fun hacking CUDA code. Of course, being able to generate CUDA code is cool, right? But now, after being able to generate CUDA code, okay, by the way, you can do it on other platforms, isn't that amazing? So it's more of that attitude to get me started on this. And also, I think when we look at different researchers, myself is more like a problem solver type. So I like to look at a problem and say, okay, what kind of tools we need to solve that problem? So regardless, it could be building better models. For example, while we build extra boots, we build certain regularizations into it so that it's more robust. It also means building system optimizations, writing low-level code, maybe trying to write assembly and build compilers and so on. So as long as they solve the problem, definitely go and try to do them together. And I also see it's a common trend right now. Like if you want to be able to solve machine learning problems, it's no longer at Aggressor layer, right? You kind of need to solve it from both Aggressor data and systems angle. And this entire field of machine learning system, I think it's kind of emerging. And there's now a conference around it. And it's really good to see a lot more people are starting to look into this. [00:16:10]Swyx: Yeah. Are you talking about ICML or something else? [00:16:13]Tianqi: So machine learning and systems, right? So not only machine learning, but machine learning and system. So there's a conference called MLsys. It's definitely a smaller community than ICML, but I think it's also an emerging and growing community where people are talking about what are the implications of building systems for machine learning, right? And how do you go and optimize things around that and co-design models and systems together? [00:16:37]Swyx: Yeah. And you were area chair for ICML and NeurIPS as well. So you've just had a lot of conference and community organization experience. Is that also an important part of your work? Well, it's kind of expected for academic. [00:16:48]Tianqi: If I hold an academic job, I need to do services for the community. Okay, great. [00:16:53]Swyx: Your most recent venture in MLsys is going to the phone with MLCLLM. You announced this in April. I have it on my phone. It's great. I'm running Lama 2, Vicuña. I don't know what other models that you offer. But maybe just kind of describe your journey into MLC. And I don't know how this coincides with your work at CMU. Is that some kind of outgrowth? [00:17:18]Tianqi: I think it's more like a focused effort that we want in the area of machine learning compilation. So it's kind of related to what we built in TVM. So when we built TVM was five years ago, right? And a lot of things happened. We built the end-to-end machine learning compiler that works, the first one that works. But then we captured a lot of lessons there. So then we are building a second iteration called TVM Unity. That allows us to be able to allow ML engineers to be able to quickly capture the new model and how we demand building optimizations for them. And MLCLLM is kind of like an MLC. It's more like a vertical driven organization that we go and build tutorials and go and build projects like LLM to solutions. So that to really show like, okay, you can take machine learning compilation technology and apply it and bring something fun forward. Yeah. So yes, it runs on phones, which is really cool. But the goal here is not only making it run on phones, right? The goal is making it deploy universally. So we do run on Apple M2 Macs, the 17 billion models. Actually, on a single batch inference, more recently on CUDA, we get, I think, the most best performance you can get out there already on the 4-bit inference. Actually, as I alluded earlier before the podcast, we just had a result on AMD. And on a single batch, actually, we can get the latest AMD GPU. This is a consumer card. It can get to about 80% of the 4019, so NVIDIA's best consumer card out there. So it's not yet on par, but thinking about how diversity and what you can enable and the previous things you can get on that card, it's really amazing that what you can do with this kind of technology. [00:19:10]Swyx: So one thing I'm a little bit confused by is that most of these models are in PyTorch, but you're running this inside a TVM. I don't know. Was there any fundamental change that you needed to do, or was this basically the fundamental design of TVM? [00:19:25]Tianqi: So the idea is that, of course, it comes back to program representation, right? So effectively, TVM has this program representation called TVM script that contains more like computational graph and operational representation. So yes, initially, we do need to take a bit of effort of bringing those models onto the program representation that TVM supports. Usually, there are a mix of ways, depending on the kind of model you're looking at. For example, for vision models and stable diffusion models, usually we can just do tracing that takes PyTorch model onto TVM. That part is still being robustified so that we can bring more models in. On language model tasks, actually what we do is we directly build some of the model constructors and try to directly map from Hugging Face models. The goal is if you have a Hugging Face configuration, we will be able to bring that in and apply optimization on them. So one fun thing about model compilation is that your optimization doesn't happen only as a soft language, right? For example, if you're writing PyTorch code, you just go and try to use a better fused operator at a source code level. Torch compile might help you do a bit of things in there. In most of the model compilations, it not only happens at the beginning stage, but we also apply generic transformations in between, also through a Python API. So you can tweak some of that. So that part of optimization helps a lot of uplifting in getting both performance and also portability on the environment. And another thing that we do have is what we call universal deployment. So if you get the ML program into this TVM script format, where there are functions that takes in tensor and output tensor, we will be able to have a way to compile it. So they will be able to load the function in any of the language runtime that TVM supports. So if you could load it in JavaScript, and that's a JavaScript function that you can take in tensors and output tensors. If you're loading Python, of course, and C++ and Java. So the goal there is really bring the ML model to the language that people care about and be able to run it on a platform they like. [00:21:37]Swyx: It strikes me that I've talked to a lot of compiler people, but you don't have a traditional compiler background. You're inventing your own discipline called machine learning compilation, or MLC. Do you think that this will be a bigger field going forward? [00:21:52]Tianqi: First of all, I do work with people working on compilation as well. So we're also taking inspirations from a lot of early innovations in the field. Like for example, TVM initially, we take a lot of inspirations from Halide, which is just an image processing compiler. And of course, since then, we have evolved quite a bit to focus on the machine learning related compilations. If you look at some of our conference publications, you'll find that machine learning compilation is already kind of a subfield. So if you look at papers in both machine learning venues, the MLC conferences, of course, and also system venues, every year there will be papers around machine learning compilation. And in the compiler conference called CGO, there's a C4ML workshop that also kind of trying to focus on this area. So definitely it's already starting to gain traction and becoming a field. I wouldn't claim that I invented this field, but definitely I helped to work with a lot of folks there. And I try to bring a perspective, of course, trying to learn a lot from the compiler optimizations as well as trying to bring in knowledges in machine learning and systems together. [00:23:07]Alessio: So we had George Hotz on the podcast a few episodes ago, and he had a lot to say about AMD and their software. So when you think about TVM, are you still restricted in a way by the performance of the underlying kernel, so to speak? So if your target is like a CUDA runtime, you still get better performance, no matter like TVM kind of helps you get there, but then that level you don't take care of, right? [00:23:34]Swyx: There are two parts in here, right? [00:23:35]Tianqi: So first of all, there is the lower level runtime, like CUDA runtime. And then actually for NVIDIA, a lot of the mood came from their libraries, like Cutlass, CUDN, right? Those library optimizations. And also for specialized workloads, actually you can specialize them. Because a lot of cases you'll find that if you go and do benchmarks, it's very interesting. Like two years ago, if you try to benchmark ResNet, for example, usually the NVIDIA library [00:24:04]Swyx: gives you the best performance. [00:24:06]Tianqi: It's really hard to beat them. But as soon as you start to change the model to something, maybe a bit of a variation of ResNet, not for the traditional ImageNet detections, but for latent detection and so on, there will be some room for optimization because people sometimes overfit to benchmarks. These are people who go and optimize things, right? So people overfit the benchmarks. So that's the largest barrier, like being able to get a low level kernel libraries, right? In that sense, the goal of TVM is actually we try to have a generic layer to both, of course, leverage libraries when available, but also be able to automatically generate [00:24:45]Swyx: libraries when possible. [00:24:46]Tianqi: So in that sense, we are not restricted by the libraries that they have to offer. That's why we will be able to run Apple M2 or WebGPU where there's no library available because we are kind of like automatically generating libraries. That makes it easier to support less well-supported hardware, right? For example, WebGPU is one example. From a runtime perspective, AMD, I think before their Vulkan driver was not very well supported. Recently, they are getting good. But even before that, we'll be able to support AMD through this GPU graphics backend called Vulkan, which is not as performant, but it gives you a decent portability across those [00:25:29]Swyx: hardware. [00:25:29]Alessio: And I know we got other MLC stuff to talk about, like WebLLM, but I want to wrap up on the optimization that you're doing. So there's kind of four core things, right? Kernel fusion, which we talked a bit about in the flash attention episode and the tiny grab one memory planning and loop optimization. I think those are like pretty, you know, self-explanatory. I think the one that people have the most questions, can you can you quickly explain [00:25:53]Swyx: those? [00:25:54]Tianqi: So there are kind of a different things, right? Kernel fusion means that, you know, if you have an operator like Convolutions or in the case of a transformer like MOP, you have other operators that follow that, right? You don't want to launch two GPU kernels. You want to be able to put them together in a smart way, right? And as a memory planning, it's more about, you know, hey, if you run like Python code, every time when you generate a new array, you are effectively allocating a new piece of memory, right? Of course, PyTorch and other frameworks try to optimize for you. So there is a smart memory allocator behind the scene. But actually, in a lot of cases, it's much better to statically allocate and plan everything ahead of time. And that's where like a compiler can come in. We need to, first of all, actually for language model, it's much harder because dynamic shape. So you need to be able to what we call symbolic shape tracing. So we have like a symbolic variable that tells you like the shape of the first tensor is n by 12. And the shape of the third tensor is also n by 12. Or maybe it's n times 2 by 12. Although you don't know what n is, right? But you will be able to know that relation and be able to use that to reason about like fusion and other decisions. So besides this, I think loop transformation is quite important. And it's actually non-traditional. Originally, if you simply write a code and you want to get a performance, it's very hard. For example, you know, if you write a matrix multiplier, the simplest thing you can do is you do for i, j, k, c, i, j, plus, equal, you know, a, i, k, times b, i, k. But that code is 100 times slower than the best available code that you can get. So we do a lot of transformation, like being able to take the original code, trying to put things into shared memory, and making use of tensor calls, making use of memory copies, and all this. Actually, all these things, we also realize that, you know, we cannot do all of them. So we also make the ML compilation framework as a Python package, so that people will be able to continuously improve that part of engineering in a more transparent way. So we find that's very useful, actually, for us to be able to get good performance very quickly on some of the new models. Like when Lamato came out, we'll be able to go and look at the whole, here's the bottleneck, and we can go and optimize those. [00:28:10]Alessio: And then the fourth one being weight quantization. So everybody wants to know about that. And just to give people an idea of the memory saving, if you're doing FB32, it's like four bytes per parameter. Int8 is like one byte per parameter. So you can really shrink down the memory footprint. What are some of the trade-offs there? How do you figure out what the right target is? And what are the precision trade-offs, too? [00:28:37]Tianqi: Right now, a lot of people also mostly use int4 now for language models. So that really shrinks things down a lot. And more recently, actually, we started to think that, at least in MOC, we don't want to have a strong opinion on what kind of quantization we want to bring, because there are so many researchers in the field. So what we can do is we can allow developers to customize the quantization they want, but we still bring the optimum code for them. So we are working on this item called bring your own quantization. In fact, hopefully MOC will be able to support more quantization formats. And definitely, I think there's an open field that's being explored. Can you bring more sparsities? Can you quantize activations as much as possible, and so on? And it's going to be something that's going to be relevant for quite a while. [00:29:27]Swyx: You mentioned something I wanted to double back on, which is most people use int4 for language models. This is actually not obvious to me. Are you talking about the GGML type people, or even the researchers who are training the models also using int4? [00:29:40]Tianqi: Sorry, so I'm mainly talking about inference, not training, right? So when you're doing training, of course, int4 is harder, right? Maybe you could do some form of mixed type precision for inference. I think int4 is kind of like, in a lot of cases, you will be able to get away with int4. And actually, that does bring a lot of savings in terms of the memory overhead, and so on. [00:30:09]Alessio: Yeah, that's great. Let's talk a bit about maybe the GGML, then there's Mojo. How should people think about MLC? How do all these things play together? I think GGML is focused on model level re-implementation and improvements. Mojo is a language, super sad. You're more at the compiler level. Do you all work together? Do people choose between them? [00:30:32]Tianqi: So I think in this case, I think it's great to say the ecosystem becomes so rich with so many different ways. So in our case, GGML is more like you're implementing something from scratch in C, right? So that gives you the ability to go and customize each of a particular hardware backend. But then you will need to write from CUDA kernels, and you write optimally from AMD, and so on. So the kind of engineering effort is a bit more broadened in that sense. Mojo, I have not looked at specific details yet. I think it's good to start to say, it's a language, right? I believe there will also be machine learning compilation technologies behind it. So it's good to say, interesting place in there. In the case of MLC, our case is that we do not want to have an opinion on how, where, which language people want to develop, deploy, and so on. And we also realize that actually there are two phases. We want to be able to develop and optimize your model. By optimization, I mean, really bring in the best CUDA kernels and do some of the machine learning engineering in there. And then there's a phase where you want to deploy it as a part of the app. So if you look at the space, you'll find that GGML is more like, I'm going to develop and optimize in the C language, right? And then most of the low-level languages they have. And Mojo is that you want to develop and optimize in Mojo, right? And you deploy in Mojo. In fact, that's the philosophy they want to push for. In the ML case, we find that actually if you want to develop models, the machine learning community likes Python. Python is a language that you should focus on. So in the case of MLC, we really want to be able to enable, not only be able to just define your model in Python, that's very common, right? But also do ML optimization, like engineering optimization, CUDA kernel optimization, memory planning, all those things in Python that makes you customizable and so on. But when you do deployment, we realize that people want a bit of a universal flavor. If you are a web developer, you want JavaScript, right? If you're maybe an embedded system person, maybe you would prefer C++ or C or Rust. And people sometimes do like Python in a lot of cases. So in the case of MLC, we really want to have this vision of, you optimize, build a generic optimization in Python, then you deploy that universally onto the environments that people like. [00:32:54]Swyx: That's a great perspective and comparison, I guess. One thing I wanted to make sure that we cover is that I think you are one of these emerging set of academics that also very much focus on your artifacts of delivery. Of course. Something we talked about for three years, that he was very focused on his GitHub. And obviously you treated XGBoost like a product, you know? And then now you're publishing an iPhone app. Okay. Yeah. Yeah. What is his thinking about academics getting involved in shipping products? [00:33:24]Tianqi: I think there are different ways of making impact, right? Definitely, you know, there are academics that are writing papers and building insights for people so that people can build product on top of them. In my case, I think the particular field I'm working on, machine learning systems, I feel like really we need to be able to get it to the hand of people so that really we see the problem, right? And we show that we can solve a problem. And it's a different way of making impact. And there are academics that are doing similar things. Like, you know, if you look at some of the people from Berkeley, right? A few years, they will come up with big open source projects. Certainly, I think it's just a healthy ecosystem to have different ways of making impacts. And I feel like really be able to do open source and work with open source community is really rewarding because we have a real problem to work on when we build our research. Actually, those research bring together and people will be able to make use of them. And we also start to see interesting research challenges that we wouldn't otherwise say, right, if you're just trying to do a prototype and so on. So I feel like it's something that is one interesting way of making impact, making contributions. [00:34:40]Swyx: Yeah, you definitely have a lot of impact there. And having experience publishing Mac stuff before, the Apple App Store is no joke. It is the hardest compilation, human compilation effort. So one thing that we definitely wanted to cover is running in the browser. You have a 70 billion parameter model running in the browser. That's right. Can you just talk about how? Yeah, of course. [00:35:02]Tianqi: So I think that there are a few elements that need to come in, right? First of all, you know, we do need a MacBook, the latest one, like M2 Max, because you need the memory to be big enough to cover that. So for a 70 million model, it takes you about, I think, 50 gigahertz of RAM. So the M2 Max, the upper version, will be able to run it, right? And it also leverages machine learning compilation. Again, what we are doing is the same, whether it's running on iPhone, on server cloud GPUs, on AMDs, or on MacBook, we all go through that same MOC pipeline. Of course, in certain cases, maybe we'll do a bit of customization iteration for either ones. And then it runs on the browser runtime, this package of WebLM. So that will effectively... So what we do is we will take that original model and compile to what we call WebGPU. And then the WebLM will be to pick it up. And the WebGPU is this latest GPU technology that major browsers are shipping right now. So you can get it in Chrome for them already. It allows you to be able to access your native GPUs from a browser. And then effectively, that language model is just invoking the WebGPU kernels through there. So actually, when the LATMAR2 came out, initially, we asked the question about, can you run 17 billion on a MacBook? That was the question we're asking. So first, we actually... Jin Lu, who is the engineer pushing this, he got 17 billion on a MacBook. We had a CLI version. So in MLC, you will be able to... That runs through a metal accelerator. So effectively, you use the metal programming language to get the GPU acceleration. So we find, okay, it works for the MacBook. Then we asked, we had a WebGPU backend. Why not try it there? So we just tried it out. And it's really amazing to see everything up and running. And actually, it runs smoothly in that case. So I do think there are some kind of interesting use cases already in this, because everybody has a browser. You don't need to install anything. I think it doesn't make sense yet to really run a 17 billion model on a browser, because you kind of need to be able to download the weight and so on. But I think we're getting there. Effectively, the most powerful models you will be able to run on a consumer device. It's kind of really amazing. And also, in a lot of cases, there might be use cases. For example, if I'm going to build a chatbot that I talk to it and answer questions, maybe some of the components, like the voice to text, could run on the client side. And so there are a lot of possibilities of being able to have something hybrid that contains the edge component or something that runs on a server. [00:37:47]Alessio: Do these browser models have a way for applications to hook into them? So if I'm using, say, you can use OpenAI or you can use the local model. Of course. [00:37:56]Tianqi: Right now, actually, we are building... So there's an NPM package called WebILM, right? So that you will be able to, if you want to embed it onto your web app, you will be able to directly depend on WebILM and you will be able to use it. We are also having a REST API that's OpenAI compatible. So that REST API, I think, right now, it's actually running on native backend. So that if a CUDA server is faster to run on native backend. But also we have a WebGPU version of it that you can go and run. So yeah, we do want to be able to have easier integrations with existing applications. And OpenAI API is certainly one way to do that. Yeah, this is great. [00:38:37]Swyx: I actually did not know there's an NPM package that makes it very, very easy to try out and use. I want to actually... One thing I'm unclear about is the chronology. Because as far as I know, Chrome shipped WebGPU the same time that you shipped WebILM. Okay, yeah. So did you have some kind of secret chat with Chrome? [00:38:57]Tianqi: The good news is that Chrome is doing a very good job of trying to have early release. So although the official shipment of the Chrome WebGPU is the same time as WebILM, actually, you will be able to try out WebGPU technology in Chrome. There is an unstable version called Canary. I think as early as two years ago, there was a WebGPU version. Of course, it's getting better. So we had a TVM-based WebGPU backhand two years ago. Of course, at that time, there were no language models. It was running on less interesting, well, still quite interesting models. And then this year, we really started to see it getting matured and performance keeping up. So we have a more serious push of bringing the language model compatible runtime onto the WebGPU. [00:39:45]Swyx: I think you agree that the hardest part is the model download. Has there been conversations about a one-time model download and sharing between all the apps that might use this API? That is a great point. [00:39:58]Tianqi: I think it's already supported in some sense. When we download the model, WebILM will cache it onto a special Chrome cache. So if a different web app uses the same WebILM JavaScript package, you don't need to redownload the model again. So there is already something there. But of course, you have to download the model once at least to be able to use it. [00:40:19]Swyx: Okay. One more thing just in general before we're about to zoom out to OctoAI. Just the last question is, you're not the only project working on, I guess, local models. That's right. Alternative models. There's gpt4all, there's olama that just recently came out, and there's a bunch of these. What would be your advice to them on what's a valuable problem to work on? And what is just thin wrappers around ggml? Like, what are the interesting problems in this space, basically? [00:40:45]Tianqi: I think making API better is certainly something useful, right? In general, one thing that we do try to push very hard on is this idea of easier universal deployment. So we are also looking forward to actually have more integration with MOC. That's why we're trying to build API like WebILM and other things. So we're also looking forward to collaborate with all those ecosystems and working support to bring in models more universally and be able to also keep up the best performance when possible in a more push-button way. [00:41:15]Alessio: So as we mentioned in the beginning, you're also the co-founder of Octomel. Recently, Octomel released OctoAI, which is a compute service, basically focuses on optimizing model runtimes and acceleration and compilation. What has been the evolution there? So Octo started as kind of like a traditional MLOps tool, where people were building their own models and you help them on that side. And then it seems like now most of the market is shifting to starting from pre-trained generative models. Yeah, what has been that experience for you and what you've seen the market evolve? And how did you decide to release OctoAI? [00:41:52]Tianqi: One thing that we found out is that on one hand, it's really easy to go and get something up and running, right? So if you start to consider there's so many possible availabilities and scalability issues and even integration issues since becoming kind of interesting and complicated. So we really want to make sure to help people to get that part easy, right? And now a lot of things, if we look at the customers we talk to and the market, certainly generative AI is something that is very interesting. So that is something that we really hope to help elevate. And also building on top of technology we build to enable things like portability across hardwares. And you will be able to not worry about the specific details, right? Just focus on getting the model out. We'll try to work on infrastructure and other things that helps on the other end. [00:42:45]Alessio: And when it comes to getting optimization on the runtime, I see when we run an early adopters community and most enterprises issue is how to actually run these models. Do you see that as one of the big bottlenecks now? I think a few years ago it was like, well, we don't have a lot of machine learning talent. We cannot develop our own models. Versus now it's like, there's these great models you can use, but I don't know how to run them efficiently. [00:43:12]Tianqi: That depends on how you define by running, right? On one hand, it's easy to download your MLC, like you download it, you run on a laptop, but then there's also different decisions, right? What if you are trying to serve a larger user request? What if that request changes? What if the availability of hardware changes? Right now it's really hard to get the latest hardware on media, unfortunately, because everybody's trying to work on the things using the hardware that's out there. So I think when the definition of run changes, there are a lot more questions around things. And also in a lot of cases, it's not only about running models, it's also about being able to solve problems around them. How do you manage your model locations and how do you make sure that you get your model close to your execution environment more efficiently? So definitely a lot of engineering challenges out there. That we hope to elevate, yeah. And also, if you think about our future, definitely I feel like right now the technology, given the technology and the kind of hardware availability we have today, we will need to make use of all the possible hardware available out there. That will include a mechanism for cutting down costs, bringing something to the edge and cloud in a more natural way. So I feel like still this is a very early stage of where we are, but it's already good to see a lot of interesting progress. [00:44:35]Alessio: Yeah, that's awesome. I would love, I don't know how much we're going to go in depth into it, but what does it take to actually abstract all of this from the end user? You know, like they don't need to know what GPUs you run, what cloud you're running them on. You take all of that away. What was that like as an engineering challenge? [00:44:51]Tianqi: So I think that there are engineering challenges on. In fact, first of all, you will need to be able to support all the kind of hardware backhand you have, right? On one hand, if you look at the media library, you'll find very surprisingly, not too surprisingly, most of the latest libraries works well on the latest GPU. But there are other GPUs out there in the cloud as well. So certainly being able to have know-hows and being able to do model optimization is one thing, right? Also infrastructures on being able to scale things up, locate models. And in a lot of cases, we do find that on typical models, it also requires kind of vertical iterations. So it's not about, you know, build a silver bullet and that silver bullet is going to solve all the problems. It's more about, you know, we're building a product, we'll work with the users and we find out there are interesting opportunities in a certain point. And when our engineer will go and solve that, and it will automatically reflect it in a service. [00:45:45]Swyx: Awesome. [00:45:46]Alessio: We can jump into the lightning round until, I don't know, Sean, if you have more questions or TQ, if you have more stuff you wanted to talk about that we didn't get a chance to [00:45:54]Swyx: touch on. [00:45:54]Alessio: Yeah, we have talked a lot. [00:45:55]Swyx: So, yeah. We always would like to ask, you know, do you have a commentary on other parts of AI and ML that is interesting to you? [00:46:03]Tianqi: So right now, I think one thing that we are really pushing hard for is this question about how far can we bring open source, right? I'm kind of like a hacker and I really like to put things together. So I think it's unclear in the future of what the future of AI looks like. On one hand, it could be possible that, you know, you just have a few big players, you just try to talk to those bigger language models and that can do everything, right? On the other hand, one of the things that Wailing Academic is really excited and pushing for, that's one reason why I'm pushing for MLC, is that can we build something where you have different models? You have personal models that know the best movie you like, but you also have bigger models that maybe know more, and you get those models to interact with each other, right? And be able to have a wide ecosystem of AI agents that helps each person while still being able to do things like personalization. Some of them can run locally, some of them, of course, running on a cloud, and how do they interact with each other? So I think that is a very exciting time where the future is yet undecided, but I feel like there is something we can do to shape that future as well. [00:47:18]Swyx: One more thing, which is something I'm also pursuing, which is, and this kind of goes back into predictions, but also back in your history, do you have any idea, or are you looking out for anything post-transformers as far as architecture is concerned? [00:47:32]Tianqi: I think, you know, in a lot of these cases, you can find there are already promising models for long contexts, right? There are space-based models, where like, you know, a lot of some of our colleagues from Albert, who he worked on this HIPPO models, right? And then there is an open source version called RWKV. It's like a recurrent models that allows you to summarize things. Actually, we are bringing RWKV to MOC as well, so maybe you will be able to see one of the models. [00:48:00]Swyx: We actually recorded an episode with one of the RWKV core members. It's unclear because there's no academic backing. It's just open source people. Oh, I see. So you like the merging of recurrent networks and transformers? [00:48:13]Tianqi: I do love to see this model space continue growing, right? And I feel like in a lot of cases, it's just that attention mechanism is getting changed in some sense. So I feel like definitely there are still a lot of things to be explored here. And that is also one reason why we want to keep pushing machine learning compilation, because one of the things we are trying to push in was productivity. So that for machine learning engineering, so that as soon as some of the models came out, we will be able to, you know, empower them onto those environments that's out there. [00:48:43]Swyx: Yeah, it's a really good mission. Okay. Very excited to see that RWKV and state space model stuff. I'm hearing increasing chatter about that stuff. Okay. Lightning round, as always fun. I'll take the first one. Acceleration. What has already happened in AI that you thought would take much longer? [00:48:59]Tianqi: Emergence of more like a conversation chatbot ability is something that kind of surprised me before it came out. This is like one piece that I feel originally I thought would take much longer, but yeah, [00:49:11]Swyx: it happens. And it's funny because like the original, like Eliza chatbot was something that goes all the way back in time. Right. And then we just suddenly came back again. Yeah. [00:49:21]Tianqi: It's always too interesting to think about, but with a kind of a different technology [00:49:25]Swyx: in some sense. [00:49:25]Alessio: What about the most interesting unsolved question in AI? [00:49:31]Swyx: That's a hard one, right? [00:49:32]Tianqi: So I can tell you like what kind of I'm excited about. So, so I think that I have always been excited about this idea of continuous learning and lifelong learning in some sense. So how AI continues to evolve with the knowledges that have been there. It seems that we're getting much closer with all those recent technologies. So being able to develop systems, support, and be able to think about how AI continues to evolve is something that I'm really excited about. [00:50:01]Swyx: So specifically, just to double click on this, are you talking about continuous training? That's like a training. [00:50:06]Tianqi: I feel like, you know, training adaptation and it's all similar things, right? You want to think about entire life cycle, right? The life cycle of collecting data, training, fine tuning, and maybe have your local context that getting continuously curated and feed onto models. So I think all these things are interesting and relevant in here. [00:50:29]Swyx: Yeah. I think this is something that people are really asking, you know, right now we have moved a lot into the sort of pre-training phase and off the shelf, you know, the model downloads and stuff like that, which seems very counterintuitive compared to the continuous training paradigm that people want. So I guess the last question would be for takeaways. What's basically one message that you want every listener, every person to remember today? [00:50:54]Tianqi: I think it's getting more obvious now, but I think one of the things that I always want to mention in my talks is that, you know, when you're thinking about AI applications, originally people think about algorithms a lot more, right? Our algorithm models, they are still very important. But usually when you build AI applications, it takes, you know, both algorithm side, the system optimizations, and the data curations, right? So it takes a connection of so many facades to be able to bring together an AI system and be able to look at it from that holistic perspective is really useful when we start to build modern applications. I think it's going to continue going to be more important in the future. [00:51:35]Swyx: Yeah. Thank you for showing the way on this. And honestly, just making things possible that I thought would take a lot longer. So thanks for everything you've done. [00:51:46]Tianqi: Thank you for having me. [00:51:47]Swyx: Yeah. [00:51:47]Alessio: Thanks for coming on TQ. [00:51:49]Swyx: Have a good one. [00:51:49] This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Discussion (0)
Starting point is 00:00:09 Hey, everyone. Welcome to the Latenspace podcast. This is Alessio, partner and CTO and resident and decibel partners. And I'm joined by my co-host as Swix, writer and editor of Laton Space. Hey, and we are here with TNC Chen, our TQ, as people call them, who is assistant professor in ML Computer Science at CMU, Carnegie Mellon University, also helping to run Catalyst Group, also chief technologists of AutoML. You wear many hats. Are those, you know, your, you're primary identities these days? Of course, of course. I'm also, you know, very enthusiastic open source, so I'm also a VPN person member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far. Yeah. You also created Apache TVM, XG Boost, and MXNet. And we can
Starting point is 00:00:58 cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side? Let me say, yeah, so normally I really love coding, even though I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks, to draw down the design diagrams. And that's sketchbooks. I keep sketching over the years, and now I have like three or four of them. And it's kind of usually a fun experience of sinking the design through
Starting point is 00:01:32 and also seeing how open source project evolves and also looking back at the sketches that we had in the past. to say, you know, all these ideas really turn into code nowadays. How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, it would be a very accomplished engineer. Like, you built like three of these. What's that process like for you? Like, is the sketchbook like the start and then you think about the code or like?
Starting point is 00:01:59 Yeah. So usually I start sketching on high-level architectures, right? And also the project that works for over years, we also start to think about, you know, new direction. like of course, generally via language model comes in how it's going to evolve. So normally, I would say it takes like a one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out and then gives a more like a high-level architectural guide for some of the future items. Have you ever published this sketchbooks?
Starting point is 00:02:30 Because I think people would be very interested at least on a historical basis. Like, this is the time where XG Bruce is born, you know? Yeah, not really. I started sketching like after XG books. So that's a kind of missing piece. But a lot of design details in TVM are actually part of the books that I try to keep a record of. Yeah, we'll try to publish them and publish something in the show notes. Maybe you can grab a little snapshot for visual aid.
Starting point is 00:02:56 Sounds good, yeah. And yeah, talking about X-Ruage Boost, so a lot of people in the audience might know it's a grading boosting library, probably the most popular out there. And it became super popular because many people started using them in like a machine learning. competitions. And I think there's like a whole Wikipedia page of like all state of the art models. They use XG Boots and like it's a really long list. When you were working on it, so we just had Triedao, who's the creator of a flash attention on the podcast. And I asked him this question. It's like when you were building flash attention, did you know that like
Starting point is 00:03:27 almost any transform race model will use it? And so I asked this thing question to you, when you were coming up with XG Boost, like could you predict it would be so popular or like what was the creation process and when you published there, what did you expect? We have no idea. Like, actually, the original reason that we built a library is that at that time, Deep learning just came out. Like, that was the time where Alex Nad just came out. And one of the ambition that myself and my advice that calls Kastrian,
Starting point is 00:03:57 then is we want to think about, you know, to test the hypothesis. Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, supportive vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually one of the key characteristics of deep learnings that are taking a lot of data, right? So we will be able to get the same amount of performance.
Starting point is 00:04:26 That's a hypothesis of course. If you look at now, right, that's a wrong hypothesis. But as a byproduct, what we find out is that, you know, most of the gradient boosting library all there is not efficient. enough for us to test that hypothesis. So I haven't to had quite a bit of experience of, you know, parts of building green boosting trees and their variants. So effective active action was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges.
Starting point is 00:04:56 Like I worked on KDD Cup and then Kagle can become bigger. So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding. that tends to be like a very good decision, right, to be effective. Usually, when we build it, we feel like, you know, maybe a command line interface is okay. And then we have a Python binding. We have R bindings. And then it realized, you know, it starts to getting interesting.
Starting point is 00:05:23 People start to contributing different perspectives, like visualization and so on. So we start to, you know, push a bit more onto, you know, building distributed support, to make sure that works on, animal platform and so on. And even at that time point, when I talked to Carlos, my advisor later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interesting, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then.
Starting point is 00:05:51 And it turns out, you know, actually, we are both wrong in some sense and deep neural network was the king in the hill, but at least the gradient boosting direction getting to something fruitful. Interesting. I'm always curious when it comes to these improvements like what's the design process in terms of like coming up with it and how much of it is a collaborative with like other people that you're working with versus like trying to be you know obviously in academia is like very paper driven kind of research driven. I would say the actually boost improvement at that time point was more on like you know I'm trying to figure out right but it's combining lessons. Before that I did work on some of the other libraries on like a matrix factor at the issue.
Starting point is 00:06:35 That was like my first open source experience. Nobody didn't know about it. Because you'll find, likely if you go and try to search for the package of SVDVV sure, you'll find some like an SVN repo that's somewhere. It's actually being used for some of recommender system packages. So I'm trying to apply some of previous lessons there and trying to combine them. The later projects like MXNet and then TVM is much, much more collaborative in a sense that. But of course, XG boosts as it becomes.
Starting point is 00:07:05 bigger, right? So when we started that project, it's myself, and then we have, it's really amazing to see people coming in, like we have Michael, who was a lawyer, and now he works on the AI space as well, on contributing visualizations, and then we have people from our community contributing in different sense. So actually, it's even today, right, it's community of commuters driving the project, so it's definitely something collaborative and moving forward on getting some of the same continuous improved for our community. Let's talk a bit about TVM too
Starting point is 00:07:39 because we got a lot of things to run through in this episode. I would flag that at some point I'd love to talk about this comparison between sort of XG boosts or sort of tree-based type AI or machine learning compared to deep learning because I think there is a lot of interest around I guess merging the two disciplines, right?
Starting point is 00:07:58 And we can talk more about that. I don't know where to insert that, by the way. So we can come back to it later. Yeah, actually, when I said, you know, when we test the hypothesis, the hypothesis is wrong, it's kind of, I would say it's partially long. Because the hypothesis we want to test now is, you know, can you run tree-based models on image classification tasks? We're differently certainly the no-brainer right now today, right? But if you try to run it on tabular data, still, like, you'll find that most people opt for tree-based models. And that's the reason for that, in a sense, that, you know, when you are looking at tree-based models, the decision boundaries are naturally rules that you're looking at tree-based models.
Starting point is 00:08:32 that you're looking at, right? And they also have nice properties like, you know, being able to be agnostic to scale of the input and be able to automatic a compose feature together. And I know there are attempts. There are attempts on, you know, building neural network models that works for tablet data,
Starting point is 00:08:49 and I also sometimes follow them. I do feel like it's good to have a bit of diversity in the modeling space. Actually, when we're building TVM, we build cost models for the programs. Actually, we are using XG booths for that as well. I still think tree-based model is going to be quite relevant because, first, for all, it's really to get it work out of box.
Starting point is 00:09:11 And also, you will be able to get a bit of interoperability and control, like, in a monotonicity, and so on. So, yes, it's still going to be relevant. I also sometimes keep coming back to think about are there possible improvements that we can build on top of these models? And definitely I feel like it's a space that can have some of the potential in the future. Are there any current projects that you will call out as promising in terms of merging the two, I guess, directions? I think there are projects that tries to bring like a transformer type model for tabular data, right?
Starting point is 00:09:48 I don't remember specifics of them. But I think even nowadays, if you look at the people's, what people are using, like tree-based models, do one of their two? kit, right? So I think maybe eventually it's not even a replacement. It will be just an ensemble of models that you can call. Yeah. Perfect. Next step, about three years after XG boost, you built this thing called TVM, which is now a very popular compiler framework for models. Let's talk about, so this came out about at the same time as Onyx. So I think it would be great if you could maybe give a little bit of an overview of like how to do things work together, because it's kind of like the model, then goes to Onyx, then goes to the, the,
Starting point is 00:10:28 CBM, but I think a lot of people don't understand the nuances. I can get a bit of backstory on that. So actually, that's kind of an Asian history. Before AxiBoo's, I worked on deep learning for like two years or three years. I get a master before I start my PhD. And during my master, my Csars focused on, you know, applying convolutional, restrictive boats of my machine for ImageNet classification. That is the thing I'm working on.
Starting point is 00:10:56 And that was before Alex Nett moment. So effectively, I had to handcraft Nvidia Kuda kernels on, I think, the GTX 2070 card. I have a 2.27 card. It took me about six months to get one model working, and eventually that model is not so good. And we should have picked a better model. But that was like an Asian history that when I really got me into this deep learning field. And of course, eventually we find, you know, it didn't work out. So in my master, in that I ended up working on recommend assistant, which gets me a paper,
Starting point is 00:11:34 and I applied it and get a PhD. But I always want to come back to work on a deep learning field. So after actually boost, I think I started to work with some of folks on this particular MXNet. At that time, it was like the frameworks at Cafe, Ciano Torch, Pite Torch, haven't yet come out. and we're really working hard to optimize for performance on GPUs. At that time, I find, you know, it's really hard to, even for Nvidia GPU, right? It took me six months. And then it's amazing to say, like, you know, on different hardware,
Starting point is 00:12:09 how hard it is to go and optimize code for the platforms that are interesting. So that gets me thinking, like, you know, can we build something more generic and automatic? so that, you know, I don't need an entire team of so many people to go and build those frameworks. So that's the motivation of starting working on TVM. Their risk really to lower the bar of machine learning engineering needed to support depleting models on the platforms that were interested in. Yeah, so I think it started a bit earlier than on XB, but why it's got announced, I think, is in a similar time period, right, at that time. So yeah, so overall, how it works is that, you know, TVM, you will be able to take a subset of machine learning program that are represented in what we call a computational graph. Nowadays, we can also represent a loop level program.
Starting point is 00:13:03 We ingest from your machine learning models. Usually, you know, you have model formats onics, right? Or in Pythorch, they have like FX tracer that allows you to chase the FX graph. And then it goes through TVM. We also realize that, well, yes, it's need to be more customizable. so it will be able to perform some of the compilation optimizations, like fusion operator together, doing smart memory plannings, and more importantly, generate low-level code,
Starting point is 00:13:28 so that works for Nvidia and also is portable to other GPU backhands, even non-GPU backgrounds out there. So that's the project actually has been my primary focus over the past few years, and it's great to say how I started from, where I think we are the early, very early initiator of machine land compilation. I remember there was a visiting day. One of the students asked me, are you still working on Deep Plains? I tell them that, you know, I'm working on ML compilation.
Starting point is 00:13:57 And they said, okay, compilation, that sounds very ancient. You know, it sounds like a very old field. And why are you working on this? And now it's starting to getting more transactions. Like if you say Torch Compile and other things, I'm really glad to see this field starting picking up. And also we also continue innovating here. I think the other thing that I notice is, you know,
Starting point is 00:14:20 it's kind of like a big jump in terms of area of focus to go from like X-G-Bose to like TVM. You know, it's kind of like a different part of the stack. Why did you decide to do that? And I think the other thing about compiling to like different GPUs and eventually CPUs too, did you already see kind of like some of the strain that models could have, like just being focused on one
Starting point is 00:14:43 runtime, you normally being on Kudau on like that and yeah, how much of that went into it. I think it's less about trying to get impact more about, you know, wanted to have fun. Like I like to hack code. I had really fun hacking Kuda code. And of course, being able to generate Kudak code is cool, right? But, you know, now after being able to generate Krua code, okay, by the way, you can do it on other platforms. Isn't that amazing? So it's more of that attitude to get me started on this.
Starting point is 00:15:13 And also, I think when we look at different researchers, myself is more like, I think, a problem-solver type. So I like to look at a problem and say, you know, okay, what kind of tools we need to solve that problem? So regardless, you know, it could be building better models. When we build extra boost, we build certain regulations into it so that it's more robust. It also means building system optimizations, writing low-level code, maybe, you know, trying to write assembly and build compilers and so on.
Starting point is 00:15:42 So as long as they solve the problem, definitely go and try to do them together. And I also say it's a common trend right now. Like if you want to be able to solve machine learning problems, it's no longer at Agrisome later. You kind of need to solve it from both Aggerson data and systems angle. And this entire field of machine learning system, I think it's kind of emerging. And that's not a conference around it. And it's really good to see a lot more people are starting to look into this. Are you talking about ICML?
Starting point is 00:16:12 or something else? So machine learning and systems, right? So not only machine learning, but machine learning and system. So that's a conference called MLSys. It's a definitely smaller community than SMO, but I think it's also emerging and growing community where people are talking about what's implications are building systems for machine learning,
Starting point is 00:16:31 and how do you go and optimize things around that code design models and systems together. Yeah, and you were area chair for ICML and your reps as well. So you've just had a lot of conference and community organization experience. Is that also like an important part of your work? Well, it's kind of expected for academic, if I hold an academic job, right?
Starting point is 00:16:50 You need to do services for a community. Okay, great. Your most recent venture in MLSys is going to the phone with MLCL. You announced this in April. I have it on my phone. It's great. I'm running Lama 2, Vikunia. I don't know what other models you offer.
Starting point is 00:17:10 But maybe we can just kind of describe. your journey into MLC. And I don't know how this coincides with your work at CMU. Is that some kind of outgrowth? I think it's more like a focused effort that we want in the area of machine learning compilation. So it's kind of related to what we built in TVM. So when we built TVM was five years ago, right?
Starting point is 00:17:31 And a lot of things happen. We build end-to-end machine learning compiler that works, the first one that works. But then we capture a lot of lessons there. So then we are building a second. iteration called TVM Unity that allows us to be able to, you know, allow ML engineers to be able to quickly capture the new model and our demand and building optimizations for them. And the MOC is kind of like an MOC is more like a vertical-driven organization that we go and build tutorials and go and build projects like LIM to solutions so that to really show like,
Starting point is 00:18:06 okay, you can take machine link compilation technology and apply it and bring some. something fun forward. Yeah. So, yes, it runs all phones, which is really cool. But the goal here is not only making it around our phones, right? The goal is making it deploy universally. So we do run on Apple M2 Max, the $17 billion models. Actually, on a single-batch inference, more recently on Kuda,
Starting point is 00:18:30 we get, I think, the most best performance we can get out there already on the 4-bit inference. And actually, as I alluded earlier before the podcast, we just have. had a result on AMD. And on a single batch, actually, we can get the latest AMD GPU, this is the consumer card. It can get to about 80% of the 4019, so Vedia's best consumer card out there. So it's not yet, like, you know, on par, but thinking about how our diversity and what
Starting point is 00:19:03 you can enable. And the previous things you can get on that card is really amazing that what you can do with this kind of technology. So one thing I'm a little bit confused by is that most of these models are in Pi Torch, but you're running this inside of TVM. I don't know. Was there any fundamental change that you needed to do? Yeah. Was this basically the fundamental design of TVM? So the idea is that, of course, it comes back to program representation.
Starting point is 00:19:29 So effectively, TVM have this program representation called TVM script that contains more like computational graph and operational. representation. So yes, initially we do need to take a bit of effort of bringing those models onto the program representation that TVN support. Usually there are a mix of ways, depending on color model you're looking at. For example, for vision models and stable diffusion models, usually we can just do tracing that takes Pytosh model on the TBM. That part is still being robustified so that we can bring a model in. On language model tasks, actually what we do is we directly build some of the model constructors and try to directly map from HuggingFace models. The goal is, you know, if you have a HangingFase configuration, we will be able to
Starting point is 00:20:16 bring that in and apply optimization on them. So one fun thing about model compilation is that, you know, your optimization don't happen only at a soft language, right? If you write a Pytorch code, you just go and try to, you know, use better fuel operator at a source code level. Torch Compile might help you do a bit of things in there. In most of the models, compilations not only happens at the beginning stage, but we also apply generic transformations in between, also through a Python API, so you can tweak some of that. So that part of optimization helps a lot of uplifting in getting both performers and also portability on the environment. And another thing that we do have is what we call universal deployment. So if you get
Starting point is 00:21:00 the MO program into this TVM script format, where there are functions, that takes in tensor and output tensor, we will be able to have a way to compile so they will be able to load the function in any of the language runtime that TVM support. So if, for example, you could load it in JavaScript and that's a JavaScript function that you can take in tensors and output tensors
Starting point is 00:21:24 if you're loading Python, of course, and C++P and Java. So the goal there is really bring the ML model to the language that people care about and be able to run them on a platform. they like. It strikes me that, so I've talked to a lot of compiler people,
Starting point is 00:21:41 but you don't have a traditional compiler background. You're inventing your own discipline called machine learning compilation, MLC. Do you think that this will be a bigger field going forward? First of all, I do work with people working on compilation as well. So there are, you know,
Starting point is 00:22:00 we're also taking inspirations from a lot of early innovations in the field, like for example, TV, Usually, we take a lot of inspirations from Halite, which is just like image processing compiler. And of course, since then, we have evolved quite a bit so that to focus on the machine learning related compilations.
Starting point is 00:22:19 If you look at some of a conference publications, you'll find the machine learning compilation is already kind of a subfield. So if you look at papers in both machine learning venues, the MLSS conference, of course, and also system venues. Every year there will be papers around machine learning compilation. And in the compiler conference called CGO, there's a C4MO workshop that also kind of trying to focus on this area. So definitely it's already starting to gain interaction and becoming a field. I wouldn't claim that I invented this field, but definitely, you know, I helped to work with a lot of folks there.
Starting point is 00:22:54 And I try to bring a perspective. Of course, you know, trying to learn a lot from the compiler optimizations as well as trying to bring in knowledge in, you know, machine learning. and assistance together. So we had George Hots on the podcast a few episodes ago, and he had a lot to say about AMD and their software. So when you think about TVM, are you still restricted in a way by the performance of the underlying kernel, so to speak?
Starting point is 00:23:23 So like if your target is like a Kuta runtime, you still get better performance, no matter. Like TVM kind of helps you get there, but then that level you don't take care of, right? There are two parts in here, right? So first for there is the lower level round time, like Kuda round time. And then actually for Nvidia, a lot of the mood came from their libraries, like Kutla, CUDN, right, those library optimizations.
Starting point is 00:23:49 And also for specialized workloads, actually, you can specialize them. Because a lot of cases, you'll find that if you're going to do benchmarks, it's very interesting. Like two years ago, if you tried a benchmark Rastnet, for example, usually the Nvidia a library, it will give you the best performance. It's really hard to beat them. But as soon as you start to change the model to something, you know, maybe a bit of a variation of RASNET, not for the traditional image net detections,
Starting point is 00:24:18 but for lane detection and so on, there will be some room for optimization because people sometimes overfit two benchmarks. These are people go and optimize things, right? So the people overfit benchmarks. So that's the largest barrier, like being able to get a low-level kernel libraries, In that sense, the goal of TVM is actually we try to have a generic layer to both, of course, leverage libraries when available,
Starting point is 00:24:42 but also being able to automatically generate libraries when possible. So in that sense, we are not restricted by the libraries that they have to offer. That's why, if I'm able to run Apple M2, a web GPU where there's no library available, because we are kind of like automatic generating libraries. That makes easier to support less well-supported hardware. Like Fempovon GPU is one example from a runtime perspective. AMD, I think before their Locum driver was not very well supported. Recently, they are getting good.
Starting point is 00:25:17 But even before that, we would be able to support AMD through this GPU graphics back and called Vulcan, which is not as performance, but gives you that central portability across those hours. And I know we got other MLC stuff to talk about like WebLLM, but I want to wrap up on the optimization that you're doing. So there's kind of four core things, right, Colonel Fusion, which we talked a bit about in the flesh attention episode and the tiny grab one. Memory planning and loop optimization, I think those are like pretty, you know, self-explanatory.
Starting point is 00:25:49 I think the one that people have the most questions about is like, can you quickly explain those? Yeah, go for it. So there are kind of a different thing. kernel fusion means that if you have an operator like convolutions or in the case of transformer like an MOP, you have other operators follow that. You don't want to launch two GPU kernels. You want to be able to put them together in a smart way, right? And the memory planning is more about, you know, hey, if you run like Python code, right, every time when you generate a new array, you are effectively allocated a new piece of memory, right? Of course, Python and other framework try to optimize for you.
Starting point is 00:26:28 So there is a smart memory allocator behind the same. But actually, in a lot of cases, it's much better to statically allocate and plan everything ahead of time. And that's where, like, a compiler can come in. We need to, first of all, actually, for language model, it's much harder because dynamic shape. So you need to be able to what we call symbolic shape chasing. So we have like a symbolic variable that tells you,
Starting point is 00:26:49 like, the shape of the first tensor is N by 12. And the shape of the third tensor is also N by 12. Or maybe it's n times 2 by. 12. Although you don't know what N is, right, but you will be able to know that relation and be able to use that to a reason about like fusion and the decisions. So besides this, I think loop transformation is quite important. And it's actually non-traditional. Like originally, if you simply write a code and you want to get a performance, it's very hard. If you know, if you write matrix smart plan, right, simplest thing you can do is you do for IJK,
Starting point is 00:27:23 C-I-J plus equal, you know, A-I-K times B-I-K. But that, code is 100 times slower than best available code that you can get. So we do a lot of transformation, like being able to take that original code, trying to put things through shared memory, and making use of Tesla calls, making use of memory copies. And all these, actually,
Starting point is 00:27:42 all these things, we also realize that we cannot do all of them. So we also make the ML compilation framework as a Python package, so that people will be able to continuously improve that part of engineering in a more
Starting point is 00:27:58 transparent way. So we find that's very useful actually for us to be able to get good performance very quickly on some of the new models. Like when Lama took out, we'll be able to go and look at the whole here's the bottom bank and we can go and optimize those. And then the board one being a weight quantization. So everybody wants to know about that. And just to give people like an idea of like the memory saving, you know, if you're doing like AFP 32 is like four bytes per parameter, and A, it is like one byte per parameter. So you can like really shrink down the memory footprint. What are some of the trade-offs there? How do you figure out
Starting point is 00:28:32 what the right target is? And what are the precision trade-offs too? Right now, a lot of people, like, also, we also mostly use int four now for language models. So that really shrinks things down a lot, right? And more recently,
Starting point is 00:28:48 actually, we started to think that at least in MLC, we don't want to have a strong opinion on what kind of quotation we want to bring because there are so many research in the field. So what we can do we can allow developers to customize the correlation they want, but we still bring the optimum code for them. So we're working on this item called bring your own quantization,
Starting point is 00:29:08 effectively to be able to hopefully, you know, MOC will be able to support more coordination format, and definitely I think there's an open field that's being explored. Can you bring more sparsities, can you quantify the activations as much as possible and so on, and it's going to be something that's going to be relevant for quite a while. You mentioned something I wanted to double back on, which is most people use int4 through language models. This is actually not obvious to me. Are you talking about the GML type people or even the researchers who are training the models also using it for? Sorry. So I'm mainly talking about inference, not training. So when I'm doing like a training, of course,
Starting point is 00:29:47 int four is harder. Maybe you could do some form of mixed type precision for for inference. I think int four is is kind of like a lot of cases you will be able to get away within four in a lot of cases. And actually, that does bring a lot of savings in terms of like the memory overhead and so on. Yeah, that's great. Let's talk a bit about maybe the, yeah, GGML, then there's like Mojo. How should people think about MLC? Like, how do all these things play together? You know, I think GGML is focused on kind of like model level re-implementation and like improvements,
Starting point is 00:30:24 mojo is like a language, super sad. You're more at the compiler level. Do you all work together? Do people choose between them? So I think in this case, right, I think it's great to say the ecosystem becomes so rich with so many different ways. So in our case, I would say GGMO is more like,
Starting point is 00:30:42 you're implementing something from scratch in C, right? So that gives you ability to go and customize each of a particular hardware backend. But then you will need to write from a kuda kernels and you write the optimal information from AMD and so on. So the kind of engineering effort is a bit more broadened in that sense. Module, I've not looked at specific details yet. I think it's good to start to say, you know, it's a language, right?
Starting point is 00:31:09 I believe there will also be machine and compilation technologies behind it. So it's good to say, you know, interesting place in there. In the case of MLC, our case is that we do not want to have an opinion on how where, which language people want to develop, deploy things on, and so on. And we also realize that actually, there are two phases. We want to be able to develop and optimize your model.
Starting point is 00:31:32 By optimization, I mean, you know, really bring best Kudak current into some of a machine learning engineering in there. And then that's a phase where you want to deploy it as a part of app. So if you look at the space, you'll find that GMO is more like, you know, I'm going to develop and optimize in the C language, right, and the most of a low-level language that you have. And module is that you want to develop and optimize in module, right? And you deploy in module.
Starting point is 00:31:55 In fact, that's the philosophy they want to push for. In the case, we find that actually, if you want to develop models, machinery community likes Python. Python is a language that you should focus on. So in the case of MLC, we really want to be able to enable, not only be able to, you know, just define your model in Python. That's very common, right? But also do ML optimization, like engineering optimized,
Starting point is 00:32:18 Kuda, kernel optimization, memory planning, all those things in Python that makes you customizable and so on. But when you do deployment, we realize that people want a bit of universal flavor. If you are web developer, you want JavaScript, right? If you're maybe embedded system person, maybe you would like prefer C++ or C or Rust. And people sometimes do like Python in a lot of cases. So in the case of MOC, we really want to have this vision of, you know, you optimize, build a generic optimizations in Python,
Starting point is 00:32:48 And then you deploy that universally onto the environments that people like. That's a great perspective and comparison, I guess. One thing I wanted to make sure that we cover is that I think you are one of these emerging set of academics that also very much focused on your artifacts of delivery. Of course. Something we talked about to three years that he was very focused on his GitHub. And obviously, you treated XG boost like a product, you know. And then now you're publishing an iPhone app.
Starting point is 00:33:19 Okay, yeah, yeah. What is this thinking about academics getting involved in shipping products? I think there are different ways of making impact, right? Definitely, you know, there are academics that are writing papers and building insights for people so that people can build product on top of them. In my case, I think the particular field I'm working on machine learning systems, I feel like really we need to be able to get it to the hand of people so that really we see the problem, right? And we show that we can solve a problem. And it's a different way, it's a different
Starting point is 00:33:51 way of making impact. And there are academics that are doing similar things. Like, you know, if you look at some of the people from Berkeley, right, a few years, they will come up with big open source projects. Certainly, you know, I think it's just a healthy ecosystem to have different ways of making impacts. And I feel like really be able to do open source and work with open source community is really rewarding because we have a real problem to work on when we build our research, actually those research bring together and people will be able to make use of them. And we also start to see interesting research challenges that we wouldn't otherwise say, right, if we're just trying to do a prototype and so on. So I feel like it's something that
Starting point is 00:34:37 is one interesting way of making impact, make contributions. Yeah, you definitely have a lot of impact there. And having experience publishing Mac stuff before. The Apple App Store is no joke. It is the hardest compilation, human compilation effort. So one thing that we definitely wanted to cover is running in the browser. You have a 70 billion parameter model running in the browser. Can you just talk about how? Yeah, of course. I think there are a few elements that need coming. First of all, we do need a MacBook, the latest one, like M2MX, because you need the memory to be big enough to. cover that. So for a 17-million model, it takes you about, I think, 50 gigas of RAM. So the M2
Starting point is 00:35:22 max, the upper version, will be able to run it. And it also leverages machine learning competition. Again, what we are doing is the same. Whether it's running on iPhone, on server-class GPUs, on AMDs, or on MacBook, we all go through that same MOC pipeline. Of course, in certain cases, maybe we'll do a bit of customization iteration for either ones. And then it runs on the browser runtime that's this package of WebLM so that will effectively. So what we do is we will take that original model and compare it to what we call WebGPU. And then the WebI.M will be to pick it up. And the WebGPU is this latest GPU technology that major browsers are shipping right now. So you can get it in Chrome for them already.
Starting point is 00:36:07 That allows you to be able to access your native GPUs from a browser. And then effectively that language model is just invoking the WebGPU kernels through there. So we actually, when the Lama 2 came out, initially we asked a question about, you know, can you run $17 billion on a MacBook? That's what's the question we're asking. So first, we actually,
Starting point is 00:36:27 Jin Lu, who is the engineer, pushing this, he gets $17 billion on MacBook. We had a CLI version. So in the MOC, you will be able to run through metal accelerator. So effectively, you use the metal programming language to get
Starting point is 00:36:42 the GPU acceleration. So we find, okay, it works for the MacBook. Then we asked, you know, we had a web GPU back and why not try it there? So we just tried it out. And it's really amazing to say like, you know, everything up and running and actually it runs smoothly in that
Starting point is 00:36:58 case. So I do think there are some kind of interesting use cases already in this, right, because everybody have a browser. You don't need to install anything. I think it doesn't make sense yet to really run a 17 billion model on a browser because you kind of need to be able to download the weight and so on.
Starting point is 00:37:14 But I think we're getting there. Like, effectively, the most powerful models, you will be able to run on a consumer device. It's kind of really amazing. And also, in a lot of cases, there might be use cases, for example, if I'm going to build a chatboard that I talk to it and answers questions, maybe some of the component,
Starting point is 00:37:33 like the voice to text could run on the client side. And so there are a lot of possibilities of being able to have something hybrid. like that contains the edge component and something that runs on our server. Do these browser models have done a way for applications to hook into them? So if I'm using, say, you know,
Starting point is 00:37:52 you can use open AI or like you can use the local model. Of course. So right now, actually, we are building so there's a MPM package called WebIOLM, right? So that, you know, you will be able to, if only embedding onto your web app, you will be able to directly depend on Webim and it will be able to use it.
Starting point is 00:38:10 We are also having a REST API that's open-air compatible. So that REST API, I think right now, it's actually running on native backend. So if a Kudar server is faster to run on native backend, but also we have a web GPU version of it that you can go and run. So yeah, so we do want to be able to have easier integrations with existing applications, and open-A API is certainly one way to do that. Yeah, this is great. I actually did not know there's an NPM package that makes.
Starting point is 00:38:40 it very, very easy to try out and use. I want to actually, one thing I'm unclear about is the chronology, because as far as I know, Chrome shipped WebGPU the same time that you shipped WebLM. Okay, yeah. So did you have some kind of secret chat with Chrome? The good news is that Chrome is doing a very good job, I'm trying to have early release. So although the official shipment of the Chrome WebGPU is the same time of WebM, right,
Starting point is 00:39:08 actually you will be able to try web GPU technology in Chrome. They are unstable version called Canary. I think as early as two years ago, that's a WebGPO version. Of course, it's getting better. So we had TVM-based WebGPU backhand two years ago. Of course, at that time, there's no language models.
Starting point is 00:39:28 It's running on less interesting, well, still quite interesting models. And then this year, we really started to say it's getting matured and performance keeping up. So we have a more. serious push or bringing the language model compatible runtime onto the web GPU. I think you agree that the hardest part, or is this the model download, has there been conversations about a one-time model download and sharing between all the apps that might use
Starting point is 00:39:55 this API? That is a great point. I think it's already supported kind of already in some times that, you know, when we download the model, WebLM will cache it onto a special Chrome cache. So if a different web app Use the same web app JavaScript package You don't need to read download the model again So there's already something there Right but of course you have to download model once
Starting point is 00:40:17 At least to be able to use it Yeah Okay one more thing just in general before We're about to zoom out to OctoAI The last question is You're not the only project Working on I guess local models That's right
Starting point is 00:40:28 Alternative models There's GPT for All There's Olamma that this recently came out And there's a bunch of these What would be your advice to them on what's a valuable problem to work on and what is just thin wrappers around GGML. Like, what is
Starting point is 00:40:42 what are the interesting problems in this space basically? I think making API better is certainly something useful, right? In general, one thing that we do try to push very hard on is this idea of easier universal deployment. So we are also looking forward to actually have more integration with MOC.
Starting point is 00:40:58 That's why we're trying to build API like WebRAM and other things. So we're also looking forward to collaborate with all those ecosystems and working support. to bring in models more universally and be able to also, you know, keep up the best performance when possible here, more push-button way. So as we mentioned in the beginning, you were also the co-founder of Octomel, recently Octomel released Octo AI, which is a compute service.
Starting point is 00:41:23 Basically focuses on optimizing model runtimes and acceleration and compilation. What has been the evolution there? So Octo started as kind of like a traditional MLOps tool, where like people were building their own models and, like you help them on that side. And then it seems like now most of the market is shifting to starting from like pre-trained generative models. Yeah, what has been that experience for you and like what you've seen the market evolve
Starting point is 00:41:49 and how did you decide to release OctoAI? One thing that we find out is that on one hand is it's really easy to go and get something and running. So if you start to consider there's so many possible availability and scalability issues and even integration issues since becoming kind of interesting and complicated. So we really want to make sure to help people to get that part easy. And now a lot of things, if we look at the customers we talk to and the market, certainly generally we are something that is very interesting.
Starting point is 00:42:21 So that is something that we really hope to help elevate. And also building on top of technology, we build to enable, since, you know, like a portability across towers, then you will be able to not worry about this specific details, right? Just focused on getting the model out. We'll try to work on infrastructure and other things that helps on the other end. And when it comes to getting optimization on the runtime, I see when I, we run like an early adopters community and like most enterprises
Starting point is 00:42:54 issue is like how to actually run these models, you know, do you see that as like one of the big bottle next now? I think like a few years ago it was like, well, we don't have a lot of like machine learning talent. We're going to develop our own models, you know, versus now it's like there's these great models you can use. But like I don't know how to run them efficiently. That depends on how you define by round it. It's one hand, it's easy to download even mostly like you download it. You run on a laptop.
Starting point is 00:43:22 But then there's also different decisions. What if you are trying to serve or larger user requirements? What if that request changes? What if the availability of hardware changes? Like right now it's really hard to get the latest hardware on media, unfortunately, because everybody's trying to working on this thing, using the hardware that's out there. So I think they are kind of like when the definition of run changes, there are a lot more questions around things.
Starting point is 00:43:50 And also in a lot of cases, it's not only about running monos. It's also about being to solve a problem around them. Like how do you manage your model locations? And how do you make sure that you get your model close to your exclusion environment more efficiently? So definitely a lot of engineering challenges out there that we hope to elevate. And also, you know, if you think about our future, definitely I feel like right now the technology, given the technology and kind of hardware availability we have today, we will need to make use of all the possible hardware available out there.
Starting point is 00:44:20 You know, that would include mechanisms to cutting non-cost, bringing something to an adjunct, cloud in a more natural way. So I feel like this is a very early stage of where we are, but it's already good to say a lot of interesting progress. Yeah, that's awesome. I don't know how much you can go in depth into it, but what does it take to actually abstract all of this from the end user? You know, like they don't need to know what GPUs you run,
Starting point is 00:44:46 what cloud you're running them on, you take all of that away. What was that like as an engineering challenge? So I think they are engineering challenges on, In fact, first of all, you need to be able to support all the kind of hardware backhand you have. On one hand, if you look at the media library, you'll find very surprisingly, not too surprisingly, most of the latest libraries works well for on the latest GPU. But there are other GPUs out there in the cloud as well.
Starting point is 00:45:10 So certainly being able to have no-house and being able to do model optimization, is one thing, right. Also, infrastructures on being able to scale and sub, locate models. And in a lot of cases, we do find that on typical models, it also requires kind of vertical iterations. So it's not about, you know, build a civil bullet, and that civil bullet is going to solve all the problems. It's more about, you know, we're building a product. We'll work with the users, and we find out there are interesting opportunities in a certain point, and then when our engineer will go and solve that. And it will automatically reflect it in the service.
Starting point is 00:45:45 Awesome. We can jump into the lightning ground until I don't know, Sean, if you have more questions. or TQ, if you have more stuff you wanted to talk about that we didn't get a chance to touch on. Yeah, we have talked a lot. So, yeah. We always would like to ask, you know, do you have a commentary on other parts of AI and ML that is interesting to you? So right now, I think one thing that we are really pushing hard for is this question about how far can we bring open source, right?
Starting point is 00:46:13 I'm kind of like a hacker and I really like to put things together. So I think this unclear in the future of what the future of AI looks like. On one hand, it could be possible that you just have a few big players. You just talk to those bigger language models and that can do everything. On the other hand, one of the things that, well, in academic, I really excited pushing for this. One reason why I'm pushing for MLC is that can we build something where you have different models, you have personal models that knows the best movie you like,
Starting point is 00:46:44 but you also have bigger models that maybe no more, and you get those models intact with each other, and be able to have a wide ecosystem, or AI agents that helps each person's while still be able to do things like personalizations. Some of them can run locally, some of them, of course, running on the cloud, and how do they interact with each other? So I think that is, we are in a very exciting time
Starting point is 00:47:11 where the future is yet undecided, but I feel like there's something we can do to shape that future. well. One more thing, which is something I'm also pursuing, which is, and this kind of goes back into predictions, but also back in your history, do you have any idea, or are you looking out for anything post-transformers as far as architecture is concerned?
Starting point is 00:47:33 I think, you know, in a lot of these cases, you can find there are already problems in models for long contexts, right? There are space-based models where, like, you know, a lot of some of my colleagues from Albert, who he worked on this hippo, models, right? And there is open-source version called RWKV. It's like recurrent models that allows you to summarize things. Actually, we are bringing RWKV to MOC as well. So maybe you'll be able to see one of the models. We actually recorded an episode with one of the RWKV core members.
Starting point is 00:48:05 It's unclear because there's no academic backing. It's just open-source people. Oh, I see. So you like the merging of recurrent networks and transformers? I do love to see this model space continue growing, right? And I feel like in a lot of cases, it's just that attention mechanism is getting changed in some sense. So I feel like definitely there are still a lot of things to be exploring here. And that is also one reason why we want to keep pushing machine in compilation, because one of the things we are trying to push in was productivity. So that for machinery engineering, so that as soon as some of the model came out,
Starting point is 00:48:39 we will be able to empower them onto those environments that's out there. Yeah. Yeah, it's a really good mission. Okay, very excited to see that our, Olivia KV, and Stakeswitz model stuff. I'm hearing increasing chatter about that stuff. Okay, lightning rounds as always fun. I'll take the first one.
Starting point is 00:48:55 Acceleration. What has already happened in AI that you thought would take much longer? Emergence of more like a conversation chatbot ability is something that kind of surprised me before it came out. This is like one piece that I feel. Original eyes thought would take much longer, but yeah, it happens. It's funny because the original like Eliza chatbot was something that goes all the way back in time, right? And then we just suddenly came back again. Yeah, it's always too interesting
Starting point is 00:49:22 came back, but with kind of a different technology in some sense. What about the most interesting unsolved question in AI? That's a hard one, right? So I can tell you what kind of I'm excited about. So I think that I have always been excited about this idea of continuous learning and lifelong learning in some sense. So how an AI continues to evolve with the knowledge of stepping there. It seems that we're getting much closer with all those latent recent technologies. So being able to be able to develop systems, support, and be able to think about how AI continuously evolve is something that I'm really excited about.
Starting point is 00:50:01 So specifically, just to double-click on this, are you talking about continuous training? There's like a training. I feel like training adaptation and it's all similar things. You want to think about an entire life cycle, right? the life cycle of collecting data, training, fine-tuning, and maybe have your local context that's getting continuously curated and feed onto models. So I think all these things are interesting and relevant here.
Starting point is 00:50:29 Yeah, I think this is something that people are really asking. Right now, we have moved a lot into the sort of pre-training phase and off-the-shelf model downloads and stuff like that, which seems very counterintuitive compared to the continuous training paradigm that people want. So I guess the last question would be for takeaways. What's basically one message that you want every listener, every person to remember today?
Starting point is 00:50:54 I think it's getting more obvious now, but I think one of the things I always want to mention in my talks is that when you think about AI applications, originally people think about algorithms, a lot more. Our algorithm models, they are still very important. But usually when you build AI applications takes you know, both algorithm side, the system optimizations, and the data curations, right?
Starting point is 00:51:20 So it takes like a connection of so many facades to be able to bring together an AI system and be able to looking at from that hardest perspective is really useful. Always start to build modern AI applications. I think it's going to continue going to be more important in the future. Thank you for showing the way on this. And honestly, just making things possible that I thought would take a lot longer. So thanks for everything you've done. Thank you for having me.
Starting point is 00:51:47 Yeah. Thanks for coming on TQ. Have a good one.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.