Limitless Podcast - Kimi K2 is the Open Source Claude-Killer | US vs China AI
Episode Date: July 16, 2025

Kimi K2 is a groundbreaking open-source AI model from China with 1 trillion parameters. We discuss its competitive advantages, including low operational costs and superior coding capabilities through a "mixture of experts" approach. Josh highlights the implications for AI competition as Kimi K2 emerges in the market alongside OpenAI's plans for an open-source model. We also explore Kimi K2's two versions, Base and Instruct, its impact on the AI landscape, and the challenges faced by OpenAI's ChatGPT, xAI's Grok, and Anthropic's Claude. Tune in for key insights on how Kimi K2 could reshape AI development!

💫 LIMITLESS | SUBSCRIBE & FOLLOW
https://limitless.bankless.com/
https://x.com/LimitlessFT

TIMESTAMPS
0:00 Intro
0:58 The Rise of Kimi K2
2:49 Efficiency and Cost Benefits
3:53 Training Breakthroughs Explained
5:37 Innovations in AI Training
6:30 The Impact of Open Source
8:05 Competitive Landscape of AI
9:41 Context Window Capabilities
12:55 The Surge of Kimi K2
15:36 Market Adoption Insights
19:57 Versions of Kimi K2
24:21 Privacy and Local AI
26:30 The AI Talent Landscape
31:04 China's AI Competitive Edge
32:40 Open Source vs. Closed Source
40:19 Closing Thoughts and Future Prospects
42:49 Get Involved

RESOURCES
Josh: https://x.com/Josh_Kale
Ejaaz: https://x.com/cryptopunk7213

Not financial or tax advice. See our investment disclosures here:
https://www.bankless.com/disclosures
Transcript
A bunch of AI researchers from China just released a brand new AI model called Kimi K2,
which is not only as good as any other top model like Claude,
but it is also 100% open source,
which means it's free to take, customize and create into your own brand new AI model.
This thing is amazing at coding.
It beats any other model at creative writing, and it also has a pretty insane voice mode.
Oh, and I should probably mention that it is one trillion parameters in size,
which makes it one of the largest models ever created.
Josh, we were winding down on a Friday night and this news broke.
They dropped the bomb.
Absolutely crazy bomb, especially with OpenAI rumored to release their open source model this week.
You've been jumping into this.
What's your take?
Yeah, so last week we crowned Grok 4 as the new leading private, closed-source model.
This week we've got to give the crown to Kimi K2.
We got another crown going for the open source team.
They are winning.
I mean, this is better than DeepSeek and DeepSeek R2.
This is basically DeepSeek R3, I would imagine.
And if you remember back a couple months,
DeepSeek really flipped the world on its head because of how efficient it was
and the algorithmic upgrades it made.
And I think what we see with Kimi K2 is a lot of the same thing.
It's these novel breakthroughs that come as a downstream effect of their needing to be resourceful.
So China, they don't have the mega GPU clusters we have.
They don't have all the cutting edge hardware.
but they do have the software prowess to find these efficiencies.
I think that's what makes this model so special.
And that's what we're going to get into here is specifically what they did to make this model so special.
I mean, look at these stats here, Josh: one trillion parameters in total,
but it's a mixture-of-experts model with only 32 billion active parameters.
Typically these AI models can become pretty inefficient
when they're this large in size.
But it uses this technique called mixture of experts,
which means that whenever someone queries the model,
it only uses or activates the subset of parameters
that are relevant for the query itself.
So it's smarter, it's much more efficient,
and it doesn't use or consume as much energy
as you would if you wanted to run it locally at home
or whatever that might be.
It's also super cheap.
I think I saw somewhere that this was 20% the cost of Claude, Josh,
which is just insane.
For all the nerds that kind of want to run, you know,
really long tasks or just set and forget the AI to run on like your coding log or whatever that
might mean, you can now do it at a much more affordable rate, at one-fifth the cost of some of the
top models that are out there. And it is as good as those models. So just insane kinds of things.
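To make the "one-fifth the cost" claim concrete, here's a toy calculator. The per-million-token prices and the usage volume below are illustrative assumptions for the sketch, not published rates:

```python
# Back-of-envelope: what "20% the cost" means for heavy agentic coding use.
# All prices and volumes here are illustrative assumptions, not real rates.
CLAUDE_PRICE_PER_MTOK = 15.00                      # assumed $ per million tokens
K2_PRICE_PER_MTOK = CLAUDE_PRICE_PER_MTOK * 0.20   # the "20% the cost" claim

def session_cost(millions_of_tokens, price_per_mtok):
    """Cost in dollars for a given token volume at a given price."""
    return millions_of_tokens * price_per_mtok

TOKENS = 50  # a heavy month of agentic coding, in millions of tokens (assumed)
print(session_cost(TOKENS, CLAUDE_PRICE_PER_MTOK))  # 750.0
print(session_cost(TOKENS, K2_PRICE_PER_MTOK))      # 150.0
```

At the same output quality, that difference is what makes "set and forget" long-running agents affordable.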
Josh, I know there's a bunch of things that you wanted to point out here on benchmarks.
What do you want to get into? Yeah, it's really amazing. So they took 15.5 trillion tokens.
And they condensed those out into a one trillion parameter model. And then what's amazing is when
you use this model, like you said, it uses a thing called mixture of experts. So it has, I believe,
384 experts. And each expert is good at a specific thing. So let's say in the case you want to do
a math problem, it will take a 32 billion parameter subset of the 1 trillion total parameters,
and it will choose eight of these different experts, each specialized in a specific thing. So in the case of math,
it'll find an expert that has the calculator tool. It'll find an expert that has a
fact-checking tool or a proof tool to make sure that the math is accurate.
It'll have just a series of tools to help itself, and that's kind of how it works so efficiently:
instead of using a trillion parameters at once, it uses just 32 billion,
and it uses the eight best specialists out of the 384 that it has available to it.
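The expert-routing idea described here can be sketched in a few lines. This is a toy illustration, not Moonshot's actual router: the router weights, embedding size, and softmax gating are all assumptions; only the 384-expert and top-8 figures come from the discussion.

```python
import math
import random

# Toy sketch of mixture-of-experts routing (NOT Moonshot's code).
TOTAL_EXPERTS = 384   # experts reported for Kimi K2
TOP_K = 8             # experts activated per query
EMBED_DIM = 64        # toy embedding size (assumption)

def route(token_embedding, router_weights, k=TOP_K):
    """Score every expert for this token and keep only the top k."""
    scores = [sum(w * x for w, x in zip(row, token_embedding))
              for row in router_weights]
    top_k = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    # Numerically stable softmax over the chosen experts' scores.
    m = max(scores[i] for i in top_k)
    exps = [math.exp(scores[i] - m) for i in top_k]
    total = sum(exps)
    gates = [e / total for e in exps]  # each chosen expert's contribution
    return top_k, gates

rng = random.Random(0)
embedding = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
router = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)]
          for _ in range(TOTAL_EXPERTS)]
experts, gates = route(embedding, router)
print(len(experts))          # 8 experts chosen out of 384
print(round(sum(gates), 6))  # gate weights sum to 1.0
```

The key property is that only the selected experts' parameters are touched per token, which is why a 1T-parameter model can run with 32B-parameter cost.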
It's really impressive, and what we see here is the benchmarks that we're showing on screen,
and the benchmarks are really good. It's up there in line with just about any other top model,
except with the exception that this is open source. And there was another breakthrough that
we had, which was the actual way that they handled the training of this. And yeah, this is the
loss curve. So what you're looking at on screen for the people who are listening, it's this really
pretty smooth curve that kind of starts at the top and it trends down in a very predictable
and smooth way. And most curves don't look like this. And if they do look like this, it's because
the company has spent tons and tons of money on error correction to make sure this curve is so smooth.
So basically what you're seeing is the training run of the model. And a lot of times what happens is
you get these very sharp spikes and it starts to deviate away from the normal training run.
And it takes a lot of compute to kind of recalibrate and push that back into the right way.
What they've managed to do is really make it very smooth.
And they've done this by increasing efficiency.
So if you can think about it, there's this analogy I was thinking of right before we hit the record button.
And it's if you were teaching a chef how to cook, right?
So we have Chef Ejaaz here.
I am teaching him how to cook.
I am an expert chef.
and instead of telling him every ingredient and every step for every single dish,
what I tell him is like, hey, if you're making this amazing dinner recipe,
all you need that matters is this amount of salt applied at this time,
this amount of heat applied for this length of time,
and the other stuff doesn't matter as much.
So just put in whatever you think is appropriate,
but you'll get the same answer.
And that's what we see with this model is just an increased amount of efficiency
by being direct, by being intentional about the data that they used to train it on,
the data that they used to fetch in order to give you high-quality queries. And it's a really novel
breakthrough. They call it the MuonClip optimizer, which, I mean, it's a Chinese company. Maybe it
means something special there. But it is a new type of optimizer. And what you're seeing in this
curve is that it's working really well and it's working really efficiently. And that's part of the
benefit of having this open source is now we have this novel breakthrough. And we could take this
and we could use this for even more breakthroughs, even more open source models. And that's part,
that's been really cool to see.
I mean, this is just time and again from China,
so, so amazing from their research team.
So, like, just to kind of pick up on your comment on DeepSeek,
at the end of last year,
we were utterly convinced that the only way to create a breakthrough model
was to spend billions of dollars on compute clusters.
And so therefore it was a pay-to-play game.
And then DeepSeek, a team out of China, released their
model, and completely open sourced it as well. And it was as good as OpenAI's frontier model,
which was the top model at the time. And the revelation there was, oh, you don't actually just need
to chuck a bunch of compute at this. There are different techniques and different methods.
If you get creative about how you design your model and how you run the training cluster,
the training run, which is basically what you need to do to make your model smart, you can run it
in different ways that are more efficient, consume less energy,
and therefore cost less money,
but is as smart, if not smarter
than the frontier models that American AI companies are making.
And this is just a repeat of that, Josh.
I mean, look at this curve.
For those who are looking at this episode on video,
it is just so clean.
Yeah, it's beautiful.
The craziest part about this is when DeepSeek was released,
they pioneered something called reasoning, or reinforcement learning,
which are two separate techniques
that made the model super
smart with less energy and less compute spend.
With this model, they didn't even implement that technique at all.
So theoretically, this model can get so much smarter than it already is.
And they just kind of leveraged a new method to make it as smart as it is right now.
So just such a fascinating kind of like progress in research from China and it just keeps on
coming out. It's so impressive. Yeah, this is this was the exciting part to me is that
we're seeing so many algorithmic, exponential improvements in so many different categories.
So this was considered a breakthrough by all means.
And this wasn't even the same type of breakthrough that DeepSeek had.
So we get this now compounding effect where we have this new training breakthrough.
And then we have DeepSeek, who has the reinforcement learning.
And that hasn't even yet been applied to this new model.
So we get the exponential growth on one end, the exponential growth on the reasoning end.
Those come together.
And then you get the exponential growth on the hardware stack, where the GPUs are getting much
faster. And there's all of these different subsets of AI that are compounding on each other and
growing and accelerating quicker and quicker. And what you get is this unbelievable rate of progress.
And that's what we're seeing. So reasoning isn't even here yet. And we're going to see it soon
because it is open source so people can apply their own reasoning on top of it. I'm sure the
Moonshot team is going to be doing their own reasoning version of this model. And I'm sure we're
going to be getting even more impressive results soon. I see you have a post up here about the testing
and overall performance, can you please share?
Yeah, yeah.
So this is a tweet that summarizes really well
how this model performs in relation to other frontier models.
And the popular comparison that's made for Kimi K2
is against Claude.
So Claude has a bunch of models out.
Claude 3.5 is its earlier model,
and then Claude 4 is its latest.
And the general take is that this model is just better
than those models,
which is just insane to say because for so long,
long, Josh, we've said that Claude was the best coding model. And indeed it was. And then within
the span of, what is it, five days, Grok 4 released and it just completely blew Claude 4 out of the
water in terms of coding. Now Kimi K2, an open source model out of China whose makers don't even have access
to the research and kind of proprietary knowledge that a lot of American AI companies have,
has beaten it as well. Right. So it kind of beats Claude at its own game. But it's also cheaper. It's 20% the cost
of Claude 3.5, which is just an insane thing to say, which means that if you are a developer
out there that wants to try your hand at kind of like vibe coding a bunch of things, or actually
seriously coding something, you know, that's quite novel, but you don't have the hands on deck to do
that, you can now spin up a Kimi K2 AI agent, actually multiple of them for a very cost-efficient,
reasonable, you know, salary. You don't have to pay like hundreds of thousands of dollars or, you know,
hundreds of millions of dollars, which is what Meta is doing to kind of buy a bunch of
these software engineers, you can spend, you know, the equivalent of maybe a Netflix subscription
or 500 to a thousand bucks a month and spin up your own app. So super, super cool. And also one added
perk is that even if you have a lot of GPUs sitting around, you can actually
run this model for free. So that's the cost if you actually query it from the servers, but I'm
sure there's going to be companies that have access to excess GPUs. They can actually just
download the model because it's open source, open weights, and they can run it
on their own, and that brings the cost of compute down to the cost per kilowatt-hour of the energy required
to run the GPUs. So because it's open source, you really start to see these costs decline,
but the quality doesn't. And every time we see this, we see a huge productivity unlock
in coding output and amount of queries used. It's like, this is freaking awesome. Yeah. Josh,
I saw something else come up as well. So do you remember when Claude first released
that frontier model? I think it was 3.5, or maybe it was 4, one of their brands. One of their
bragging rights was it had a
one million token
context window. Oh yes,
which was huge. Yeah, which
for listeners of this show is huge.
It's like several
novels' worth
of words or characters you could
just bung into one single prompt.
And the reason why that was such an amazing thing
was for a while
people struggled to kind of communicate
with these AIs because they couldn't
set the context. There wasn't enough
bandwidth within their chat
log window for them to say, you know, and don't forget this, and then there was this. And then,
you know, this detail and that detail, there just wasn't enough space. And models weren't
performing enough to kind of consume all of this in one go. And then Claude came out and was like,
hey, we have a one million token context window. Don't worry about it. Chuck in all the research papers
that you want, chuck in your essay, chuck in reference books, and we got you. I saw this
tweet that was deleted. I think you sent this to me. We got the screenshots. We always come
with receipts. Yeah, I wonder why they deleted it. But a good catch from you.
Yeah, let's get at this. What's your take on this, Josh?
It was first posted, I think, earlier today, yeah, like an hour ago, and then deleted pretty shortly
afterwards. And this is from a woman named Crystal. Crystal works with the Moonshot team.
She is part of the team that released Kimi K2. And in this post, it says, Kimi isn't just another
AI. It went viral in China as the first to support a two million token context window.
And then she goes on to say, we're an AI lab with just 200 people, which is minuscule
compared to a lot of the other labs they're competing with. And it was acknowledgement that they had a
two million token context window. And for those who need it, just a quick refresher on the context window stuff:
imagine you have a gigantic textbook and you've read it once and you close it and you
kind of have a fuzzy memory of all the pages. The context window allows you to lay all of those out
in clear view and directly reference every single page. So when you have two million tokens,
which is roughly a million and a half words of context, we're talking about a whole shelf of
books and textbooks worth of knowledge, and you could really dump a lot of information in this for the AI
to readily access. And if they released that, a 2 million token open source model, that's a
huge deal. I mean, even Grok 4 recently, I believe, what did we say it was? It was a 256,000
token context window, something like that. So Grok 4 has one-eighth of what they supposedly have
accessible right now, which is a really, really big deal. So I'm hoping it was deleted because
they just don't want to share that, not because it's not true.
I would like to believe that it's true because, man, that'd be pretty epic.
Yeah, and the people are loving it, Josh.
Check out this graph from OpenRouter, which basically shows the split of usage between everyone
on their platform that are querying different models.
So for context here, Open Router is a website that you can go to and you can type up a prompt,
just like you do with ChatGPT.
And you can decide which model
your prompt goes to, or you could let OpenRouter decide for you,
and it kind of divvies up your query.
So if you have a coding query, it's probably going to send it to Claude,
or now Kimi K2, or Grok 4.
But if you have something that's more to do with creative writing
or something that's like a case study, it might send it to OpenAI's o3 model, right?
So it kind of like decides for you.
OpenRouter released this graphic,
which basically shows that Kimi K2 surpassed xAI in token market share
just a few days after launching,
which basically means that xAI spent, you know,
hundreds of millions of dollars training up their Grok 4 model,
which just kind of beat out the competition last week.
Then Kimi K2 gets released, completely open source,
and everyone starts to use that more than Grok 4,
which is just an insane thing to say
and just shows how rapidly these AI models compete with each other
and surpass each other.
I think part of the reason for this, Josh, is it's open source.
right? Which means that not only are retail users like myself and yourself using it for our daily queries,
you know, create this recipe for me or whatever, but researchers and builders all over the world
that, you know, have so far been challenged or had this obstacle of needing, you know, pots of money
basically to start their own AI company, now have access to a frontier, world-renowned model
and can create whatever application, website or product
they want to make. So I think that's part of the usage there as well. Do you have any takes on this?
Yeah, and it's downstream of cost, right? Like, we always see this when a model is cheaper and
mostly equivalent, the money will always flow to the cheaper model. It'll always get more queries.
I think it's important to note the different use cases of these models. So they're not directly
competing head-to-head on the same benchmarks. I think what we see is like when we talk about
Claude, it's generally known as the coding model. And I don't think, like, OpenAI's o3 is really
competing directly with Claude, because it's more of a general intelligence versus a coding-specific
intelligence. K2 is probably closer to a Claude, I would assume, where it's really good at coding
because it uses this mixture of experts. And I think that helps it find the tools. It uses this
cool new novel thing called like multiple tool use. So each one of these experts can use a tool
simultaneously. And they could use these tools and work together to get better answers. So in the
case of coding, this is a home run. Like it is very cheap cost for token, very high-quality output.
I actually think it can compete with OpenAI's o3, Josh.
Check this out.
Oh?
So, Rowan, yeah, Rowan Cheung put this out yesterday and he basically goes, I think we're
at the tipping point for AI generated writing.
It's been notoriously bad, but China's Kimi K2, an open weight model, is now topping
creative writing benchmarks.
So just to put that into context, that's like having the topmost, I don't know,
smartest or slightly autistic software engineer at
the top engineering company working on AI models
also being the best poet or creative scriptwriter,
directing the next best movie or whatever that might be,
or creating a Harry Potter novel series.
This model can basically do both.
And what it's pointing out here is that compared to o3,
it tops it.
Look at this.
Completely beats it.
Yep.
Okay, so I take that back.
Maybe it is just better at everything.
Yeah, and that's some pretty impressive results.
I think, like, what's worth pointing out here is,
and I don't know whether any of the American AI models do this, Josh,
but mixture of experts seems to be clearly a win here.
The ability to create an incredibly smart model
doesn't come without, you know, this large storage load that is needed, right?
One trillion parameters.
But then combining it with the ability to be like,
hey, you don't need to query the entire thing.
We've got you.
We have a smart router which basically pulls on the best experts, as you described earlier,
for whatever relevant query you have.
So if you have a creative writing task or if you have a coding thing, we'll send it to two
different departments of this model.
That's a really huge win.
Do any other American models use this?
Well, the first thing that came to my mind when you said that is Grok 4, which doesn't
exactly use this, but uses a similar thing, where instead of using a mixture of experts,
it uses a mixture of agents.
So Grok 4 Heavy uses a bunch of distributed
agents that are basically clones of the large model. But that takes up a tremendous amount of compute.
And that is the $300 a month plan. That's replicating Grok 4, though, right? So that's like taking
the model and copy-pasting it. So let's say Grok 4 was one trillion parameters just for ease
of comparison. That's like creating, if there were those four agents, that's four trillion parameters,
right? So it's still pretty costly and inefficient. Is that what you're saying?
It's actually the opposite direction of K2. So what they have
used is just, and again, this is kind of similar to tracking sentiment between the United States
and China, where the United States will throw compute at it, where China will throw
kind of clever resourcefulness at it. So Grok, yeah, when they use their mixture of agents,
it actually just costs a lot more money, whereas K2, when they use their mixture of experts,
well, it costs a lot less. Instead of using four trillion parameters in this case, it uses just 32 billion,
and it reuses those 32 billion parameters over and over. And it's really, it's a really elegant solution
that seems to be yielding pretty comparable results.
So I think as we see these efficiency upgrades,
I'm sure they will eventually trickle down into the United States models.
And when they do, that is going to be a huge unlock in terms of cost per token,
in terms of the smaller distilled models that we're going to be able to run on our own computers.
But yeah, I don't know of anyone else who is using it at this scale.
It might be novel to just K2 right now.
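The parameter arithmetic in this comparison is easy to make concrete. Note the 1 trillion figure for Grok 4 is the hypothetical used above "for ease of comparison", not a published spec:

```python
# Active parameters per query under the two approaches discussed above.
# The 1T figure for Grok 4 is the hosts' hypothetical, not a published number.
FULL_MODEL_PARAMS = 1_000_000_000_000  # one full copy of the model (assumed)
NUM_AGENTS = 4                         # mixture of agents: four full clones
MOE_ACTIVE_PARAMS = 32_000_000_000     # mixture of experts: 32B active

agents_active = NUM_AGENTS * FULL_MODEL_PARAMS
ratio = agents_active // MOE_ACTIVE_PARAMS
print(ratio)  # 125: the agent approach touches ~125x more parameters per query
```

Under these assumptions, that 125x gap in activated parameters is roughly the gap in per-query compute, which is why the mixture-of-experts route is so much cheaper to serve.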
And I think that this is the method that probably scales the best, Josh.
Like, I, it makes sense.
Efficiency always wins at the end, right?
And to see this kind of innovation come pretty early on in a technology's life cycle is just super impressive to see.
Another thing I saw is there's two different versions of this model.
I believe there's something called Kimi K2 Base, which is basically the model for researchers who want full control for fine tuning and custom solutions, right?
So imagine this model as the entire parameter set.
So you have access to 1 trillion parameters, all the weight designs and everything.
And if you're a nerd that wants to nerd out, you can go crazy.
You know, if you have like your own GPU cluster at home or if you happen to have a convenient warehouse full of servers that you weirdly have access to, you can go crazy with it.
You can, if you think about the early gaming days of Counterstrike and how you could, like, mod it, you can basically
mod this model to your heart's desire. And then there's a second version called K2 Instruct,
which is for drop-in general purpose chat and AI agent experiences. So this is kind of like at
the consumer level, if you're experimenting with these things or if you want to run an experiment
at home on a specific use case, you can kind of like take that away and do that for yourself.
That's how I understand it, Josh. Do you have any takes on this? That makes sense. And I think
that second version that you're describing is what's actually available publicly on
their website, right? So if you go to Kimi.com, it has a text box. It looks just like ChatGPT,
like you're used to. And that's where you can run that second-tier model, which you described
as the drop-in general purpose chat. And then, yeah, for the hardcore researchers,
there is a GitHub repo, and the GitHub repo has all the weights and all the code. And you can really
download it, dive in, use the full thing. I was playing around with the Kimi tool. And it's really cool.
It's fast. Oh, I mean, it's lightning fast. If you go from a reasoning model to a non-reasoning model
like Kimi, you get responses like this.
Like when I'm using Grok 4 or o3, I'm sitting there sometimes for a couple of minutes waiting
for an answer.
This, you type it in and it just types back right away, no time waiting.
So it's kind of refreshing to see that.
But it's also a testament to how impressive it is.
I'm getting great answers and it's just spitting it right out.
So what happens when they add the reasoning layer on top?
Well, it's probably going to get pretty freaking good.
So the trend we're seeing, and we saw this last week with Grok 4, is typically we're expected
to wait a while when we send a prompt to a breakthrough model
because it's thinking, it's trying to basically replicate
what we have in our brains up here.
And now it's just getting much quicker and much smarter
and much cheaper.
So the long story short is these models are incredibly powerful.
I kind of think about it as how we went from massive desktop computers
to slick cell phones, Josh,
and then we're going to eventually have chips in our brain.
AI is just kind of like fast-tracking that entire lifecycle
within like a couple of years, which is just insane.
And these efficiency improvements are really exciting because you can see how quickly they're shrinking
and allowing eventually for those incredible models to just run on our phones.
So there's totally a world a year from now in which a Grok 4-, o3-, or Kimi K2-capable model
is small enough that it could just run inside our phone and run on a mobile device,
or run locally on a laptop when you're offline.
And you kind of have this portable intelligence that's available everywhere anytime,
even if you're not connected to the world.
And that seems really cool.
Like we were talking a few episodes ago about Apple's local, free AI inference running on an iPhone,
but how the base models still kind of suck.
Like they don't really do anything super interesting.
They're basically good enough to do what you would expect Siri to do but can't do.
And these models, as we get more and more breakthroughs like this that allow you to run
much larger parameter counts on a much smaller device, it's going to start really superpowering
these mobile devices.
And I can't help but think about the Open AI hardware device.
I'm like, wow, that'd be super cool.
You had, like, o3 running locally in the middle of the jungle somewhere with no service
and you still had access to all of its capabilities.
Like that's probably coming downstream of breakthroughs like this,
where we get really big efficiency unlocks.
I mean, it's not just efficiency though, right?
It's the fact that if you can run it locally on your device,
it can have access to all your private data without exposing all of that to the model providers
themselves, right? So one of the major concerns of not just AI models, but also with mobile phones,
is privacy. I don't want to share all my kind of like private health, financial and social media
data, because then you're just going to have everything on me and you're going to use me.
You're going to use me as a product, right? And that's kind of been the status quo for the last decade
in tech. And so with AI, that's a supercharged version of it. The information gets more personal.
It's not just your likes. It's, you know, where Josh shops every day and, you know,
who he's dating and all these kinds of things, right?
And that becomes quite personal and intrusive very quickly.
So the question then becomes,
how can we have the magic of an AI model without it being so obtrusive?
And that is open source locally run AI or privately run AI.
And Kimi K2 is a frontier model that can technically run on your local device.
If you set up the right hardware for it and the way that we're trending,
you can basically end up having that on your device,
which is just a huge unlock.
And if you can imagine how you use OpenAI's o3 right now, Josh, right?
I know you use it as much as I do.
The reason why you and I use it so much isn't just because it's so smart,
but it's because it remembers everything about us.
But I hate that Sam knows or has access to all that data.
I hate that if he chooses to switch on personalized ads,
which is currently the model where most of these tech companies make money right now,
he can.
And there's nothing I can do about it, because I don't want to use any other model
apart from that. But if there was a locally run model that had access to all the memory and context,
I'd use that instead.
And this is suspicious.
I mean, this is a different conversation in total, but isn't it interesting how other companies
haven't really leaned into memory when it's seemingly the most important moat that there is?
Like, Grok 4 doesn't have good memory rolled out.
Gemini doesn't really have memory.
Claude doesn't have memory the way that OpenAI does, yet it's the single biggest
reason why we both continue to go back to ChatGPT and OpenAI.
So that's just been an interesting thing.
I mean, Kimmy is open source.
I wouldn't expect them to lean too much into it.
But for these close source models, that's just, it's another interesting just observation.
Like, hey, the most important thing isn't, doesn't seem to be prioritized by other companies just yet.
Why do you think that is?
So my theory, at least from xAI or Grok 4's perspective, is Elon's like, okay, I'm not going to be able to build a better chatbot or chat messenger than OpenAI has.
There aren't too many features that can set Grok 4 apart
that o3 doesn't already do, right?
But where I can beat O3 is at the app layer.
I can create a better app store than they have
because I haven't already created one.
That is sticky enough for users to continually use,
and I can use that dataset to then unlock memory and context at that point, right?
So I just saw today that they released, they being XAI,
released a new feature for Grok 4 called,
I think it's Companions, Josh.
Oh, yeah, play with it.
These animated avatar-like characters,
so they basically look like they're from an anime show.
And you know how you can use voice mode in OpenAI
and you can kind of like talk to this realistic human-sounding AI?
You now have a face and a character on Grok 4.
And it's really entertaining, Josh.
Like, I find myself kind of like engaged in this thing because I'm not just typing words.
It's not just this binary to and fro with this chat messenger.
It's this human, this cute, attractive human that I'm just like now speaking to.
And I think that that's the strategy that a lot of these AI companies, if I had to guess,
are taking to kind of like seed their user base before they unlock memory.
I don't know whether you have a take on that.
Yeah, I have a fun little demo.
I actually played around with it this morning, and it was totally unhinged.
No filter, very vulgar, but like kind of fun. It's like a fun little party trick. And yeah, I mean, that was a surprise to me this morning when I saw that rolled out. I was like, huh, that doesn't really seem like it makes sense. But I think they're just having fun with it. Can we for a second talk about the team? So we've mentioned just now how they've all come from China and how China's like really advancing open source AI models and they've completely beat out the competition in America. Meta's Llama being the obvious one.
We've got Qwen from Alibaba, we've got DeepSeek R1, now we have Kimi K2.
The team is basically the AI Avengers of China, Josh.
So these three co-founders all have deep AI ML backgrounds that hail from like the top American universities such as Carnegie Mellon.
One of them has like a PhD from Carnegie Mellon in machine learning, which, for those of you who don't know, is basically a God-tier degree for AI.
That means you're desirable and hireable by every other AI company after you graduate.
But on top of that, they also have credibility and degrees from the top universities in China,
especially this one university called Tsinghua, which seems to be at the top of the field.
I looked them up on rankings for AI universities globally,
and they often come in at number three or four in the top 10 AI universities.
So pretty impressive from there.
But what I found really interesting, Josh, was that one of the co-founders
was an expert in training AI models on low-cost, optimized hardware.
And the reason why I mentioned this is it's no secret that if you want a top-frontier
AI model, you need to train it on Nvidia's GPUs.
You need to train it on Nvidia's hardware.
Nvidia's market cap, I think, at the end of last week, surpassed $4 trillion.
That's $4 trillion with a T.
that is more than the current GDP of the entire British economy.
And the largest in the world.
There's never been a bigger company.
There's never been a bigger company.
It's just insane to wrap your head around.
And it's not without reason.
They supply basically, or they have a grasp or a monopoly on the hardware that is needed to train top models.
Now, Kimi K2 comes along, casually drops a one trillion parameter model, one of the largest models ever released,
and it's trained on hardware that isn't Nvidia's.
And Jensen Huang, I need to find this clip, Josh,
but Jensen Huang basically was on stage.
I think it was at a private conference maybe yesterday,
but he was quoted as saying 50% of the top AI researchers are Chinese and are from China.
And what he was implicitly getting at is they're a real threat now.
I think for the last decade we've kind of been like,
ah, yeah, China's just going to copy, paste everything that comes out of America's
tech sector. But when it comes to AI, we've kind of like maintained the same mindset up until now
where they're really just competing with us. And if they have the hardware, they have the
ability to research new techniques to train these models, like DeepSeek's reinforcement learning
and reasoning. And then Kimi K2's kind of like efficient training run, which you showed earlier,
they've come to play, Josh. And I think it's worth highlighting that China has a very strong
grasp on top AI researchers in the world and models that are coming out of it.
Where are their $100 million offers?
I haven't seen any of those coming through.
None, dude.
The most impressive thing is that they do it without the resources that we have.
Imagine if they did have access to the clusters of these H100s that Nvidia is making.
I mean, that would be, would they crush us?
And we kind of have this timeline here.
We're kind of running up against the edge of energy that we have available to us to train these massive models, whereas China does not have that constraint. They have significantly more energy to power these. So in the event, the inevitable event, that they do get the chips and they are able to train at the scale that we are, I'm not sure we're able to continue our rate of acceleration in terms of hardware manufacturing, large training, as fast as they will. And they already have done the hard work on the software efficiency side. They've cranked out every
single efficiency because they
are doing it on constrained hardware. So
it's going to create this really interesting effect where
they're coming at it from the
ingenuity software approach. We're coming at
it from the brute force, throw a lot of compute added
approach, and we'll see where both sides end up.
But it's clear that China is still behind because they are the ones
open sourcing the models. And we know at this
point now, if you're open-sourcing your model, you're doing it
because you're behind. Yeah, yeah. I mean,
one thing that did surprise me, Josh, was that they
released a one trillion parameter open source model.
I didn't expect them to catch up that quickly.
Like, one trillion parameters is a lot.
Yeah.
Another thing I was thinking about is China has dominated hardware for so long now.
So it wouldn't really surprise me if, like, I don't know, a couple years from now,
they're producing better models at specific things, basically because they have better hardware
than America, than the West.
But where I think the West will continue to dominate is at the application layer.
And I don't know, if I was a betting man, I would say that most of the money is eventually
going to be made on the application side of things.
I think Grok 4 is starting to kind of show that with all these different kinds of novel
features that they're releasing.
I don't know if you've seen some of the games that are being produced from Grok 4, Josh,
but it is utterly insane.
And I haven't seen any similar examples come out of Asia from any of their AI models,
even when they have access to American models.
So I still think America dominates at the app layer.
But Josh, I just came across this tweet,
which you reminded me of earlier.
Tell me about OpenAI's open source model strategy,
because I got this tweet pulled up from Sam Altman,
which is kind of hilarious.
Yeah, all right.
So this week, if you remember from our episode last week,
we were excited about talking about OpenAI's new open source model.
OpenAI, open source model, all checks out.
This was going to be the big week.
They were going to release their new flagship, open source.
Well, conveniently, I think the same day as K2 launched, later in the day,
or perhaps the very next morning, Sam Altman posted a tweet. He says, hey, we plan to launch our
open weights model next week. We are delaying it. We need time to run additional safety tests and
review high-risk areas. We are not yet sure how long it will take us. While we trust the
community will build great things with this model, once weights are out, they can't be pulled back.
This is new for us and we want to get it right. Sorry to be the bearer of bad news. We are working super
hard. So there's a few points of speculation. The first, obviously, being, did you just get your
ass handed to you? And now you are going back to reevaluate before you push out a new model. So that's
one possible thing where they saw K2. They were like, oh, boy, this is pretty sweet. This is our first
open source model. We probably don't want to be lower than them. And there is the second point of
speculation, which Ejaaz, you mentioned to me a little earlier today, where maybe something went wrong
with the training run. And it's not quite that they're getting beat up by a Chinese company.
It's that like they actually made a mistake of their own accord. And can you explain to me
specifically what that might be, what the speculation is at least? Yeah, well, I'll keep it short.
I think it was a little racist under the hood. And I can't find the tweet, but basically one of these
AI researchers slash product builders on X got access to the model supposedly, according to him.
and he tested it out in the background.
And he said, yeah, it's not really an intelligence thing.
It's just worse than what you'd expect from an alignment and consumer-facing approach.
It was ill-mannered.
It was saying some pretty wild shit, kind of the stuff that you'd expect coming out of 4chan.
And so Sam Altman decided to delay whilst they kind of like figured out why it was kind of acting out.
Got it.
Okay.
So we'll leave that speculation where it is.
There's a funny post that I'll actually share with you if you want to throw it up,
which was actually from Elon.
And we'll abbreviate, but it was like,
Elon was basically saying it's hard to avoid the libtard slash MechaHitler extremes, both of them,
because they're on so polar opposite ends of the spectrum.
And he said he spent several hours trying to solve this problem with the system prompt,
but there's too much garbage coming in at the foundation model level.
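To make the "system prompt" idea concrete, here's a minimal sketch in Python, with hypothetical names rather than any specific vendor's API: the same fixed instruction gets prepended to every user query before it reaches the model, and that instruction is the knob being tuned here.

```python
# Hypothetical sketch: one fixed system instruction wraps every user query.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Be truthful and avoid offensive content."
)

def build_messages(user_query: str) -> list:
    """Prepend the fixed system instruction to a single user query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]

msgs = build_messages("Summarize today's AI news.")
print(msgs[0]["role"])  # → system
```

The limitation Elon describes follows from this shape: the system prompt is just one message layered on top, so it can steer but not override what the foundation model absorbed from its training data.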
So basically, I mean, what happens with these models is you train them based on all the
human knowledge that exists, right? So everything that we've believed, all the ideas that we've
shared, it's been fed into these models. And what happens is you can try to adjust how they
interpret this data through the system prompt, which is basically an instruction that every single
query gets passed through. But at some point, the model is still reliant on this swath of human data that is just
too overbearing. And that's kind of what Elon shared. And the difference between OpenAI and
Grok is that Grok will just ship the crazy update. And that's what they did. And they caught a lot of
backlash for it. But what I find interesting and what I'm sure OpenAI will probably follow is this
last paragraph where he says, our V7 foundation model should be much better and we're being far more
selective about training data rather than just training on the entire internet. So what they're planning
to do is solve this problem, which is what I assume OpenAI probably ran into in the case that
the AI training model kind of went off the rails and it started saying bad things about lots of people,
is that you kind of have to rebuild the foundation model with new sets of data. And in the case of
Grok, I know one of the intentions for V7 is actually to generate its own dataset of
synthetic data from their models. And I'm assuming OpenAI will probably have to do this
too if they want to calibrate what a lot of people call the temperature, which is the
amount of variance a model uses in its responses. And I don't know, I think we're going to start
to see interesting approaches from that because as they get smarter, you really don't want them
to necessarily have these evil traits as the default. And it's very hard to
get around that when you train them on the data that they've been trained on so far.
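In the standard sampling sense, temperature rescales the model's raw scores before they're turned into probabilities, which is roughly the "variance" dial being described. A small self-contained sketch:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw model scores (logits) into sampling probabilities.
    Lower temperature sharpens the distribution (more predictable output);
    higher temperature flattens it (more varied, riskier output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
# At low temperature the top token hogs the probability mass,
# so the model's answers become less surprising.
assert cool[0] > hot[0]
```

Note this only reshapes how an already-trained model samples; it can't remove what's baked into the weights, which is why the fix discussed above is retraining on curated data rather than just turning a dial.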
It just goes to show how I guess cumbersome it is to train these models, Josh.
It's such a hard thing.
Yeah.
Yeah.
It's not something that you can just kind of like jump into the code and tweak a few things.
Most of the time, you don't know what's wrong with the model or where it went wrong.
I mean, we've talked about this on a previous episode, but essentially if you build out this
model, right, you spend hundreds of millions of dollars.
And then you feed it a query.
So you put something in and then you wait to see what it spits out.
You don't really know what it's going to spit out.
You can't predict it.
It's completely probabilistic.
And so if you release a model and it starts being a little racist or, you know, kind of crazy,
you have to kind of like go back to the drawing board and you have to analyze many different sectors of this model.
Like, was it the data that was poisoned, or was it the way that we trained it, or maybe it was a particular model weight
that we tweaked too much, or whatever that might be.
So I think over time it's going to get a lot easier
once we understand how these models actually work.
But my God, it must be so expensive
to just continually rerun and retrain these models.
Yeah, when you think about a coherent cluster of 200,000 GPUs,
the amount of energy, the amount of resources,
just to retrain a mistake is huge.
So I think, I mean, the more we go into it,
the deeper we get, the more it kind of makes sense
paying so much money for talent to avoid these mistakes
where if you pay $100 million for one employee who will give you a strategic advantage to avoid having
to do another training run, that will cost you more than $100 million.
You've already, you're already in the profit.
So you kind of start to see the scale, the complexity, the difficulties.
I do not envy the challenges that some of these engineers have to face, although I do envy the
salary.
I envy the salary.
I envy the salary.
And I envy the adventure.
Like, how cool must that be trying to build super intelligence for the world as a human
for the first time in like the history of everything.
So it's got to be pretty fun.
This is where we're at now with the open source models, close source models.
K2's pretty epic.
I think that's a home run.
I think we've crowned a new model today.
Do you have any closing thoughts?
Anything you want to add before we wrap up here?
This is pretty amazing.
I think I'm most excited for the episode that we're probably going to release a week from now, Josh,
when we've seen what people have built with this open source model.
That's the best part about this, by the way.
To remind the listener, anyone can take this model right now. You, if you're listening to this,
can take this model right now, run it locally at home and tweak it to your preference. Now, yes,
it's going to be, you know, you kind of need to know how to tweak model weights and stuff,
but I think we're going to see some really cool applications get released over the next week,
and I'm excited to play around with them personally. Yeah, if you're listening to this and you can
run this model, let us know because that means you have quite a solid rig at your home. I'm not sure
the average person is going to be able to run this, but that is the beauty of
the open weights: anybody with the capability of running this can do so. They could tweak it
how they like. And now they have access to the new best open source model in the world,
which, I mean, just a couple of months ago would have been the best model in the world.
So it's moving really quickly. It's really accessible. And I'm sure as the weeks go by,
I mean, hopefully we'll get OpenAI's open source model in the next few weeks. We'll
be able to cover that. But until then, just lots of stuff going on. This was another great episode.
So thank you everyone for tuning in again, for rocking with us. We actually planned on making this like 20 minutes, but we just kind of kept trailing off into more interesting things. There's a lot of interesting stuff to talk about. I mean, there's really, you could take this in a lot of places. So hopefully this was interesting. Go check out Kimi K2. It's really, really impressive. It's really fast. It's really cheap. If you're a developer, give it a try. And yeah, that's been another episode. We'll be back again later this week with another topic. And we'll just keep on chugging along as the frontier
of AI models continues to head west.
Also, we'd love to hear from you guys.
So if you have any suggestions on things that you want us to talk more about,
or maybe there's like some weird model or feature that you just don't understand
and maybe we can do a good job of explaining it, just message us.
Our DMs are open or respond to any of our tweets and we'll be happy to oblige.
Yeah, let us know.
If there's anything cool that we're missing, send it our way and we'll cover it.
That'd be great.
But yeah, we're all going on the journeys together.
like we're learning this as we go. So hopefully today was interesting. And if you did enjoy it,
please share with friends, likes, comment, subscribe, all the great things. And we will see you
on the next episode. Thanks for watching. See you guys. See you.
