Lex Fridman Podcast - #490 – State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI

Episode Date: February 1, 2026

Nathan Lambert and Sebastian Raschka are machine learning researchers, engineers, and educators. Nathan is the post-training lead at the Allen Institute for AI (Ai2) and the author of The RLHF Book. Sebastian Raschka is the author of Build a Large Language Model (From Scratch) and Build a Reasoning Model (From Scratch). Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep490-sc

See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc.
Transcript: https://lexfridman.com/ai-sota-2026-transcript

CONTACT LEX:
Feedback – give feedback to Lex: https://lexfridman.com/survey
AMA – submit questions, videos or call-in: https://lexfridman.com/ama
Hiring – join our team: https://lexfridman.com/hiring
Other – other ways to get in touch: https://lexfridman.com/contact

SPONSORS:
To support this podcast, check out our sponsors & get discounts:
Box: Intelligent content management platform. Go to https://box.com/ai
Quo: Phone system (calls, texts, contacts) for businesses. Go to https://quo.com/lex
UPLIFT Desk: Standing desks and office ergonomics. Go to https://upliftdesk.com/lex
Fin: AI agent for customer service. Go to https://fin.ai/lex
Shopify: Sell stuff online. Go to https://shopify.com/lex
CodeRabbit: AI-powered code reviews. Go to https://coderabbit.ai/lex
LMNT: Zero-sugar electrolyte drink mix. Go to https://drinkLMNT.com/lex
Perplexity: AI-powered answer engine. Go to https://perplexity.ai/

OUTLINE:
(00:00) – Introduction
(01:39) – Sponsors, Comments, and Reflections
(16:29) – China vs US: Who wins the AI race?
(25:11) – ChatGPT vs Claude vs Gemini vs Grok: Who is winning?
(36:11) – Best AI for coding
(43:02) – Open Source vs Closed Source LLMs
(54:41) – Transformers: Evolution of LLMs since 2019
(1:02:38) – AI Scaling Laws: Are they dead or still holding?
(1:18:45) – How AI is trained: Pre-training, Mid-training, and Post-training
(1:51:51) – Post-training explained: Exciting new research directions in LLMs
(2:12:43) – Advice for beginners on how to get into AI development & research
(2:35:36) – Work culture in AI (72+ hour weeks)
(2:39:22) – Silicon Valley bubble
(2:43:19) – Text diffusion models and other new research directions
(2:49:01) – Tool use
(2:53:17) – Continual learning
(2:58:39) – Long context
(3:04:54) – Robotics
(3:14:04) – Timeline to AGI
(3:21:20) – Will AI replace programmers?
(3:39:51) – Is the dream of AGI dying?
(3:46:40) – How AI will make money?
(3:51:02) – Big acquisitions in 2026
(3:55:34) – Future of OpenAI, Anthropic, Google DeepMind, xAI, Meta
(4:08:08) – Manhattan Project for AI
(4:14:42) – Future of NVIDIA, GPUs, and AI compute clusters
(4:22:48) – Future of human civilization

Transcript
Starting point is 00:00:00 The following is a conversation all about the state of the art in artificial intelligence, including some of the exciting technical breakthroughs and developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times, it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert.
Starting point is 00:00:39 They are both widely respected machine learning researchers and engineers, who also happen to be great communicators, educators, writers, and X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike: Build a Large Language Model (From Scratch) and Build a Reasoning Model (From Scratch). I truly believe that in the machine learning and computer science world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI and author of the definitive book on reinforcement learning from human feedback. Both of them have great X accounts, great Substacks.
Starting point is 00:01:31 Sebastian has courses on YouTube. Nathan has a podcast, and everyone should absolutely follow all of those. And now, a quick few-second mention of each sponsor. Check them out in the description or at lexfridman.com/sponsors. It is, in fact, the best way to support this podcast. We got a bunch of great sponsors. Box for intelligent content management, Quo for your phone system, like calls, texts,
Starting point is 00:01:59 contacts for your business, UPLIFT Desk, the desk I'm sitting behind and my favorite office desk, Fin for customer service AI agents, Shopify for selling stuff online, CodeRabbit for AI-powered code review, Element for electrolytes, and of course, our longtime friend,
Starting point is 00:02:19 Perplexity, for curiosity-driven knowledge exploration. Choose wisely, my friends. And now, on to the full ad reads. I try to make them interesting, but if you do skip, please still check out the sponsors. I enjoy their stuff. Maybe you will too. To get in touch with me, for whatever reason, go to lexfridman.com/contact. If you can't tell, I'm trying to have a bit of a pep in my step at the moment, because I had a long night, didn't get much sleep at all, so I am running on fumes, delirious, happy,
Starting point is 00:02:58 unsure of what is reality and what is a dream. In fact, we could right now be living inside of a dream. I have been going through a lot. I have been working insane hours, so much going on. I am so overwhelmed. Of course, as always, truly grateful and happy to be alive, but have not been able to publish as many episodes as I would like, so there's a bunch of sponsors we have to catch up on.
Starting point is 00:03:24 Your support truly means the world. Please check out all the sponsors. If you think it might be useful to you, buy their stuff, it really is the best way to support this podcast. All right, let's go. First up, this episode is brought to you by Box, a cloud-based platform for content management, file sharing, and all kinds of collaboration,
Starting point is 00:03:45 all kinds of content for your businesses. Like with a lot of companies, the big question is, how is AI leveraged to make whatever the business does better? A lot of companies kind of use it for the hype and the label. It's kind of hilarious to watch. People just say, like, powered by AI. I don't care if you're a bakery, powered by AI.
Starting point is 00:04:11 I don't know. But outside of all of the hype, it is one of the most incredible things that humans have ever created. And so companies that can leverage that well are the companies that win. And of course, Box is legendary for its file and content management,
Starting point is 00:04:30 especially when you're talking about scale. So obviously it's amenable to the utilization of AI to help automate some of the document processing, some of the workflow, some of the organization, and they do that exceptionally well. They have a system called, as you could imagine, Box AI that does just that. I love it. They do an excellent implementation.
Starting point is 00:04:51 On the interface side, on the back-end side, everything works extremely nicely. Help scale AI across your organization today and go to box.com/ai. That's box.com/ai to learn more. This episode is also brought to you by Quo, spelled Q-U-O, which also happens to be a company name with just three letters that will help you win at Scrabble. Are you allowed to use company names in Scrabble? How many points is Q? How many points is U? I'm imagining a lot.
Starting point is 00:05:28 That was one of the big confusions to me when I was first learning the English language. It always felt like Q should be at the end of the alphabet, maybe like QZ. It was always surprising to my limited brain capacity that Q was earlier on in the alphabet. what is it, O-P-Q. I can't even actually localize letters in the alphabet. I'm sure that's the case for a lot of people without reading the alphabet in my head sequentially. All of this has to do with short-term and long-term memory access,
Starting point is 00:06:01 the functioning, the limitation of human cognition, and maybe cognitive systems in general, all of it relevant to this particular episode and not so relevant to the awesomeness of Quo, formerly known as OpenPhone, that I should be talking about. Of course, as is always the case, I think the point here, and the point everywhere, and the point of life, is to talk from the heart about whatever you want. And that's what I try to do with everything. And to generalize that even more, to talk whenever I want and to shut the F up whenever I want
Starting point is 00:06:40 and listen. And I prefer that more often than I prefer to talk. Insert clever transition here, because talk is somehow relevant. It is. So Quo, formerly known as OpenPhone, helps over 90,000 businesses manage phone calls, texts, contacts, all kinds of phone-related stuff for business. You have a bunch of customers, a bunch of incoming calls, a bunch of people; on the business side, they have to answer those calls, have to manage it, what's the status of this particular request, voicemails, transcripts, all that kind of stuff, and obviously,
Starting point is 00:07:21 really nice, effective utilization of AI to make that really efficient. But what's really important for things like this is that the interface is good, that team collaboration is good, and Quo delivers on that. Try Quo for free, plus get 20% off your first six months when you go to quo.com/lex. That's Q-U-O dot com slash Lex. Tell your friends about it, because it just might help them win at Scrabble. Speaking of Scrabble, you usually want to play Scrabble on a table. It's such a magical experience. I just had a vision from a distant past of me sitting with a friend and playing Scrabble at a table.
Starting point is 00:08:01 What is this life, full of beautiful memories, and then it's over too soon? Yeah, that melancholy feeling is beautiful, I think. Insert another clever transition, a la Mark Normand, maybe, because the name of this next company is UPLIFT Desk. As I said, it's my go-to favorite office desk, and it's also the desk that I use for podcast furniture. I have, I already lost count.
Starting point is 00:08:40 I have a lot of UPLIFT desks, standing desks, in my place everywhere. It's desks everywhere. I have a mattress on the floor and UPLIFT desks. So I have a Linux box for robotics. I have a machine where I do a lot of the editing. All of that is on a desk. I have the three tables for the podcast desk.
Starting point is 00:09:02 The very ones you've seen over the past several years, those are all UPLIFT desks. I usually don't put them in standing mode, but they are standing desks, which allows me to do all kinds of stuff. Really easy to work with, really nice material, really sturdy. I just love everything about UPLIFT Desk. When they said they wanted to sponsor, after I've been using them for many years, I lost my mind.
Starting point is 00:09:23 I love when I've been in love with a company, in love with their product, for such a long time, and I get to also sing its praises. I mean, come on, what are you going to tell me next, that FFmpeg wants to sponsor this podcast? Another open source project that is not a company that I've been in love with. Anyway, go to upliftdesk.com/lex and use code LEX to get four free accessories, free same-day shipping, free returns, a 15-year warranty, and an extra discount off your entire order.
Starting point is 00:09:58 That's U-P-L-I-F-T-D-E-S-K dot com slash Lex. Does spelling it out really help anybody? I don't know, but they really said, pretty please. The one request is: spell it out. Again, what is this life? Incredible. This episode is also brought to you by Fin, the number one AI agent for customer service.
Starting point is 00:10:24 Find the niche and become number one. That's the idea here for anybody building an AI company. And we talk about this: is the dream of AGI dead? I think for a lot of companies, success is in the niche. There are a few that do it well, and Fin delivers on that niche. It's trusted by over 6,000 customer service leaders at top companies, including AI companies.
Starting point is 00:10:49 When an AI company trusts your company to do its customer service, that means you're legit. 90-day money-back guarantee, up to $1 million. Built to handle complex, multi-step queries like returns, exchanges, and disputes. Go to fin.ai/lex to learn more about transforming your customer service and scaling your support team. That's fin.ai slash Lex. I don't know why I switched to this hyping voice. Crappy announcer, crappy radio jockey, crappy ad-read voice. It is what it is. Thank you for sticking with me this long. I feel the love, and I send it right back at you. This episode is also brought to you by a company
Starting point is 00:11:41 whose engineers are also full of love, Shopify. It just brings a smile to my face. Every time I think about Shopify, I think about how I got to see their engineering booth at NeurIPS, which is a machine learning conference. Really brilliant people, wonderful people. Of course, the CEO, Toby, is still
Starting point is 00:12:11 programming, still building stuff, still in on the details of the engineering, and now is talking quite a bit about utilization of LLMs for his own sort of pet projects, but also inside the company. It's just incredible when, from the very top, the company is in love with engineering. It's a celebration of great engineering, just like the conversation with DHH, who is the guy behind Ruby on Rails, which Shopify was built on. That conversation was a celebration of great engineering, the beauty of engineering as well. Anyway, listen to that episode
Starting point is 00:12:42 to see some of the magic of Ruby on Rails and the magic of Shopify and the magic of Toby that we talk about. Anyway, sign up for a $1 per month trial period at Shopify.com slash Lex. That's all lowercase. Go to Shopify.com slash Lex to take your business to the next level today. This episode is also brought to you by CodeRabbit,
Starting point is 00:13:04 a platform that provides AI-powered code reviews directly within your terminal. We talk a lot in this episode about the timeline for the full automation of the human programmer. I think we're quite far away from taking the human out of the loop. The review process, the debugging process, all of that, that's such a crucial part of programming. Especially, just like we talk about in the episode, when we're not talking about a personal website, where HTML slop is something that a web browser magically, automagically, I don't know how they're possibly able to do such an incredible job of rendering slop, but a web browser is in fact able to render slop,
Starting point is 00:13:57 including AI slop. It just finds a way. So really the question is, when you have production code, something that a lot of users are relying on, how do you review that code? How do you make sure you're catching the errors? How are you making sure that you put a backstop to hallucinations and the logical errors that AI coding agents can generate?
Starting point is 00:14:19 Anyway, CodeRabbit supports all programming languages. Install CodeRabbit CLI today at coderabbit.ai/lex. That's coderabbit.ai slash Lex. This episode is also brought to you by Element, my daily zero-sugar and delicious electrolyte mix. It reminds me of the fact that I need to get to editing the video of me in the jungle with Paul Rosolie, who is such an incredible human.
Starting point is 00:14:55 Congratulations to Paul and all of his success. Go get his book. It's an incredible book. Again, he's an incredible person with an incredible mission. And yes, I need to edit and publish, hoping to, at the very least, tell the story of our journey in the jungle, because it was a beautiful celebration of nature, and friendship, and the full richness of the human experience. It was beautiful. The reason I mention that is, as part of that journey, I was severely dehydrated.
Starting point is 00:15:37 And I remember dreaming of Element, of a cold drink of water with the electrolytes. Your body craves it. Electrolytes: sodium, potassium, magnesium. When you're deprived, it's not just water, it's electrolytes. So anyway, I always remember that. Get a free eight-count sample pack with any purchase at drinkLMNT.com/lex. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me,
Starting point is 00:16:04 ask questions, get feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert. So I think one useful lens to look at all of this through is the so-called DeepSeek moment. This happened about a year ago, in January 2025, when the open-weight Chinese company DeepSeek released DeepSeek R1, which, I think it's fair to say, surprised everyone with near or at state-of-the-art performance with allegedly much less compute, for much cheaper. And from then to today, the AI competition has gotten insane, both on the research level and on the product level. It's just been accelerating.
Starting point is 00:17:04 Let's discuss all of this today, and maybe let's start with some spicy questions if we can. Who is winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. So Sebastian, who do you think is winning? So winning is a very broad, you know, term. I would say, you mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models
Starting point is 00:17:36 because they share these as open models. Winning, I think, has multiple timescales to it. We have today, we have next year, we have in 10 years. One thing I know for sure is that I don't think nowadays, in 2026, there will be any company who is, let's say, having access to a technology that no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs, they rotate.
Starting point is 00:18:05 So I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. So I don't think the ideas will be proprietary, but the way, or the resources, that are needed to implement them will be. And so I can't currently see a winner-take-all scenario. I can't see that at the moment. Nathan, what do you think? You see the labs put different energy into what they're trying to do. And I think, to demarcate the point in time when we're recording this, the hype over Anthropic's
Starting point is 00:18:42 Claude Opus 4.5 model has been absolutely insane. I mean, I've used it and built stuff with it in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype, and it's kind of funny because this is very organic. And then if we go back
Starting point is 00:18:58 a few months, to when Gemini 3 from Google got released, it seemed like the marketing and just the wow factor of that release were super high. But then at the end of November, Claude Opus 4.5 was released, and the hype has been growing. But Gemini 3 was before this, and it kind of feels like people don't really talk about it as much, even though when it came out, everybody was like, this is Gemini's moment
Starting point is 00:19:23 to retake kind of Google's structural advantages in AI. And Gemini 3 is a fantastic model, and I still use it; it's just that the differentiation is lower. And I agree with Sebastian, with what you're saying, that the idea space is very fluid. But culturally, Anthropic is known for betting very hard on code, and the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and kind of the culture of organizations, where Anthropic seems to at least be presenting as the least chaotic.
Starting point is 00:19:56 That's a bit of an advantage if they can keep doing that for a while. But on the other side of things, there's a lot of ominous technology from China, where there are way more labs than DeepSeek. So DeepSeek kicked off a movement within China, I'd say kind of similar to how ChatGPT kicked off a movement in the U.S. where everything had a chatbot. There are now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of Z.ai with their GLM models, MiniMax's models, Kimi from Moonshot, especially in the last few months, have shone more brightly. The new DeepSeek models are still very strong, but 2025 could be looked back on as a big narrative point, where DeepSeek came and kind of provided this platform for way more Chinese companies that are releasing these fantastic models to kind of have this new type of operation.
Starting point is 00:20:51 So these models from these Chinese companies are open weights. And depending on the trajectory, the business models that these American companies are pursuing could be at risk. But currently, a lot of people are paying for AI software in the U.S., and historically,
Starting point is 00:21:26 in China and other parts of the world, people don't pay a lot for software. So some of these models, like DeepSeek, have the love of the people because they are open-weight. How long do you think the Chinese companies keep releasing open-weight models? I would say for a few years. I think that, like in the U.S., there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it, so I get inbound from some of them. And they're smart and realize the same constraints, which is that a lot of U.S. tech companies and other IT companies won't pay for an API subscription to Chinese companies for security concerns. This has been a longstanding habit in tech. And the people at these companies then see open-weight models as an ability to influence and take part in a huge, growing AI expenditure market in the U.S. And they're very realistic about this. And it's working for them. And I think that the government will see that that is building a lot of influence internationally in terms of uptake of the technology. So there's going to be a lot of incentives to keep
Starting point is 00:21:55 it going. But building these models and doing the research is very expensive. So at some point I expect consolidation, but I don't expect that to be a story of 2026. There will be more open model builders throughout 2026 than there were in 2025, and a lot of the notable ones will be in China. You were going to say something? Yes, you mentioned DeepSeek losing its crown.
Starting point is 00:22:27 I do think so, to some extent, yes, but we also have to consider, though, they are still, I would say, slightly ahead. And with the other ones, it's not that DeepSeek got worse, it's just that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: same architecture, they're training it. And then again, we have this leapfrogging, where they might be at some
Starting point is 00:22:47 point in time a bit better, because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner. It will just be that one person releases something, the other one comes in, and the most recent model is probably always the best model. Yeah, we'll also see the
Starting point is 00:23:03 Chinese companies have different incentives. So, like, DeepSeek is very secretive, where some of these startups, like the MiniMaxes and Z.ais of the world, those two have literally filed IPO paperwork, and they're trying to get Western mindshare and do a lot of outreach there. So I don't know if these incentives will kind of change the model development, because DeepSeek famously is built by a hedge fund, High-Flyer Capital. And we don't know exactly what they, like, we don't know what they use the models for or if they care about this. They're secretive in terms of communication; they're not secretive in terms of the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Opus 4.5 hype,
Starting point is 00:23:43 there's the layer of something being the darling of the X, or Twitter, echo chamber, and then the actual amount of people that are using the model. I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives. And that user base is gigantic. So the hype about the coding may not be representative of the actual use. I would say also a lot of the usage patterns are,
Starting point is 00:24:15 like you said, name recognition, brand and stuff, but also muscle memory almost, where, you know, ChatGPT has been around for a long time. People just got used to using it, and it's almost like a flywheel. They recommend it to other users and all that stuff. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature, right? And so you may have a subscription
Starting point is 00:24:38 and you use it for personal stuff, but I don't know if you want to use that same thing at work, you know, because there's a boundary between private and work. If you're working at a company, they might not allow that, or you may not want that. And I think that's also an interesting point, where you might have multiple subscriptions. One is just clean: it keeps nothing of your personal images or hobby projects in there. It's just the work thing. And then the other one is your personal thing. So I think that's also something where there are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones. What model do you think won 2025, and what model do you think is going to win '26?
Starting point is 00:25:16 I think in the context of consumer chatbots, it's a question of, are you willing to bet on Gemini over ChatGPT? Which, I would say, in my gut feels like a bit of a risky bet, because OpenAI has been the incumbent, and there are so many benefits to that in tech. I think the momentum, if you look at 2025, was on Gemini's side, but they were starting from such a low point. I think, RIP Bard and these earlier attempts at getting started; I think huge credit to them for powering through the organizational chaos to make that happen. But also, it's hard to bet against OpenAI, because they always come through; they're very good at landing things. And I think, like, personally, I have very mixed
Starting point is 00:26:00 reviews of GPT-5, but it had to have saved them so much money, with the headline feature being a router where most users are no longer running up GPU costs as much. So I think it's very hard to dissociate the things that I like out of models versus the things that are going to actually be a general public differentiator. What do you think about 2026? Who's going to win? I'll say something, even though it's risky. I will say that I think Gemini will continue to make progress on ChatGPT. I think Google's scale, when both of these are operating at such extreme scales, and, like, Google has the ability to separate research and product a bit better, where you hear
Starting point is 00:26:40 so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. And then on the software and enterprise side, I think Anthropic will have continued success, as they've again and again been set up for that. And obviously Google's cloud has a lot of offerings, but I think this kind of Gemini name brand is important for them to build. And Google's cloud will continue to do well. But that's kind of a more complex thing to explain in the ecosystem, because that's competing with the likes of Azure and AWS rather than on the model provider side. So in infrastructure, you think Google's TPUs give an advantage? Largely because the margin on Nvidia chips is insane, and Google can develop everything from
Starting point is 00:27:24 top to bottom to fit their stack and not have to pay this margin. And they've had a head start in building data centers. So for all of these things that have both high lead times and very hard margins on high costs, Google just has kind of a historical advantage there. And if there's going to be a new paradigm, it's most likely to come from OpenAI, where their research division again and again has kind of shown this ability to land a new research idea or a product. I think, like, deep research, Sora, o1 thinking models,
Starting point is 00:27:53 like, all these definitional things have come from OpenAI, and that's got to be one of their top traits as an organization. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimizing what could be described
Starting point is 00:28:07 as low-hanging fruit in models. And clearly, there's a trade-off between intelligence and speed. This is what GPT-5 was trying to solve behind the scenes. It's like, do people, the broad public, actually want intelligence, or do they want speed? I think it's a nice variety, actually, or the option to have a toggle there.
Starting point is 00:28:29 I mean, first, for my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question, get the information, I want it fast. For, you know, most daily tasks, I use the quick model. Nowadays, I think the auto mode is pretty good, where you don't have to specifically say thinking or, you know, non-thinking and stuff. Then again, I also sometimes want the pro mode. Very often what I do is, when I have something written, I put it into ChatGPT and say, hey, do a very thorough check. Are all my references correct?
Starting point is 00:28:58 Are all my thoughts correct? Did I make any formatting mistakes? Are the figure numbers wrong, or something like that? And I don't need that right away. It's something where, okay, I finish my stuff, maybe have dinner, let it run, come back, and it goes through this. And see, this is where I think it's important to have this option. I would go crazy if for each query I would have to wait 30 minutes or 10 minutes,
Starting point is 00:29:18 that's me. I'm, like, sitting over here losing my mind that you use the router and the non-thinking model. I'm like, how do you live with that? That's my reaction. I've been heavily on ChatGPT for a while. Never touched 5 non-thinking.
Starting point is 00:29:36 I find it's tone and then its propensity of errors. It's just like a higher likelihood of errors. Some of this is from back when opening I released 03, which was the first model to do this deep search and find many sources. integrate them for you. So it became habituated with that. So I will only use GPT 5.2 thinking or pro when I'm finding any sort of information query for work, whether that's a paper or some code reference that I found. And it's just like I will regularly have like five pro queries
Starting point is 00:30:05 going simultaneously, each looking for one specific paper or feedback on an equation or something. I have a fun example of where I just needed the answer as fast as possible, for this podcast, before I was going on the trip. I have a local GPU running at home, and I wanted to run a long RL experiment. And usually I also unplug things, because you never know; if you're not at home, you don't want to have things plugged in.
Starting point is 00:30:28 And I accidentally unplugged the GPU. My wife was already in the car, and it's like, oh, dang. And then basically I wanted, as fast as possible, a bash script that runs my different experiments and the evaluation. And it's something I know; I learned how to use the bash interface, or bash terminal, but in that moment I just needed, like, 10 seconds, give me the command.
Starting point is 00:30:51 It's a hilarious situation, but yeah. So what did you use? So I did the non-thinking, fastest model. It gave me the bash command to chain the different scripts to each other. And then the thing is, you have the tee thing, where you want to route this to a log file. Top of my head, I was just in a hurry; I could have thought about it. By the way, I don't know if there's a more representative case: while someone's waiting in the car, you have to run, you unplug the GPU, you have to generate a bash script. It sounds like a movie, like Mission Impossible. I use Gemini for that. So I use thinking for all the information stuff, and then Gemini for fast things or stuff
Starting point is 00:31:24 that I could sometimes Google; it's good at explaining things. And I trust that it has this kind of background of knowledge, and it's simple. And the Gemini app has gotten a lot better, and it's good for that sort of thing. And then for code and any sort of philosophical discussion, I use Claude Opus 4.5, also always with extended thinking. Extended thinking and inference-time scaling is just a way to make the models marginally smarter, and I will always err on that side when the progress is very high, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for
Starting point is 00:31:55 real-time information, or finding something on AI Twitter that I knew I saw and need to dig up and just fixated on. Although when Grok 4 came out, Grok 4 Heavy, which was like their pro variant, was actually very good, and I was pretty impressed with it. And then it's just kind of muscle memory; I lost track of it with having the ChatGPT app open. So I use many different things. Yeah.
Starting point is 00:32:18 I actually do use Grok 4 Heavy for debugging, for like hardcore debugging when the other ones can't solve it. I find that it's the best at that. And it's interesting, because you say ChatGPT is the best interface. For me, for that same reason, but this could be just momentum, Gemini is the better interface. I think because I found it the best at needle-in-the-haystack.
Starting point is 00:32:45 If I ever put in something that has a lot of context, but I'm looking for very specific kinds of information and want to make sure it tracks all of it, I find that, at least for me, Gemini has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, for that particular query, that prompt, you're like, this model is better.
Starting point is 00:33:09 And so you'll just stick with it for a bit until it does something really dumb. There's like a threshold effect: some smart thing, and then you fall in love with it, and then it does some dumb thing, and you're like, you know what,
Starting point is 00:33:20 I'm going to switch and try Claude and ChatGPT and all that kind of stuff. This is exactly it: you use it until it breaks, until you have a problem, and then you change the LLM. And I think it's the same with how we use anything, like our favorite text editor,
Starting point is 00:33:34 operating systems, or the browser. I mean, there are so many browser options, Safari, Firefox, Chrome. They're all comparatively similar, but then there are edge cases, maybe extensions you want to use, and then you switch. But I don't think there is
Starting point is 00:33:47 anyone who types the same thing, like the same website, into different browsers and compares them. You only do that when the website doesn't render, if something breaks, I think. So that's a good point. I think you use it until it breaks and then you explore other options. On the long-context thing, I was also a Gemini user for this, but the
Starting point is 00:34:04 GPT-5.2 release blog had crazy long-context scores, where a lot of people were like, did they just figure out some algorithmic change? It went from like 30% to like 70% or something in this minor model update. So it's also very hard to keep track of all of these things. But now I look more favorably at GPT-5.2's long context. So it's just kind of like, how do I actually get to testing this? It's a never-ending battle. It's interesting that none of us talked about the Chinese models from a user usage perspective. What does that say? Does that mean the Chinese models are not as good,
Starting point is 00:34:39 or does that mean we're just very biased and U.S.-focused? I do think there's currently a discrepancy between just the model and the platform. So I think the open models, they are more known for the open weights, not their platform yet. There are also a lot of companies that are willing to sell you the open-model inference at very low cost. With OpenRouter, for example, it's easy to do the multi-model thing. You can run DeepSeek on Perplexity. I think all of us sitting here are like, we use OpenAI GPT-5 Pro consistently,
Starting point is 00:35:15 all willing to pay for the marginal intelligence gain, and all of the opinion that these models from the U.S. are better in terms of the outputs. I think the question is, will they stay better for this year and for years going forward? But so long as they're better, I'm going to pay to use them. I think there's also analysis showing that the way the Chinese models are served, and you could argue whether this is due to export controls or not, is that they use fewer GPUs per replica, which makes them slower and gives them different errors. And it's speed and intelligence: if these things are in your favor as a user, I think in the U.S., a lot of users will go for this. And I think that is a good thing that will spur these Chinese companies to want to compete in other ways, whether it's being free or substantially lower cost, or it'll breed
Starting point is 00:35:57 creativity in terms of offerings, which is good for the ecosystem. But I just think the simple thing is, the U.S. models are currently better and we use them. And I try these other open models. I'm like, fun, but I don't go back to it. We didn't really mention programming. That's another use case that a lot of people deeply care about. So I use basically half-and-half Cursor and Claude Code, because I find them to be fundamentally different experiences and both useful. What do you guys... you program quite a bit? So what do you use?
Starting point is 00:36:30 What's the current vibe? So I use the Codex plugin for VS Code. You know, it's very convenient. It's just a plug-in, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic. It touches more things.
Starting point is 00:36:44 It does a whole project for you. I'm not quite there yet where I'm comfortable with that, because maybe I'm a control freak, but I still would like to see a bit of what's going on. And Codex is kind of, right now for me, the sweet spot, where it is helping me, but it is not taking over completely. I should mention one of the reasons I do use Claude Code
Starting point is 00:37:04 is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the process of the generation of the code and looking at the diff, which you can in Cursor, if that's the IDE you use, and changing, altering, looking at and reading the code and understanding the code deeply as you progress, versus just kind of thinking in this design space
Starting point is 00:37:34 and just guiding it at this macro level, which I think is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilization of Claude Opus 4.5. It's a good side-by-side for people to do. So you can have Claude Code open, you can have Cursor open, and you can have VS Code open, and you can select the same models on all of them and ask questions.
Starting point is 00:37:59 It's very interesting. Like, Claude Code is way better in that domain. It's remarkable. All right. We should say that both of you are legit on multiple fronts: researchers, programmers, educators, tweeters, and on the book front too. So Nathan, at some point soon,
Starting point is 00:38:20 hopefully has an RLHF book coming out. It's available for pre-order, and there's a full digital pre-print; I'm just making it pretty and better organized for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in physical form when so much of our life is digital. I should say, going to Perplexity here,
Starting point is 00:38:39 Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple of them I wanted to mention: a book I highly recommend, Build a Large Language Model (From Scratch), and the new one, Build a Reasoning Model (From Scratch). So I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning.
Starting point is 00:39:00 Honestly, building an LLM from scratch is a lot of fun. It's also a lot to learn. And like you said, it's probably the best way to learn how something really works, because you can look at figures, but figures can have mistakes. You can look at concepts,
Starting point is 00:39:13 explanations, but you might misunderstand them. But if you see there is code and the code works, you know it's correct. I mean, there's no misunderstanding. It's precise. Otherwise it wouldn't work.
Starting point is 00:39:24 And I think that's kind of the beauty behind coding. It doesn't lie. It's math, basically. Though even with math, I think you can have mistakes in a book you would never notice,
Starting point is 00:39:35 because you're not running the math when you are reading the book; you can't verify this. And with code, what's nice is you can verify it. Yeah, I agree with you about the LLM from scratch book. It's nice to tune out everything else, the internet, and so on, and just focus on the book. But, you know, I've read several, like, you know, history books. It's just less lonely somehow.
Starting point is 00:39:58 It's really more fun. Like, for example, on the programming front, I think it's genuinely more fun to program with an LLM. And I think it's genuinely more fun to read with an LLM. But you're right, the distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. The rate of aha moments for me at a small scale
Starting point is 00:40:25 is really high with LLMs. 100%. I would say I also want to correct myself. I'm not suggesting not to use LLMs; I suggest doing it in multiple passes. Like, one pass, just offline focus mode, and then after that... I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. It's just, for me, more structured this way. And sometimes things are answered in the chapter, but sometimes also it just helps to
Starting point is 00:40:53 let it sink in and think about it. Other people have different preferences. I would highly recommend using LLMs when reading books. For me, it's just not the first thing to do; it's the second pass. By way of recommendation, I'd say I do the opposite. I like to use the LLM at the beginning to lay out the full context of, like, what is this world that I'm now stepping into? But I try to avoid clicking out of the LLM into the world of, like, Twitter and blogs, because then you're down this rabbit hole, you're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden you're no longer... you're now in the realm of the internet and Reddit and so on.
Starting point is 00:41:32 But if you're purely letting the LLM give you the context of why this matters, what the big-picture ideas are... Sometimes books themselves are good at doing that, but not always. This is why I like the ChatGPT app that gives the AI a home on your computer, where you can focus on it rather than it just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product design to be an interface where your AI will then go out into the world.
Starting point is 00:42:06 And it's something that is very kind of intangible between it and Codex: Claude Code just feels kind of warm and engaging, where Codex can often be as good, from OpenAI, but it just kind of feels a little bit rougher on the edges, whereas Claude Code makes it fun to build things, particularly from scratch, where you just don't... like, you don't have to care, but you trust that it'll make something.
Starting point is 00:42:26 Like, obviously it's good for websites and kind of refreshing tooling and stuff like this, which I use it for, or data analysis. So for my blog, we scrape Hugging Face. We keep the download numbers for every dataset and model over time now. So we have them. And it's like, Claude was just like, yeah, I've made use of that data, no problem. And I was like, that would have taken me days. And then I have enough situational awareness to be like, okay, these trends obviously make
Starting point is 00:42:49 sense, and you can check things. Because that's just the kind of wonderful interface where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects and do this stuff. All right, so we just talked about a bunch of the closed-weight models. Let's talk about the open ones. So tell me about the landscape of open LLMs: which are the interesting ones, which stand out to you, and why?
Starting point is 00:43:16 We already mentioned DeepSeek. Do you want to see how many we can name off the top of our heads? Yeah, without looking at notes. DeepSeek, Kimi, MiniMax, Z.ai, Ant Ling. We're just going Chinese. Let's throw in Mistral AI, Gemma. Yeah, GPT-OSS,
Starting point is 00:43:35 the open-weight model by OpenAI. Actually, NVIDIA Nemotron had a... or NVIDIA had a very cool one, Nemotron 3. There's a lot of stuff, especially at the end of the year. Qwen, maybe the one. Oh yeah,
Starting point is 00:43:45 Qwen was the obvious name. I was trying to get through the... you can get at least 10 Chinese and at least 10 Western. I think that... I mean, OpenAI released their first open model since GPT-2. When I was writing about OpenAI's open model release, they were all like, don't forget about GPT-2, which I thought was really funny, because it's just such a different time. But GPT-OSS is actually a very strong model and does some things that the other models don't do very well. And I think that, selfishly, I'll promote a bunch of Western companies. So both the US and Europe have these fully open models. So I work at the Allen Institute for AI. We've been building Olmo, which releases data and code and all of this. And now we have actual companies
Starting point is 00:44:26 for people that are trying to release everything so that other people can train these models. So there's the Institute for Foundation Models, slash LLM360, which has had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face has SmolLM, which is very popular, and NVIDIA's Nemotron has started releasing data as well. And then there's Stanford's Marin community project, which is kind of making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language-modeling stack. So this space... that list was way smaller in 2024. I think it was just AI2.
Starting point is 00:45:05 So that's a great thing, for more people to get involved and to understand language models, and it doesn't really have an analog at any Chinese company. While I'm talking, I'll say that the Chinese open language models tend to be much bigger, and that gives them this higher peak performance as MoEs, where a lot of these things that we like a lot, whether it was Gemma or Nemotron, have tended to be smaller models from the U.S., which is starting to change in the U.S. and Europe. Mistral Large 3 came out in December, which was a giant MoE model, very similar to the DeepSeek architecture. And then a startup, Arcee AI, and also Nemotron, and Nemotron is NVIDIA, have
Starting point is 00:45:44 teased MoE models way bigger than 100 billion parameters, like this 400-billion-parameter range, coming in this Q1 2026 timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus U.S. open models for, which I'm personally very excited to watch. First of all, huge props for being able to name some major things. Did you actually name Llama? No. I feel like RIP.
Starting point is 00:46:13 This was not on purpose. RIP Llama. All right. Can you mention what are some interesting models that stand out? So you mentioned Qwen 3, which is obviously a standout. So I would say the year was almost bookended by DeepSeek V3 and R1 on one hand, and then on the other hand, in December, DeepSeek V3.2. Because what I like about
Starting point is 00:46:33 those is they always have an interesting architecture tweak that others don't have. But otherwise, if you want to go with, you know, the familiar but really good performance: Qwen 3 and, like Nathan said, also GPT-OSS. And what's interesting about GPT-OSS is that it's kind of the first public, or open-weight, model that was really trained with tool use in mind, which I do think is kind of a little bit of a paradigm shift, where the ecosystem was not quite ready for it.
Starting point is 00:47:00 By tool use, I mean that the LLM is able to do a web search, to call a Python interpreter. And I do think it's a standout because it's a huge unlock, because one of the most common complaints about LLMs is, for example, hallucinations, right? And so in my opinion, one of the best ways to solve hallucinations is to not try to always remember information
Starting point is 00:47:22 or make things up. For math, why not use a calculator app or Python?
Starting point is 00:47:30 If I asked the LLM who won the soccer World Cup in 1998, instead of just trying to memorize it, it could go do a search. I think mostly it's usually still
Starting point is 00:47:37 Google search. So ChatGPT, GPT-OSS, they would do a tool call to Google, maybe find the FIFA website, and find, okay, it was France.
Starting point is 00:47:46 It would get you that information reliably, instead of just trying to memorize it. So I think it's a huge unlock, which right now is not fully utilized yet by the open-source, open-weight ecosystem.
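To make that tool-use loop concrete, here is a minimal sketch. The model interface (`call_model`) and the `web_search` tool are made-up stand-ins, not the actual ChatGPT or GPT-OSS APIs; the point is only the control flow: the model emits a tool call, the host runs the tool, and the result goes back into the conversation so the model can answer from it instead of from memory.

```python
def web_search(query: str) -> str:
    # Stand-in for a real search backend (hardcoded for illustration).
    facts = {"soccer world cup 1998 winner": "France won the 1998 FIFA World Cup."}
    return facts.get(query.lower(), "No results found.")

TOOLS = {"web_search": web_search}

def call_model(messages):
    # Stand-in for an LLM call. A tool-trained model decides here whether
    # to answer directly or to request a tool invocation.
    last = messages[-1]
    if last["role"] == "user":
        return {"tool": "web_search",
                "arguments": {"query": "soccer world cup 1998 winner"}}
    return {"answer": f"Based on the search result: {last['content']}"}

def run(user_question: str) -> str:
    messages = [{"role": "user", "content": user_question}]
    while True:
        reply = call_model(messages)
        if "tool" in reply:                      # model asked for a tool
            result = TOOLS[reply["tool"]](**reply["arguments"])
            messages.append({"role": "tool", "content": result})
        else:                                    # model gave a final answer
            return reply["answer"]

print(run("Who won the soccer World Cup in 1998?"))
```

In a real deployment the tool call would run sandboxed or containerized, and the message format would follow whichever provider API is in use.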
Starting point is 00:47:57 A lot of people don't use tool-call modes because, first, it's a trust thing. You don't want to run this on your computer where it has access to tools; it could wipe your hard drive or whatever. So you want to maybe containerize that. But I do think, you know, that is a really important step for the upcoming years, to have this ability. So a few quick things. First of all, thank you for defining what you mean by
Starting point is 00:48:22 tool use. I think that's a great thing to do in general for the concepts we're talking about. Even things as well established as MoEs: you have to say that means mixture of experts, and you kind of have to build up an intuition for people of what that means, how it's actually utilized, what the different flavors are. So what does it mean that there's such an explosion of open models? What's your intuition? If you're releasing an open model, you want people to use it, as the first and foremost thing. And then after that come things like transparency
Starting point is 00:49:02 and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models. And if you look outside of the U.S., a lot of people will not pay for software, but they might have computing resources where you can put a model on them and run it. There can also be data that you don't want to send to the cloud. So the number one thing is getting people to use models, use AI, or use your AI, who might not be able to do it without having access to the model. I guess we should state this explicitly. So we've been talking about these Chinese models and open-weight models; oftentimes the way they're run is locally. So it's not like you're sending your data to China, or to Silicon Valley, or whoever developed the model. A lot of American startups make money by hosting these models from China and selling them,
Starting point is 00:49:41 selling tokens, it's called, which means somebody will call the model to do some piece of work. I think the other reason is, for U.S. companies: OpenAI is so GPU-deprived. They're at the limits of their GPUs. Whenever they make a release, they're always talking about how our GPUs are hurting. And in one of these GPT-OSS release sessions, Sam Altman said something like, oh, we're releasing this because we can use your GPUs; we don't have to use our GPUs.
Starting point is 00:50:09 And OpenAI can still get distribution out of this, which is another very real thing. It doesn't cost them anything, though. And for the user, I think also... I mean, there are users who just use the model locally, how they would use ChatGPT. But also for companies, I think it's a huge unlock to have these models, because you can customize them, you can train them, you can add post-training,
Starting point is 00:50:30 add more data, specialize them into, let's say, law models, medical models, whatever you have. And the appeal, you mentioned Llama, the appeal of the open-weight models from China is that the licenses are even friendlier. I think they are just unrestricted open-source licenses, whereas if we use something like Llama or Gemma,
Starting point is 00:50:49 there are some strings attached. I think there's an upper limit in terms of how many users you can have. And if you exceed, I don't know, so many million users, you have to report your financial situation to, let's say, Meta or something like that.
Starting point is 00:51:01 And while it is a free model, there are strings attached, and people do like things where strings are not attached. So I think that's also one of the reasons, besides performance, why the open-weight models from China are so popular: you can just use them. There's no catch in that sense.
Starting point is 00:51:19 The ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulled up Perplexity. It said Kimi K2 Thinking, hosted in the U.S. I've never seen this, but it's an exact example of what we're talking about, where people are sensitive to this. But Kimi K2 Thinking and Kimi K2 are models that are very popular. People say they have very good, like, creative writing, and are also good at doing some software things. There are just these little quirks that people pick up on with different models that they like. What are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you? Maybe you can go chronologically. I mean, there was, of course, DeepSeek R1 that came out in January, if we just focus on 2025.
Starting point is 00:52:02 However, this was based on DeepSeek V3, which came out the year before, in December 2024. There are multiple things on the architecture side. What is fascinating is you can still... I mean, that's what I do in my from-scratch coding projects: you can still start with GPT-2, and you can add things to that model to make it into this other model. So it's all still kind of the same lineage; there is a very close relationship between those.
Starting point is 00:52:26 But top of my head, DeepSeek, what was unique there is the mixture of experts. I mean, they were not inventing mixture of experts. We can maybe talk a bit more about what mixture of experts means. But just to list these things first before we dive into detail: mixture of experts, but then they also had multi-head latent attention,
Starting point is 00:52:43 which is a tweak to the attention mechanism. This was, I would say, the main distinguishing factor between these open-weight models in 2025: different tweaks to make inference, or the KV cache size... we can also define KV cache in a few moments... but to kind of make it more economical to have long context,
Starting point is 00:53:05 to shrink the KV cache size. So what are the tweaks that we can do? Most of them focused on the attention mechanism. There is multi-head latent attention in DeepSeek. There is grouped-query attention, which is still very popular. It's not invented by any of those models.
Starting point is 00:53:19 It goes back a few years, but that would be the other option. Sliding window attention, I think Olmo uses it, if I remember correctly. So there are these different tweaks that make the models different. Otherwise, I put them all together in an article once
Starting point is 00:53:34 where I just compare them. They are surprisingly similar. It's just different numbers in terms of how many repetitions of the transformer block you have, and just little knobs that people tune. But what's so nice about it is it works no matter what; you can tweak things.
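One of those tweaks, grouped-query attention, fits in a few lines: several query heads share one key/value head, so the KV cache holds fewer entries per token. A minimal sketch with made-up sizes and random weights, not taken from any particular model:

```python
import numpy as np

# Grouped-query attention (GQA) head bookkeeping: 8 query heads share
# 2 key/value heads, so the KV cache is 4x smaller than with standard
# multi-head attention. All numbers are illustrative.
rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 16, 10
group = n_q_heads // n_kv_heads          # 4 query heads per KV head

q = rng.normal(size=(n_q_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))
v = rng.normal(size=(n_kv_heads, seq, head_dim))

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                      # which shared KV head to use
    scores = q[h] @ k[kv].T / np.sqrt(head_dim)
    # causal mask: token i may only attend to tokens <= i
    scores = np.where(np.tril(np.ones((seq, seq), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    out[h] = (w / w.sum(axis=-1, keepdims=True)) @ v[kv]

# the KV cache stores n_kv_heads entries per token instead of n_q_heads
print(out.shape)  # (8, 10, 16)
```

With multi-head latent attention, the saving instead comes from caching a compressed latent rather than full keys and values.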
Starting point is 00:53:52 You can move the normalization layers around. You get some performance gains. And Olmo is always very good with ablation studies, showing what it actually does to the model if you move something around, whether it makes it better or worse. But there are so many, let's say, ways you can implement a transformer and make it still work. Big ideas that are still prevalent are mixture of experts,
Starting point is 00:54:17 multi-head latent attention, sliding window attention, grouped-query attention. And then at the end of the year, we saw a focus on making the attention mechanism scale linearly for inference token prediction. So there was Qwen3-Next, for example, which added a gated DeltaNet. It's kind of inspired by state-space models, where you have a fixed state that you keep updating, but it essentially makes this attention cheaper, or replaces attention with a cheaper operation. And maybe it's useful to step back and talk about the transformer architecture in general?
Starting point is 00:54:46 Yeah, so maybe we should start with the GPT-2 architecture, the transformer that was derived from the Attention Is All You Need paper. So the Attention Is All You Need paper had a transformer architecture that had two parts, an encoder and a decoder, and GPT went on to focus just on the decoder part.
Starting point is 00:55:05 It is essentially still a neural network, and it has this attention mechanism inside, and you predict one token at a time. You pass the input through an embedding layer; there's the transformer block. The transformer block has attention modules and a fully connected layer, and there are some normalization layers in between,
Starting point is 00:55:23 but it's essentially neural network layers with this attention mechanism. So coming from GPT-2, when we move on to GPT-OSS, there is, for example, the mixture-of-experts layer. It's not invented by GPT-OSS; it's a few years old. But it is essentially a tweak to make the model larger without consuming more compute in each forward pass.
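Putting those two descriptions together, a toy decoder block with the fully connected part swapped for a sparse mixture-of-experts layer might look as follows. All sizes and weights are made up; a real model would add multi-head attention, learned routing with load balancing, and so on.

```python
import numpy as np

# Toy decoder block: attention sub-layer + MoE feed-forward sub-layer,
# each with normalization and a residual connection. A router picks one
# small "expert" MLP per token, so only a fraction of the weights runs
# in each forward pass. Everything here is illustrative.
rng = np.random.default_rng(0)
d, seq, n_experts = 32, 6, 4

def causal_attention(x):
    # single-head self-attention with random projection weights
    Wq, Wk, Wv = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((seq, seq), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

# each expert is a small two-layer feed-forward network
experts = [(rng.normal(size=(d, 4 * d)) / np.sqrt(d),
            rng.normal(size=(4 * d, d)) / np.sqrt(4 * d))
           for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_feed_forward(x):
    out = np.zeros_like(x)
    choice = (x @ router).argmax(-1)             # top-1 expert per token
    for t, e in enumerate(choice):
        W1, W2 = experts[e]
        out[t] = np.maximum(x[t] @ W1, 0) @ W2   # ReLU MLP
    return out

def norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_block(x):
    x = x + causal_attention(norm(x))   # attention sub-layer + residual
    x = x + moe_feed_forward(norm(x))   # sparse MoE sub-layer + residual
    return x

x = rng.normal(size=(seq, d))           # stand-in for embedded tokens
print(transformer_block(x).shape)  # (6, 32)
```

Per token, only one of the four expert MLPs runs, which is the "larger model, same per-token compute" trade described here.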
Starting point is 00:55:46 So there is this fully connected layer. And if listeners are familiar with multilayer perceptrons, you can think of a mini multilayer perceptron, a fully connected neural network layer, inside the transformer. And it's very expensive because it's fully connected: if you have 1,000 inputs and 1,000 outputs, that's like 1 million connections. It's a very expensive part of this transformer, and the idea is to kind of expand that into multiple feed-forward networks. So instead of having one, let's say you have 256,
Starting point is 00:56:16 which would make it way more expensive, because now you have 256, but you don't use all of them at the same time. So you now have a router that says, okay, based on this input token, it would be useful to use this fully connected network. And in that context, it's called an expert. So a mixture of experts means you have multiple
Starting point is 00:56:39 experts. And depending on what your input is, let's say it's more math-heavy, it would use different experts compared to, let's say, translating input text from English to Spanish, where it would maybe consult different experts. It's not quite clear-cut to say, okay, this is only an expert for math and this one for Spanish; it's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time. That would be very wasteful. So you're kind of more selective during token
Starting point is 00:57:19 generation. There's a router that selects which tokens should go to which expert. It's more complexity. It's harder to train. There's a lot that can go wrong, like router collapse and everything. So I think that's why Olmo 3 still uses dense. I mean, there are Olmo models with mixture of experts, but also dense models, where dense... that's also jargon. There's a distinction between dense and sparse. Mixture of experts is considered sparse, because we have a lot of experts but only a few of them are active. So that's called sparse. And then dense would be the opposite, where you only have one fully connected module and it's always utilized. So maybe this is a good place
Starting point is 00:57:42 to also talk about KV Cash, but actually before that, even zooming out, like fundamentally, how many new ideas have been implemented from GPD 2 to today? Like, how different really are these architectures? Picture, like the mixture of experts,
Starting point is 00:58:00 the attention mechanism in GPDOS, that would be the group query attention mechanism. So it's a slight tweak for multi-head attention to group query attention. So there we have two. I think they replaced
Starting point is 00:58:10 a layer norm by RMS norm, but it's just like a different normalization layer. Not a big change. It's just like a tweak. The non-linear activation function, people familiar with deep new networks, I mean, it's the same as
Starting point is 00:58:23 changing sigmoid with Ralu. It's not changing the network fundamentally. It's just like a tweak. A little, little tweak. And that's about it, I would say. It's not really fundamentally that different. It's still the same
Starting point is 00:58:35 architecture. So you can convert one — you can go from one into the other by just adding these changes, basically. It's fundamentally still the same architecture. Yep. So for example, you mentioned my book earlier — that's a GPT-2 model in the book because it's simple and it's very small. So
Starting point is 00:58:51 124 million parameters, approximately. But in the bonus materials, I do have Olmo 3 from scratch, Gemma 3 from scratch, and other types of from-scratch models. And I always started with my GPT-2 model and just, you know, tweaked or added different components, and you get from one to the other.
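The sparse mixture-of-experts routing described a few minutes earlier — a router scores the experts and only the top-k run per token — can be sketched in a few lines. This is a toy illustration, not any particular model's implementation; all dimensions, expert counts, and the top-k choice are made up for the example.

```python
import math
import random

random.seed(0)
num_experts, top_k, d_model = 8, 2, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

tokens = rand_matrix(3, d_model)                       # a batch of 3 token vectors
router = rand_matrix(num_experts, d_model)             # router: one score per expert
experts = [rand_matrix(d_model, d_model) for _ in range(num_experts)]

outputs = []
for tok in tokens:
    scores = matvec(router, tok)                       # score every expert for this token
    top = sorted(range(num_experts), key=lambda e: scores[e])[-top_k:]
    m = max(scores[e] for e in top)
    exps = [math.exp(scores[e] - m) for e in top]
    total = sum(exps)
    gates = [x / total for x in exps]                  # softmax over the selected experts only
    out = [0.0] * d_model
    for g, e in zip(gates, top):
        for i, val in enumerate(matvec(experts[e], tok)):
            out[i] += g * val                          # only 2 of the 8 experts actually compute
    outputs.append(out)

print(len(outputs), len(outputs[0]))                   # 3 4: dense-sized output, sparse compute
```

The knowledge lives in all eight expert matrices, but each token only pays for two of them — which is exactly the "pack more knowledge in without using it all the time" idea from the conversation.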
Starting point is 00:59:07 It's kind of like a lineage in a sense, yeah. Can you build up an intuition for people? Because sort of when you zoom out, you look at it, there's so much rapid advancement in the AI world, and at the same time, fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening? Where are the gains to be had? So there are the different stages where you develop the network or train the network.
Starting point is 00:59:38 You have the pre-training. Now, back then, it was just pre-training with GPT-2. Now you have pre-training, mid-training, and post-training. So I think right now we are in the post-training-focused stage. I mean, pre-training still gives you advantages if you scale it up to better, higher-quality data. But then we have capability unlocks that were not there with GPT-2. For example, ChatGPT is basically a GPT-3 model, and GPT-3 is the same as GPT-2 in terms of architecture.
Starting point is 01:00:09 What was new was adding the supervised fine-tuning and the reinforcement learning from human feedback. So it's more on the algorithmic side rather than the architecture. I would say that the systems also change a lot. I think if you listen to Nvidia's announcements, they talk about these things like, you can now do FP8, you can now do FP4. And what is happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, and that lets them put more data in.
Starting point is 01:00:34 And then you can find better configurations faster by doing this. So you can look at, essentially, the tokens per second per GPU — that's a metric that you look at when you're doing large-scale training. And you can go from like 10K to 13K by turning on FP8 training, which means they're using less memory per parameter in the model. And by storing less information, you do less communication. You can train faster. So all of these systems things underpin way faster experimentation on data and algorithms. It's this kind of loop that keeps going, where it's kind of hard to describe when you look at the architectures and they're exactly the same,
Starting point is 01:01:13 but the codebases used to train these models are going to be vastly different. And — the GPUs are different, but you probably train gpt-oss 20B way faster in wall-clock time than GPT-2 was trained at the time. Yeah, like you said, they had, for example, in the mixture of experts, this FP4 optimization, where you get more throughput. But I do think this is for the speed. This is true, but it doesn't give the model new capabilities in a sense. It's just: how much can we make the computation coarser without suffering in terms of model performance degradation?
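The idea of "making the computation coarser" can be shown with a toy fake-quantizer: snap the weights to a low-precision grid and measure what that costs in accuracy. Real FP8/FP4 training formats are block-scaled floating-point and considerably more involved — this sketch, with made-up weight values, only shows the underlying trade of precision for memory and speed.

```python
# Hypothetical weight values, invented for the example
weights = [0.83, -1.27, 0.05, 2.41, -0.66, 1.93, -2.05, 0.31]

def fake_quantize(xs, bits):
    # Symmetric uniform quantization to 2**bits levels — a crude stand-in
    # for real FP8/FP4 formats, which use block-scaled floating point.
    scale = max(abs(x) for x in xs) / (2 ** (bits - 1) - 1)
    return [round(x / scale) * scale for x in xs]

def mean_error(xs, bits):
    return sum(abs(x - q) for x, q in zip(xs, fake_quantize(xs, bits))) / len(xs)

err8 = mean_error(weights, 8)
err4 = mean_error(weights, 4)
print(err4 > err8)  # fewer bits: more rounding error, but half the memory per value
```

The speakers' point is that this knob buys throughput, not capability: the lab's question is how far the error can grow before benchmark performance degrades.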
Starting point is 01:01:49 But I do think, I mean, there are alternatives popping up to the transformer. There's text diffusion models — a completely different paradigm. Though, I mean, text diffusion models might use transformer architectures, but it's not an auto-regressive transformer. And also Mamba models — that's a state-space model. But they do have trade-offs. And what's true is there's nothing that has replaced the auto-regressive transformer as the state-of-the-art model.
Starting point is 01:02:15 So like for state-of-the-art, you would still go with that thing. But there are now alternatives for the cheaper end — alternatives that are kind of making compromises — but it's not just one architecture anymore. There are little ones coming up. But if we talk about the state of the art, it's pretty much still the transformer architecture, auto-regressive, derived from GPT-2, essentially. I guess the big question here is we talked quite a bit here
Starting point is 01:02:41 on the architecture behind the pre-training. Are the scaling laws holding strong across pre-training, post-training, inference, context size, data, synthetic data? I'd like to start with the technical definition of a scaling law, which kind of informs all of this. The scaling law is a power-law relationship where you can think of the x-axis — so kind of what you are scaling — as a combination of compute and data, which are kind of similar. And then the y-axis is like the held-out prediction accuracy on next tokens.
Starting point is 01:03:12 We talk about models being auto-regressive. It's like, if you keep a set of text that the model has not seen, how accurate does it get as you train? And the idea of scaling laws came when people figured out that that was a very predictable relationship. And I think that that technical trend is continuing. And then the question is, what do users get out of it? And then there are more types of scaling, where OpenAI's o1 was famous for introducing inference-time scaling, and, I think less famously, for also showing that you can scale reinforcement learning training and get kind of this log x-axis and then a linear increase in performance on the y-axis. So there's kind of these three axes now where the traditional scaling laws are talked about for pre-training,
Starting point is 01:03:55 which is how big your model is and how big your data set is. And then scaling reinforcement learning, which is like, how long can you do this trial-and-error learning — we'll define more of this. And then this inference-time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish — they're all really still working, but the low-hanging fruit has mostly been taken,
Starting point is 01:04:16 especially in the last year, on reinforcement learning with verifiable rewards, which is this RLVR, and then inference-time scaling, which is why these models are so different to use — where previously you would get that first token immediately, and now they'll go off for seconds, minutes, or even hours, generating these hidden thoughts before giving you the first word of your answer. And that's all about this inference-time scaling, which is such a wonderful kind of step function in terms of how the models' abilities changed.
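The power-law relationship Nathan defines — held-out loss falling predictably as compute grows — can be made concrete with a toy example. The coefficients below are invented for illustration (they are not from any published scaling-law paper); the point is the shape: on log-log axes the curve is a straight line, which is what makes scaling so predictable.

```python
import math

# Assumed (made-up) power law: loss(C) = a * C**(-b) + E,
# where C is training compute and E is an irreducible loss floor.
a, b, E = 10.0, 0.05, 1.7
compute = [10.0 ** p for p in range(18, 26)]   # FLOPs spanning 7 orders of magnitude
loss = [a * c ** (-b) + E for c in compute]    # loss falls as compute grows

# On log-log axes this is a straight line, so two points recover the exponent.
slope = (math.log(loss[-1] - E) - math.log(loss[0] - E)) / (
    math.log(compute[-1]) - math.log(compute[0])
)
print(round(-slope, 3))  # 0.05, the assumed exponent
```

Note the log x-axis and diminishing returns: each constant improvement in loss requires multiplying compute, which is why "13 orders of magnitude" comes up later in the conversation.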
Starting point is 01:04:44 They kind of enabled this tool-use stuff and enabled this much better software engineering that we were talking about. And this is — when we say enabled — almost entirely downstream of the fact that this reinforcement learning with verifiable rewards training just kind of let the models pick up these skills very easily. So, let the models learn. So if you look at the reasoning process, when the models are generating a lot of tokens,
Starting point is 01:05:07 what it will often be doing is: it tries a tool, it looks at what it gets back, it tries another API, it sees what it gets back, and it sees if it solves the problem. So the models, when you're training them, very quickly learn to do this, and then at the end of the day, that gives this kind of general foundation, where the model can use CLI commands very nicely,
Starting point is 01:05:25 navigate your repo and handle Git for you and move things around and organize things, or search to find more information — which, if we were sitting in these chairs a year ago, is something that we didn't really think of the models doing. So this is just kind of something that has happened this year and has totally transformed how we think of using AI, which I think is very magical. It's such an interesting evolution, and it just unlocked so much value. But it's not clear what the next avenue will be in terms of unlocking stuff like this. We'll get to continual learning later, but there's a lot of buzz around certain areas of AI, and no one knows when the next step function will really come. So you, you've actually said quite a lot of things there
Starting point is 01:06:07 and said profound things quickly. It would be nice to unpack them a little bit. You say you're bullish, basically, on every version of scaling. So let's just even start at the beginning: pre-training. Are we kind of implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or are you still bullish even on pre-training? Pre-training has gotten extremely expensive. I think to scale up pre-training, it's also implying that you're going to serve a very large model to the users. So I think that it's been loosely established that the likes of GPT-4 and similar models were around one trillion — like, this order of a trillion parameters at the biggest size.
Starting point is 01:06:52 There's a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionally. These models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users. I think DeepSeek had this famous number of about $5 million for pre-training at cloud market rates.
Starting point is 01:07:13 I think Olmo 3 — Section 2.4 in the paper — we detailed how long we had the GPU clusters sitting around for training, which includes engineering issues, multiple seeds, and it was about $2 million to rent the cluster and, like, deal with all the problems and headaches of training a model. So a lot of people could get $1 to $10 million to train a model, but the recurring cost of serving millions of users is really billions of dollars of compute. I think that you can look at, like, a thousand-GPU rental. You can pay $100,000 a day for,
Starting point is 01:07:47 and these companies could have millions of GPUs. You can look at how much these things cost to sit around. So that's kind of a big thing. And then it's like, if scaling is actually giving you a better model, is it going to be financially worth it? And I think it will kind of slowly push out as AI solves more compelling tasks — like the likes of Claude Opus 4.5 making Claude Code just work for things. I launched this project called the ATOM Project,
Starting point is 01:08:12 which is, like, American Truly Open Models, in July. And that was, like, a truly vibe-coded website. And, like, I have a job — it makes plots and stuff. And then I came back to refresh it in the last few weeks, and Claude Opus 4.5, versus whatever model it was at the time, just crushed all the issues that it had from the building in June and July. And, like, it might be a bigger model — there's a lot of things that go into this. But there's still progress coming.
Starting point is 01:08:37 So what you're speaking to is the nuance of the y-axis of the scaling laws — that the way it's experienced versus on a benchmark, the actual intelligence, might be different. But still, your intuition about pre-training: if you scale the size of compute, will the models get better? Not whether it's financially viable, but just from the law aspect of it, do you think the models will get smarter? Yeah. And this sometimes comes off as, like, almost disillusioned when people in leadership at AI companies say this, but they're like: it's held for 13 orders of magnitude of compute — I mean, like, why would it ever end? I think fundamentally it is pretty unlikely to stop. It's just, like, eventually we're not even going to be able to test the bigger scales because of all the problems that come with more
Starting point is 01:09:23 compute. I think that there's a lot of talk on how 2026 is the year when very large Blackwell compute clusters — like, gigawatt-scale facilities at hyperscalers — are coming online. And these were all contracts for power and data centers that were signed and sought out in, like, 2022 and 2023. So before or right after ChatGPT. So it took this
Starting point is 01:09:47 two to three year lead time to build these bigger clusters to train the models. Well, there's obviously immense interest in building even more data centers than that. So that is like kind of
Starting point is 01:09:55 the crux of what people are saying. It's like, these new clusters are coming. The labs are going to have more compute for training. They're going to utilize this. But it's not a given. And it's like, I've seen so much progress that I expect it.
Starting point is 01:10:07 And I expect a little bit bigger models. And, I would say, it's more like we will see a $2,000 subscription this year. We've seen $200 subscriptions. It's like, that can 10x again. And these are the kinds of things that could come. And they're all downstream of this bit bigger model that offers just a little bit more cutting edge.
Starting point is 01:10:26 So, you know, it's reported that xAI is going to hit that one-gigawatt scale in early '26 and a full two gigawatts by year end. How do you think they'll utilize that in the context of scaling laws? Is a lot of that inference? Is a lot of that training? It ends up being all of the above. So I think that all of your decisions when you're training a model come back to pre-training.
Starting point is 01:10:53 So if you're going to scale RL in a model, you still need to decide on your architecture that enables this. We're talking about other architectures that use different types of attention. We're also talking about mixture-of-experts models. The sparse nature of MoE models makes it much more efficient
Starting point is 01:11:08 to do generation, which becomes a big part of post-training. And it's like, you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training. Because you can still make a model better, you still want to go and revisit this. You still want the best base model that you can get.
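Why the sparse nature of MoE makes generation cheaper comes down to simple arithmetic: per-token compute tracks the *active* parameters, not the total. Every count below is invented purely for illustration — no real model's configuration is implied.

```python
# Hypothetical MoE configuration (numbers made up for the example)
total_experts = 64
active_experts = 4            # experts the router actually picks per token
params_per_expert = 500e6
shared_params = 2e9           # attention layers, embeddings, etc., always active

total_params = shared_params + total_experts * params_per_expert
active_params = shared_params + active_experts * params_per_expert

# A forward pass touches only the active parameters, so generation cost
# scales with the smaller number, while the model can still store what
# the full parameter count can hold.
print(total_params / 1e9, active_params / 1e9)  # 34.0 4.0 (billions)
```

During RL post-training, the actors spend most of their time generating completions, so this roughly 8x gap in per-token compute is exactly where the architecture choice pays off.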
Starting point is 01:11:28 And in a few years, that will saturate, and the RL compute will just go on longer. Are there people who disagree with you, that say, basically, pre-training is dead? It's all about scaling inference, scaling post-training, scaling context, continual learning, scaling data, synthetic data.
Starting point is 01:11:47 People vibe that way and describe it in that way, but I think it's not the practice that is happening. It's just the general vibe of people saying this thing is dead. The excitement is elsewhere. So the low-hanging fruit in RL is elsewhere. Like, for example, we released our model in November — every company has deadlines; our deadline was, like, November 20th. And for that, our RL run was five days,
Starting point is 01:12:06 which, compared to 2024, is a very long time to just be doing post-training at a model of, like, 30 billion parameters. It's not a big model. And then in December, we had another release, which was just — we let the RL run go for another three and a half weeks, and the model got notably better, so we released it. And, like, that's a big amount of time to just allocate to something that is going to be your peak for the year. So it's like, there's these types of decisions that happen when they're training a model, where they just, like, can't leave it forever. You have to keep pulling in the improvements you have from your researchers. So that's like, you redo pre-training, you'll do this post-training for a month, but then you
Starting point is 01:12:45 need to give it to your users. You need to do safety testing. So it's kind of just like, I think there's a lot in place that reinforces this cycle of just keep updating the models. There's things to improve. You get a new compute cluster that lets you do something maybe more stable or faster. It's like, you hear a lot about Blackwell having rollout issues. At AI2, most of the models we're pre-training are on, like, 1,000 to 2,000 GPUs. But when you're pre-training
Starting point is 01:13:15 GPU run is like, you're pretty much guaranteed to always have at least one GPU that is down. And you need to have your training code handle that redundancy, which is just a very different problem. Whereas like, what we're doing, like, I'm playing with post-training on DJX, Spark, or you have your book, it's like, or people learning ML, it's like, what they're
Starting point is 01:13:31 battling to train these biggest models is just, like, massive distributed scale. And it's very different. But that's somewhat different than — like, that's a systems problem in order to enable the scaling laws, especially at pre-training. You need all of these GPUs at once.
Starting point is 01:13:49 When we shift to reinforcement learning, it actually lends itself to heterogeneous compute because you have many copies of the model. And to do a primer on language model reinforcement learning: what you're doing is, you have two sets of GPUs. One you can call the actor, and one you call the learner. The learner is where your actual
Starting point is 01:14:09 reinforcement learning updates are going to happen. These are traditionally policy-gradient algorithms — proximal policy optimization, PPO, and group relative policy optimization, GRPO, are the two popular classes. And on the other side, you're going to have actors, which are generating completions. And these completions are the things that you're going to grade. So reinforcement learning is all about optimizing reward. And in practice, what you can do is have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do the actual learning, where you take the gradients, and you need to have a tightly meshed
Starting point is 01:14:49 network where you can do different types of parallelism and spread out your model for efficient training. So there's just, like — every different type of training and serving has these considerations you need to scale. Like, we talked about pre-training, we talked about RL, and then inference-time scaling is like: how do you serve a model that's thinking for an hour to 100 million users? I'm like, I don't really know about that, but I know that's a hard problem. And in order to give people this intelligence, there's all the systems problems, and you need more compute, and you need more stable compute to do it. But you're bullish on all of these kinds of scaling is what I'm hearing — on the inference,
Starting point is 01:15:23 on the reasoning, even on the pre-training. Yeah — so that's a big can of worms here. But so, basically, two of the knobs where you can get gains are the training and the inference. And so in a world where we had, let's say, infinite compute resources, you'd want to do all of them. So you have training, you have inference scaling, and training is like a hierarchy: it's pre-training, mid-training, post-training. Changing the model size, more training data,
Starting point is 01:15:49 maybe training a bigger model — it gives you more knowledge in the model. The model, let's say, is a better base model, as we called it back in the day — or still, we call it a foundation model. And it unlocks — but you don't, let's say, have the model be able to solve your most complex tasks during pre-training or after pre-training.
Starting point is 01:16:09 You still have these other unlock phases, where you have mid-training for long context, for example, or post-training with RL, that unlock capabilities the model has in terms of just knowledge from the pre-training. And I think, sure, if you do more pre-training, you get a better base model that you can unlock later. But like Nathan said, it just becomes too expensive. So we don't have infinite compute. So you have to decide: do I want to spend that compute more on making the model larger? But, you know, it's like a trade-off. In an ideal world, you want to do all of them. And I think in that sense, scaling is still pretty much alive.
Starting point is 01:16:41 You would still get a better model. But like we saw with GPT-4.5, it's just not worth it. I mean, it's like, because you can, let's say, unlock more performance with other techniques at that current moment — especially if you look at inference scaling. That's one of the biggest gains this year, with o1, where it took a smaller model further than pre-training a larger model like GPT-4.5. So it's like, I wouldn't say pre-training scaling is dead. It's just that there are other, more attractive ways to scale right now, at the moment.
Starting point is 01:17:10 But at some point, you know, you will still want to make some progress on the pre-training. The thing to consider is also where you want to spend your money. If you spend it more on the pre-training, it's like a fixed cost. You train the model, and then it has this capability forever. You can always use it and so forth. With inference scaling, you don't spend money during training. You spend money later, per query.
Starting point is 01:17:33 And then it's also like the math: how long is my model going to be on the market? If I replace it in half a year, maybe it's not worth spending $5 million, $10 million, $100 million on training it longer. Maybe I will just do more inference scaling and get the performance from there.
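That back-of-the-envelope math can be written out explicitly. Every dollar figure and query count below is hypothetical, chosen only to show the shape of the trade-off between a one-time pre-training cost and a recurring per-query inference cost.

```python
# Hypothetical costs (illustrative only, not real pricing)
train_cost_big = 100e6             # one-time: pre-train the bigger model
train_cost_small = 10e6            # one-time: pre-train the smaller model
extra_inference_per_query = 0.002  # recurring: extra "thinking" tokens per query
                                   # for the small model to match quality

# Break-even: how many queries before the bigger model's fixed cost pays off
break_even_queries = (train_cost_big - train_cost_small) / extra_inference_per_query
print(break_even_queries)  # 45 billion queries
```

Below that volume — or if the model is replaced in half a year — leaning on inference scaling wins; at ChatGPT-like volumes, the bigger fixed pre-training spend can be the cheaper option, which is the "doing the math" point made next.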
Starting point is 01:17:50 It maybe costs me $2 million in terms of user queries. It becomes a question of how many users you have, and then doing the math. And I think that's also where it's interesting, where ChatGPT is in a position — I think they have a lot of users — where they need to go a bit cheaper, where they have that GPT-5 model
Starting point is 01:18:04 that is a bit smaller. Other companies that have fewer customers have other tradeoffs. For example, there was also the Math Olympiad, or some of these math problems, where ChatGPT — or, I mean, OpenAI — they had a proprietary model, and I'm pretty sure it's just, like, a model that has maybe been fine-tuned a little bit more,
Starting point is 01:18:24 but most of it was inference scaling to achieve this peak performance in certain tasks, where you don't need that all the time. But, yeah, long story short, I do think all of these — pre-
Starting point is 01:18:39 It's just finding, at the moment, in this year, it's finding the right ratio that gives you the best bang for the buck, basically. I think this might be a good place to define pre-training, mid-training,
Starting point is 01:18:49 and post-training. So pre-training is the classic training one next token prediction at a time. You have a big corpus of data. And Nathan also has a very interesting insights there because of almost three. It's a big portion of the paper focuses on the right data mix.
Starting point is 01:19:03 So pre-training is, essentially just, you know, train across entropy loss training on next token prediction on a vast corpus of internet data, books, papers, and so forth. It has changed a little bit over the years in the sense people used to throw in everything they can. Now, it's not just raw data, it's also synthetic data where people re, let's say, rephrase certain things. So synthetic data doesn't necessarily mean purely AI made up data. It's also taking something from an article, Wikipedia article and then rephrasing it as a Q&A question
Starting point is 01:19:38 or summarizing it, rewording it, and making better data that way. Because I think of it also like with humans, if someone, let's say, reads the book compared to a messy, I don't know, no offense, but like Reddit post or something like that. I do think
Starting point is 01:19:54 you learn, no offense, but I think... There's going to be a post about this, Sirrash. Some Reddit data is very coveted and excellent for training. You just have to filter it. I think that's the idea. I think it's like, if someone took that and rephrases that in a, let's say, more concise
Starting point is 01:20:14 and structured way, I think it's higher quality data that gets the LLM, maybe the same, you get the same LLLM out of it at the end, but it gets their faster. It trains faster because let's say if the grammar and the punctuation is correct, it already learns the correct way versus getting information from a messy way and then learning later how to correct that and stuff like that. So I think that is how pre-training evolved and how still, why scaling still works is that it's not about just the amount of data. It's also the tricks to make that data better for you in a sense.
Starting point is 01:20:50 And mid-training is, I mean, it used to be called pre-training. I think it's called mid-training because it was awkward to have pre-training and post-training, but nothing in the middle, right? It sounds a bit weird to have pre-training and post-training, but what's the actual training? So the mid-training is usually similar to pre-training, but it's a bit more, I would say, specialized in pre-training. It's the same algorithm. But what you do is you focus, for example, on long-contact. Like, one example, you have long-context documents.
Starting point is 01:21:16 The reason you don't do that during just pre-training is because you don't have that many long-context documents. So you have a specific phase. And one problem of LLMs is also still, it's a neural network. It has the problem of catastrophic forgetting. So you teach it something. It forgets other things. And you want to, it's not 100% forgetting, but it's like no free lunch.
Starting point is 01:21:36 It's also the same with humans. If you ask me some math I learned 10 years ago, I don't know, I would have to look at it again. Nathan was actually saying that he's consuming so much content that there's a catastrophic forgetting issue. Yeah, I'm like trying to learn so much about AI. I was like I was learning about pre-training parallelism. I'm like, I lost something and I don't know what it was.
Starting point is 01:21:54 I don't want to anthropomorphize LLMs, but it's, I think, the same kind of in that sense, how humans learned. I mean, the quantity is not always better because, yeah, it's like being selective. And the mid-training is being selective in terms of quality content at the end. So the last thing the LM has seen is the quality stuff. And then post-training is all the fine-tuning, supervised fine-tuning, DPO, reinforcement learning with verifiable rewards, with human feedback and so forth. So the refinement stages.
Starting point is 01:22:25 And it's also interesting. It's like the cost thing, right? I mean, it's like pre-training, you spend a lot of money. on that right now, RL a bit less. RL, you don't really, I would say, teach it knowledge. It's more like unlocking the knowledge. It's more like a skill learning, like how to solve problems with the knowledge that it has from pre-training. There are actually three papers this year, last year, 2025 on RL for pre-training.
Starting point is 01:22:47 But I mean, I don't think anyone does that in production. Toy examples for now. Toy examples, right. But to generalize RL post-training is more like the skill unlock where pre-training is like soaking up the knowledge essentially. A few things that could be helpful for people. A lot of people think of synthetic data as being bad for training the models. You mentioned the deep C got an OCR, which is optical character recognition paper. A lot of labs did.
Starting point is 01:23:14 AI2 had one, had multiple. And the reason that each of these labs has these is because there's vast amounts of PDFs and other digital documents on the web that are in formats that aren't encoded with text easily. So you use these old OCR, or deep seek OCR, and we called our OCR, to extract what can be trillions of tokens of candidate data for pre-training. And pre-training dataset size is on the order of trillions is measured in trillions of tokens. Smaller models from researchers can be something like 5 to 10 trillion. Quinn is documented going up to like 50 trillion, and there's rumors that these closed labs can go to like 100 trillion tokens. And just getting this potential data to put in, I think they,
Starting point is 01:23:57 they have a very big funnel, and then the data you actually train the model on is a small percentage of this. Like the, this character recognition data would be described as synthetic data for pre-training in a lab. And then there's also the things like chat GPT now gives wonderful answers, and you can train on those best answers. And that's synthetic data. It's very different than like early chat TPT, lots of hallucinations, data when people became
Starting point is 01:24:20 grounded in synthetic data. One interesting question is, if I recall correctly, almost three was trained with less data, than specifically some other open weight models, maybe even almost two, but you still got better performance. And that might be one of the examples, how the data helps. It's mostly down to data quality.
Starting point is 01:24:35 I think if we had more compute, we would train for longer. I think we ultimately see that as a, like just like something we would want to do. And especially with big models, you need to have more compute because we talk about having more parameters, we talk about knowledge.
Starting point is 01:24:47 And essentially there's a ratio where big models can absorb more from data, and then you're going to, you get more benefit out of this. It's like one of these, Any logarithmic graph in your mind is like a small model will level off sooner if you're measuring trunks of tokens and bigger models need more. But mostly as we aren't training that big of models right now with AI2 and getting the highest quality data we can is the natural starting point. Is there something to be said about the topic of data quality?
Starting point is 01:25:14 Is there some low hanging fruit there still where the quality can be improved? It's like turning the crank. So I think historically in the open there's been like a canonical best pre-training data. set that has moved around between who has the most recent one or the best of recent effort. Like AI2's Dolmo was very early with the first Olmo and Hugging Face had Fine Web. And there's a DCLM project, which has been kind of like a, which is, it stands for data comp language model. There's been data comp for other machine learning projects. And they have had a very strong data set.
Starting point is 01:25:46 And a lot of it is the internet is becoming fairly closed off. So we have common crawl, which I think is hundreds of trillions of tokens and you filter it. And it looks like being a lot of scientific work where you're training classifiers and making decisions based on how do you prune down this data set into the highest quality stuff and the stuff that suits your tasks. So previously language models were tested a lot more on like knowledge and just kind of conversational things, but now they're expected to do math and code. So to train a reasoning model, you need to remix your whole dataset.
Starting point is 01:26:17 And there's a lot of actually wonderful scientific methods here where you can you can take your gigantic dataset. you sample a lot of really tiny things from different sources. So you say you have GitHub, Stack Exchange, Reddit, Wikipedia. You can sample small things from them, and you train small models on each of these mixes and measure their performance on your evaluations. And you can just do like basic linear regression,
Starting point is 01:26:37 and it's like, here's your optimal dataset. But if your evaluations change, your dataset changes a lot. So a lot of OLMo 3 was new sources for reasoning, to be better at math and code. And then you do this mixing procedure and it gives you the answer. And I think a lot of that's happened at labs this year. There's new hot things, whether it's like coding environments or web navigation. You just need to bring in new data.
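The mixing procedure described above — train small models on a few candidate mixes, then regress evaluation score against mixture weights — can be sketched roughly like this. The sources, mixes, and eval scores are invented stand-ins; in a real pipeline each score comes from actually training a small model on that mix.

```python
# Toy sketch of regression-based data mixing. Mixes and scores are made up.

def solve(A, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Candidate mixes over (github, wikipedia, reddit) fractions, plus the
# (illustrative) eval score of a small model trained on each mix.
mixes  = [(0.6, 0.2, 0.2), (0.2, 0.6, 0.2), (0.2, 0.2, 0.6), (0.4, 0.4, 0.2)]
scores = [0.41, 0.35, 0.29, 0.38]

# Least squares: solve the normal equations (X^T X) w = X^T y.
XtX = [[sum(m[i] * m[j] for m in mixes) for j in range(3)] for i in range(3)]
Xty = [sum(m[i] * s for m, s in zip(mixes, scores)) for i in range(3)]
w = solve(XtX, Xty)  # per-source "value" of training on that source

# Search a simplex grid (each source kept at >= 10%) for the predicted-best mix.
grid = [(i / 10, j / 10, (10 - i - j) / 10)
        for i in range(1, 9) for j in range(1, 10 - i)]
best = max(grid, key=lambda m: sum(wi * mi for wi, mi in zip(w, m)))
```

A purely linear model will push toward the highest-value source, so real setups add constraints and non-linear terms; the point is only that a cheap regression over small-scale runs can pick the mix, and that changing the evals changes `scores` and therefore the answer.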
Starting point is 01:26:57 You need to change your whole pre-training so your post-training can work better, and stuff like this. So that's the constant re-evolution and the redetermining of what they care about for their models. Are there fun anecdotes of what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a source. Reddit is very useful. I think that, like, PDFs is definitely one. Especially arXiv. Yeah, so, like, Ai2 has run Semantic Scholar for a long time,
Starting point is 01:27:28 which is, you could say, a competitor to Google Scholar with a lot more features. And to do this, Ai2 has found and scraped a lot of PDFs for openly accessible papers that might not be behind the closed paywall of a certain publisher. So, like, truly open scientific PDFs. And if you sit on all of these and you process it, you can get value out of it. And I think that a lot of that style of work has been done
Starting point is 01:27:55 by the frontier labs much earlier. And it's just, like, you need to have a pretty skilled researcher that understands how things change models, and they bring it in and they clean it. And it's a lot of labor. I think at a lot of frontier labs, when they scale up researchers, a lot more of it goes into data. Like, if you join a frontier lab
Starting point is 01:28:13 and you want to have impact, the best way to do it is just find new data that's better. And then the fancy, glamorous algorithmic things, like figuring out how to make o1, is like the sexiest thought of a scientist. It's like, oh, I figured out how to scale RL. And there's a group that did that. But I think of most of the contributions as like, I'm going to make the data better, or I'm going to make the infrastructure better so that everybody on my team can run experiments 5% faster. At the same time, I think it's also one of the closest guarded secrets what your training data is, for legal reasons. So there's also, I think, a lot of work that goes into hiding what your training data was, essentially,
Starting point is 01:28:47 like training the model to not give away the sources because of legal reasons. The other thing, to be complete, is that some people are trying to train on only licensed data, where Common Crawl is a scrape of like the whole internet. So if I host multiple websites, I'm happy to have them train language models, but I'm not explicitly licensing what governs it. And therefore this Common Crawl is largely unlicensed, which means that your consent really hasn't been provided for how to use the data. There's another idea where you can train
Starting point is 01:29:17 language models only on data that has been licensed explicitly, so that the kind of governing contract is provided. And I'm not sure if Apertus is the copyright thing or the license thing. I know that the reason they did it was for an EU compliance thing, where they wanted to make sure that their model fit one of those checks. And on that note, for example, there's also the distinction between the licensing. So some people, like you said, they just purchase the license. Say they buy a book online, let's say an Amazon Kindle book or a Manning book or something,
Starting point is 01:29:49 and then use that in the training data. And that is like the gray zone, because you paid for the content and you might want to train on it. But then there are also restrictions where even that shouldn't be allowed. And so that is where it gets a bit fuzzy. And yeah, I think that is right now still a hot topic. And also big companies like OpenAI, they approach private companies for their proprietary data. And private companies, they become more and more, let's say, protective of their data because they know,
Starting point is 01:30:17 okay, this is going to be my moat in a few years. And I do think that's the interesting question, where if LLMs become more commoditized, and a lot of people learn about LLMs, there will be a lot more people able to train LLMs. Of course, there are infrastructure challenges, but if you think of big industries like the pharmaceutical industry, law, finance,
Starting point is 01:30:38 I do think they at some point will hire people from frontier labs to build their in-house models on their proprietary data, which will then again be another unlock with pre-training that is currently not there, because even if you wanted to, you can't get that data. You can't get access to clinical trials most of the time
Starting point is 01:30:56 in these types of things. So I do think scaling in that sense might still be pretty much alive if you also look at domain-specific applications, because right now we are still just looking at general-purpose LLMs from ChatGPT, Anthropic, and so forth. They are just general purpose. They're not
Starting point is 01:31:12 even, I think, scratching the surface of what an LLM can do if it is really specifically trained and designed for a specific task. I think on the data thing, this is one of the things that happened in 2025, and we totally forget it: Anthropic lost in court and owed $1.5 billion to authors. Anthropic, I think, bought thousands of books and scanned them, and was cleared legally for that because they bought the books, and that is kind of going through the system. And then on the other side, they also torrented some books. And I think this torrenting was the path where the courts said that they were then
Starting point is 01:31:44 culpable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. Like, that is so much money from the VC ecosystem. These are court cases that will define the future of human civilization, because it's clear that data drives a lot of this. And there's this very complicated human tension. I mean, you can empathize. You're both authors. There's some degree to which, I mean, you put your heart and soul and your sweat and tears into the writing that you do; it feels a little bit like theft for somebody to train on your data without giving you credit.
Starting point is 01:32:21 And like Nathan said, there are also two layers to it. Someone might buy the book and then train on it, which could be argued fair or not fair, but then there are literally straight-up companies who use pirated books, where it's not even compensating the author. That is, I think, where people got a bit angry about it specifically. Yeah, but there has to be some kind of compensation
Starting point is 01:32:42 scheme. This is like moving towards something like what Spotify streaming did originally for music. What does that compensation look like? You have to define those kinds of models. You have to think through all of that. One other thing I think people are generally curious about, and I'd love to get your thoughts: as LLMs are used more
Starting point is 01:32:59 and more, if you look at even arXiv, but GitHub, more and more of the data is generated by LLMs. What do you do in that kind of world? How big of a problem is that? It's largely a problem for infrastructure and systems, but from an AI point of view, it's kind of inevitable. So it's basically LLM-generated data that's curated by humans, essentially, right?
Starting point is 01:33:22 Yes, and I think that a lot of open-source contributors are legitimately burning out. If you have a popular open-source repo, somebody's like, oh, I want to do open-source AI, it's good for my career, and they just vibe code something and throw it in. You might get more of this than I do. So I actually have a case study here. I have a repository called mlxtend that I developed as a student 10, 15 years ago. And it is still a reasonably popular library for certain algorithms, I think especially frequent pattern mining stuff.
Starting point is 01:33:55 And recently there were, I think, two or three people who submitted a lot of PRs in a very short amount of time. I do think LLMs have been involved in submitting these PRs. For me as the maintainer, there are two things. First, I'm a bit overwhelmed. I don't have time to read through it, especially because it's an older library that is not a priority for me. At the same time, I kind of also
Starting point is 01:34:14 appreciate it, because I think something people forget is it's not just using the LLM; there's still a human, you have a human layer that verifies something. And that is in a sense also how data is labeled, right? So that's like one of the most expensive things, getting labeled data for
Starting point is 01:34:30 the RLHF, reinforcement learning from human feedback, phases. And this is kind of like that, where it goes through phases and then you actually get higher quality data out of it. I don't mind it in a sense. It can feel overwhelming, but I do think there is also value in that.
Starting point is 01:34:44 It feels like there's a fundamental difference between raw LLM-generated data and LLM-generated data with a human in the loop doing some kind of verification, even if that verification covers a small percent of the lines of code. I think this goes with anything
Starting point is 01:35:00 where people think also sometimes, oh, I can just use an LLM to learn about X, Y, Z, which is true, you can, but there might be a person who is an expert who might have used an LLM to write a specific piece of code. There is kind of this human work that went into it to make it nice and throw out the not-so-nice parts, to make it kind of predigested for you. And that saves you time.
Starting point is 01:35:23 And I think that's the value add, where you have someone filtering things or even using the LLMs correctly. I think this is still labor that you get for free when you, for example, read an article, let's say a Substack article. I could maybe ask an LLM to give me opinions on that, but I wouldn't even know what to ask. I think there is still value in reading that article compared to me going to the LLM,
Starting point is 01:35:48 because you are the expert, you select what knowledge is actually spot on and should be included, and you give me this executive summary. And this is kind of a huge value add, because now I don't have to waste three, five hours to go through this myself,
Starting point is 01:36:05 maybe get some incorrect information and so on. And so I think that's also where the future still is for writers, even though there are LLMs: that expert can save you time. It's kind of fascinating to actually watch, and I'm sure you guys do this, but for me to look at the difference between the summary
Starting point is 01:36:23 and the original content, even if it's a page-long summary of page-long content, it's interesting to see how the LLM-based summary takes the edge off. Like, what is the signal it removes from the thing? The voice is what I talk about a lot. Voice? Well, I'd love to hear what you mean by voice. That's really powerful.
Starting point is 01:36:46 But sometimes there are literally insights. Like, in removing an insight, you're actually fundamentally changing the meaning of the thing. So I'm continuously disappointed how bad LLMs are at really getting to the core insights, which is what a great summary does. Even if you go, and I have these extensive, extremely elaborate prompts
Starting point is 01:37:09 where I'm like really trying to dig for the insights. And it's still not quite there, which, I mean, that's a whole deep philosophical question about what is human knowledge and wisdom and what does it mean to be insightful and so on? But when you talk about the voice, what do you mean? So when I write, I think a lot of what I'm trying to do is take what you think as a researcher,
Starting point is 01:37:31 which is very raw: a researcher is trying to encapsulate an idea at the frontier of their understanding. And they're trying to put what is a feeling into words. And I think that in my writing, I try to do this, which makes it come across as raw, but also high information, in a way where some people will get it and some won't. And that's kind of the nature of research.
Starting point is 01:37:51 And I think this is something that language models don't do well. Particularly, they're all trained with this reinforcement learning from human feedback, which is designed to take feedback from a lot of people and, in a way, average how the model behaves from this. And I think it's going to be hard for a model to be very incisive when there's that sort of filter in it. And I think this is kind of a wonderful fundamental problem for researchers in RLHF: it provides so much utility in making the models better, but the problem formulation has this knot in it that you can't get past. So that's what I think of: these language models don't have this prior and the deep expression that they're trying to get at.
Starting point is 01:38:32 I don't think it's impossible to do. I think there are stories of models that really shocked people. I would love to have tried Bing Sydney. Does that have more voice? Because it would so often go off the rails on people, in what is historically, obviously, a scary way. Telling a reporter to leave his wife is a crazy model to potentially put in general adoption. But that's kind of a tradeoff.
Starting point is 01:38:56 Like, is this RLHF process in some ways adding limitations? That's a terrifying place to be as one of these frontier labs and companies, because millions of people are using them. There was a lot of backlash last year with GPT-4o getting removed. And I personally never used the model, but I've talked to people at OpenAI
Starting point is 01:39:16 who are to the point where they get emails from users that might be detecting subtle differences in the deployments in the middle of the night, and they email them, and they're like, my friend is different. And they find these employees' emails and send them things because they're so attached to
Starting point is 01:39:32 what is a set of model weights and a configuration that is deployed to the users. We see this with TikTok. You open it, I don't use TikTok, and supposedly in like five minutes, the algorithm gets you. It's locked in. And those are language models doing recommendations. I think there are ways that you can do this with a language model, where within five minutes of chatting with it, the model just gets you. And that is something that people aren't really ready for. I think, like, don't give that to kids, at least until we know what's happening. But there's also going to be this mechanism.
Starting point is 01:40:05 What's going to happen with these LLMs as they're used more and more: unfortunately, the nature of the human condition is such that people commit suicide. And what journalists will do is report extensively on the people who commit suicide, and they will very likely link it to the LLMs, because they have that data about the conversations. If you're really struggling in your life, if you're depressed, if you're thinking about suicide, you're probably going to talk to LLMs about it. And so what journalists will do is say, well, the suicide was committed because of the LLM. And that's going to lead to the companies,
Starting point is 01:40:39 because of legal issues and so on, more and more taking the edge off of the LLM. So it's going to be as generic as possible. It's so difficult to operate in this space because, of course, you don't want an LLM to cause harm to humans at that level. But also, the nature of the human experience is to have a rich conversation, a fulfilling conversation, one that challenges you and from which you grow. You need that edge. And that's something extremely difficult for AI researchers on the RLHF front to actually solve, because you're actually dealing with the human condition. Like, a lot of researchers at these companies are so well motivated. And the likes of Anthropic and OpenAI culturally so want to do good
Starting point is 01:41:28 through this for the world. And it's such a — I'm like, oh, I don't want to work on this. Because on the one hand, a lot of people see AI as a health ally, as somebody they can talk to about their health confidentially. But then it bleeds all the way into talking about mental health and things, where it's heartbreaking that this could be the thing where somebody goes over the edge. But other people might be saved. And there are things that, as a researcher training models,
Starting point is 01:41:57 it's like I don't want to train image generation models and release them openly, because I don't want to enable somebody to have a tool on their laptop that can harm other people. I don't have the infrastructure at my company to do that safely. But there are a lot of areas like this where it just needs people that will approach it with the complexity and conviction of, it's just such a hard problem. But also we as a society, as users of these technologies, need to make sure that we're having the complicated conversation about it, versus just fearmongering:
Starting point is 01:42:27 big tech is causing harm to humans or stealing your data, all that kind of stuff. It's more complicated than that, and you're right. There's a very large number of people inside these companies, many of whom you know, many of whom I know, who deeply care about helping people. They are considering the full human experience of people from across the world, not just Silicon Valley.
Starting point is 01:42:48 People across the United States, people across the world: what that means, what their needs are. It's really difficult to design this one system that is able to help all these different kinds of people — people across different age groups, cultures, mental states, mental conditions, all that kind of stuff. I wish that the timing of AI was different with respect to the relationship of big tech to the average person.
Starting point is 01:43:09 Because big tech's reputation was so low. And with how AI is so expensive, it's inevitably going to be a big tech thing, where it takes so many resources, and people say that the U.S. is quote-unquote betting the economy on AI with this buildout. And to have these be intertwined at the same time just makes for such a hard communication environment. It would be good for me to go talk to more people in the world that hate big tech and see AI as a continuation of this. And one of the things you actually recommend, one of the antidotes that you talk about,
Starting point is 01:43:41 is to find agency in this whole system, as opposed to sitting back in a powerless way and consuming the AI slop as it rapidly takes over the internet. You find agency by using it to build stuff: build apps, build things. So one, it actually helps you build the intuition; but two, it's empowering, because you're going to understand how it works, what the weaknesses are. And it gives your voice power to say, this is fucked up, this is a bad use of the technology, and this is a good use of the technology.
Starting point is 01:44:15 And you're more plugged into the system, so you can understand it better and you can steer it better. I think it's a good point you brought up, agency. Instead of ignoring it and saying, I'm not going to use it, I think it's probably long-term healthier to say, okay, it's out there, I can't put it back — you know, like the internet or computers back then when they came out. How do I make best use of it?
Starting point is 01:44:36 And how does it help me to up-level myself? The one thing I worry about here, though, is if you just fully use it for something you love to do, the thing you love to do is no longer there. And that could potentially, I feel, lead to burnout. For example, if I use an LLM to do all my coding for me, now there's no coding. I'm just managing something that is coding for me. Two years later, let's say, if I just do that eight hours a day, have something code for me, do I feel fulfilled still?
Starting point is 01:45:04 Like, is this hurting me in terms of being excited about my job, excited about what I'm doing? Am I still proud to build something? So on that topic of enjoyment, it's quite interesting — we should just throw this in there — that there's this recent survey of about 791 professional developers, professional meaning 10-plus years of experience. That's a long time. Yeah. That's a junior developer?
Starting point is 01:45:34 Yeah, in this day and age. So the results here on many fronts are surprising. They break it down by junior and senior developers. But it shows that both junior and senior developers use AI-generated code in code they ship. So this is not just for fun, sort of intermediate learning things. This is code they ship. And most of them use around 25% or more, many around 50% or more. And what's interesting is, for the category of over 50% of the code you ship being AI-generated, senior developers are much more likely to do so. But you don't want AI to take away the thing you love.
Starting point is 01:46:18 I think it speaks to my experience, these particular results I'm about to mention. So together, about 80% of people find it either somewhat more enjoyable or significantly more enjoyable to use AI as part of the work. I think it depends on the task. From my personal usage, for example: I have a website where I sometimes tweak things. I personally don't enjoy this. So in that sense, if the AI can help me implement something on my website,
Starting point is 01:46:47 I'm all here for it. It's great. But then at the same time, when I solve a complex problem — well, if there's a bug and I hunt this bug and I find the bug, it's the best feeling in the world. You get so much joy, you feel great. But now if you don't even think about the bug, you just go directly to the LLM, well, you never have this kind of feeling, right? But then there could be the middle ground where, well, you try yourself, you can't find it,
Starting point is 01:47:15 you use the LLM, and then you don't get frustrated because it helps you, and you move on to something that you enjoy. And so looking at these statistics, I think the difference is — or what is not factored in — it's averaging over all the different scenarios, so we don't know if it's for the core task or for something mundane that people would not have enjoyed otherwise. So in a sense, AI is really great for doing mundane things that take a lot of work. For example, my wife the other day,
Starting point is 01:47:44 she has a podcast for book discussions, a book club, and she was transferring show notes from Spotify to YouTube, and then the links somehow broke. And in some episodes, because they discussed so many books, she had like 100 links or something, and it would have been really painful to go in there and fix each link manually. And so I
Starting point is 01:48:03 suggested, hey, let's try ChatGPT. We copied the text into ChatGPT, and instead of two hours going from link to link fixing them, it made that type of work much more seamless. There was no frustration. I think everyone has a use case where AI is useful for something like that that would be really boring, really mundane.
Starting point is 01:48:23 For me personally, since we're talking about coding, and you mentioned debugging: a lot of the source of the enjoyment for me, more on the Cursor side than the Claude Code side, is — I have a friend, I have a co-, what's that called, a pair programmer. It's less lonely. You made debugging sound like this great joy. No, I would say debugging is like a drink
Starting point is 01:48:51 of water after you've been going through a desert for days. So you skip the whole desert part where you're suffering. Sometimes it's nice to have a friend who can't really find the bug but can give you some intuition about the code, and you're together with that friend going through the desert and then together finding that drink of water. So at least for me — maybe it speaks to the loneliness of the programming experience — that is a source of joy. It's maybe also related to delayed gratification.
Starting point is 01:49:24 I'm a person who, even as a kid, liked the idea of Christmas presents — the anticipation of getting them — better than actually getting the presents. I would look forward to the day I get the presents, but then it's over and I'm disappointed. And maybe it's something like that also with, let's say, food. I think food tastes better when you're really hungry. And yeah, you're right about debugging.
Starting point is 01:49:47 It is not always, you know, great; it's often frustrating. But then if you can solve it, it's great. But there's also a sweet Goldilocks zone — if it's too hard, it's, you know, wasting your time. But I think that is another challenge, though. How will people learn? I mean, in the chart we looked at,
Starting point is 01:50:08 we saw that more senior developers are shipping more AI-generated code than the junior ones. And I think it's very interesting, because intuitively you would think it's the junior developers, because they don't know, let's say, how to do the thing yet — they are more junior — and so they would use AI to do that thing. It could either mean the AI is not good enough yet to solve that task, but it could also mean experts are more effective at using it. They know where and how to use it better, and they review the code, and they trust the code more.
Starting point is 01:50:37 And so I think one issue for society in the future will be, though: how do you become an expert if you never try to do the thing yourself? The way I always learn is by trying things myself. Like math textbooks: if you look at the solutions, yeah, you learn something. But I think you actually learn better if you try first, and then you appreciate the solution differently, because you know how to put it into your mental framework. And if LLMs are there all the time, would you actually go through the lengths of struggling? Would you be willing to struggle? Because struggle is not nice, right? I mean, it's struggling.
Starting point is 01:51:15 And if you use the LLM to do everything, at some point you will never really take the next step. And then you will maybe not get that unlock that you would get as an expert using an LLM. So I think there's a Goldilocks sweet spot, where maybe the trick is you make dedicated offline time where you study two hours a day, and the rest of the day you use LLMs. But I think it's important for people to still invest in themselves, in my opinion, to not just, you know, LLM everything.
Starting point is 01:51:43 Yeah, we together as a civilization, and each of us individually, have to find that Goldilocks zone — in the programming context, as developers. Now, we've had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. A lot of fun stuff in post-training.
Starting point is 01:52:01 So what are some of the interesting ideas in post-training? The biggest one from 2025 is this reinforcement learning with verifiable rewards. You can scale up the training there, which means doing a lot of this kind of iterative generate-grade loop, and that lets the models learn interesting behaviors on both the tool use and software side. This could be searching, or running commands on their own and seeing outputs. And then also that training enables this inference-time scaling very nicely. And it just
Starting point is 01:52:31 turned out that this paradigm was very nicely linked, where this kind of RL training enables inference-time scaling. But inference-time scaling could have been found in different ways. So it was kind of this perfect storm: the models changed a lot in the way that they're trained, and RL is a major factor in doing so. And this has changed how people approach post-training dramatically. Can you describe RLVR, popularized by DeepSeek R1? Can you describe how it works? Yeah, fun fact: I was on the team that came up with the term RLVR,
Starting point is 01:53:02 which is from our Tülu 3 work before DeepSeek. We don't take a lot of credit for being the people to popularize scaling RL, but it is fun — what academics get as an aside is the ability to name and influence the discourse, because the closed labs can only say so much. One of the things you can do as an academic is, you might not have the compute to train the model, but you can frame things in a way that ends up being — I describe it as, a community can come together around this RLVR term, which is very fun. And then DeepSeek is the people that did the training breakthrough, which is they scaled the reinforcement learning, which was:
Starting point is 01:53:39 you have the model generate answers and then grade the completion on whether it was right. And then that accuracy is your reward for reinforcement learning. So reinforcement learning is classically an agent that acts in an environment, and the environment gives it a state and a reward back, and you try to maximize this reward. In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it's math problems or coding tasks,
Starting point is 01:54:07 and it starts to get blurry with things like factual domains — that is also in some ways verifiable — or constraints on your instruction, like respond only with words that start with A. All of these things are verifiable in some way. And the core idea of this is you find a lot more of these problems that are verifiable, and you let the model try it many times while taking these RL steps, these RL gradient updates,
Starting point is 01:54:36 and the infrastructure evolved from this reinforcement learning from human feedback, where in that era, the score they were trying to optimize was a learned reward model of aggregate human preferences. So you kind of change the problem domains, and that let the optimization go to much bigger scales, which kickstarted a major change in what the models can do and how people use them. What kind of domains is RLVR amenable to? Math and code are the famous ones. And then there's a lot of work on what are called rubrics, which is related to what people might have heard of as LLM-as-a-judge, which is: for each problem in a set of problems
Starting point is 01:55:19 in my training data set, I will then have another language model, and I ask it: what would a good answer to this problem look like? And then you could try the problem a bunch of times over and over again and assign a score based on this rubric. So that's not necessarily verifiable like a math and code domain, but this rubrics idea — and other scientific problems that might be a little bit more vague — is where a lot of the attention is. They're trying to push this set of methods into these more open-ended domains, so the models can learn a lot more. I think that's called reinforcement learning with AI feedback, right? That's the older term for it, which was coined in Anthropic's constitutional AI paper. So a lot of these things come in cycles.
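A minimal sketch of what such verifiable reward functions might look like (the answer-extraction format and tasks are illustrative, not any lab's actual graders; a rubric-based setup would instead call a judge model here):

```python
# Toy verifiable rewards: grade completions with programmatic checks rather
# than a learned reward model. Formats and tasks are made up for illustration.
import re

def math_reward(completion: str, answer: str) -> float:
    """1.0 if the stated final answer matches the reference, else 0.0."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion.lower())
    return 1.0 if match and match.group(1) == answer else 0.0

def constraint_reward(completion: str) -> float:
    """Instruction-following check: every word must start with 'a'."""
    words = completion.split()
    return 1.0 if words and all(w.lower().startswith("a") for w in words) else 0.0

# In RL training, rewards over many sampled completions per prompt become
# the signal for the policy-gradient updates described above.
samples = [
    "Let me compute 12 * 7 step by step... the answer is 84",
    "The answer is 82",
]
rewards = [math_reward(s, "84") for s in samples]
```

The point is that the grader is cheap, deterministic code, which is what lets this loop run at much larger scale than preference-based RLHF.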
Starting point is 01:55:54 Also, just one step back for the RLVR. So I think the interesting, beautiful thing here is that you ask the LLM, let's say, a math question, and then you know the correct answer. And you let the LLM, like you said, figure it out. But how it does it, I mean, you don't really constrain it much. There are some constraints you can add, like use the same language, don't switch between Spanish and English. But let's say you're pretty much hands-off. You only give the question and the answer. And then the LLM's task is just to arrive at the right answer. But the beautiful
Starting point is 01:56:25 thing here is what happens in practice is that the LLM will do a step-by-step description, like how you would derive the solution as a student or as a mathematician. It will use those steps. And that actually helps the model improve its own accuracy. And then, like you said, the inference scaling. So inference scaling loosely means spending more compute when using the LLM at inference time. And here the inference scaling is that the model would use more tokens. And also, I think in the R1 paper, they showed the longer they train the model, the longer the responses are. They grow over time. They use more tokens. So it becomes more expensive, even for simple tasks. But these explanations, they help the model
Starting point is 01:57:08 with the accuracy. There are also a lot of interesting papers showing that what the model explains does not necessarily have to be correct, or maybe it's even unrelated to the answer, but for some reason it still helps the model, just the fact that it is explaining. And I think it's also, again, I don't want to anthropomorphize these LLMs, but it's kind of like how we humans operate, right?
Starting point is 01:57:28 If there's a complex math problem, let's say in a math class, you usually have a notepaper and you do it step by step, you cross out things. And the model also self-corrects and that was, I think, the aha moment in the R1 paper.
Starting point is 01:57:41 They called it the aha moment because the model itself recognized it made a mistake and then said, ah, I did something wrong, so let me try again.
Starting point is 01:57:47 And I think that's just so cool that this falls out of just giving it the correct answer and having it figure out how to do it that it kind of
Starting point is 01:57:56 does in a sense what a human would do. Although LLMs don't think like humans, it's kind of like an interesting coincidence. And the other
Starting point is 01:58:04 nice side effect is it's great for us humans often to see these steps. It builds trust, but we can also learn from and double-check things.
Starting point is 01:58:12 There's a lot in here. There's been a lot of debate this year on whether these aha moments are kind of fake, because in pre-training, you essentially have seen the whole internet. So you have definitely seen people explaining their work, even verbally, like a transcript of a math lecture: you try this, oh, I messed this up. And what reinforcement learning, this RLVR, is very good at doing is amplifying these behaviors, because they're very useful in enabling the model to think longer and to check its work. And I agree that it is very beautiful that this trend,
Starting point is 01:58:43 the model learning to amplify this, is just so useful at making the final answers better. I can give you also a hands-on example. I was training the Qwen 3 base model with RLVR on MATH-500. The base model had an accuracy of about 15%. With just 50 steps, like in a few minutes with RLVR, the model went from 15% to 50% accuracy. And you can't tell me it's learning anything fundamentally about math. That exact example is weird, because there have been two papers this year, one of which I was on, that talk about data contamination in Qwen, and specifically that they train on a lot of this special mid-training phase that we spent like a minute on, because it's weird, because they train on problems that are
Starting point is 01:59:25 almost identical to MATH. Exactly. And so you can see that basically the RL, it's not teaching the model any new knowledge about math. You can't do that in 50 steps. So the knowledge is already there in the pre-training. You're just unlocking it. I still disagree with the kind of premise, because there's a lot of weird complexities that
Starting point is 01:59:40 you can't prove, because one of the things that points to weirdness is that if you take the Qwen 3 so-called base model, you could Google, like, MATH dataset Hugging Face, and you could take a problem. All these math problems have words. So it would be like Alice has five apples and takes one and gives three to whoever, and there are these word problems. With these Qwen base models, why people are suspicious of them is if you change the numbers but keep the words, Qwen will, without tools, produce a very high-precision, like decimal, representation of the answer, which means at some time it was shown problems that were almost identical to the test set, and it was using tools to get a very high-precision answer. But a language model without tools will never actually have this.
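The probe Nathan describes, changing the numbers in a word problem while keeping the wording, can be sketched as below. This is an illustrative assumption of how such a contamination check might look, not the exact procedure from the papers he mentions:

```python
# Hypothetical contamination probe: perturb the numbers in a word problem while
# keeping the wording identical, then compare a model's answers on the original
# versus the perturbed version. A model that memorized the original test item
# will often still emit the original (now wrong) high-precision answer.
import re

def perturb_numbers(problem: str, shift: int = 1) -> str:
    # Shift every integer in the problem text by a constant.
    return re.sub(r"\d+", lambda m: str(int(m.group()) + shift), problem)

original = "Alice has 5 apples and gives 3 to Bob. How many are left?"
print(perturb_numbers(original))
# Alice has 6 apples and gives 4 to Bob. How many are left?

# In a full probe you would query the model on both versions: if accuracy
# collapses on the perturbed set, or the memorized answer reappears, that
# suggests train/test overlap rather than genuine math ability.
```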
Starting point is 02:00:30 So it's kind of been this big debate in the research community: these reinforcement learning papers that are training on Qwen and measuring specifically on this MATH benchmark, where there have been multiple papers talking about contamination, how much can you believe them? And I think this is what caused the reputation of RLVR being about formatting, because you can get these gains so quickly
Starting point is 02:00:50 and therefore it must already be in the model. But there's a lot of complexity here that it's not really like controlled experimentation. So we don't really know. But if it weren't true, I would say distillation wouldn't work great. I mean, distillation can work to some extent. But the thing is, that is, I think, the biggest problem
Starting point is 02:01:08 and LM research this contamination because we don't know what's in the data. Unless you have a new data set, it's really impossible. And the same you mentioned the math data set, which is if a question and an answer and an explanation is given, but then also even something simpler like MMLU, which is a multiple choice benchmark, if you just change the format slightly, like, I don't know, you use a dot instead of a parenthesis or something like that.
Starting point is 02:01:35 The model accuracy will vastly differ. I think that that can be. like a model issue rather than a general issue. It's not even malicious by the developers of the LM like, hey, we want to cheat at that benchmark. It's just, it has seen something at some point. And I think the only fair way to evaluate an LM is to have a new benchmark that is after the cutoff date when the LLM was deployed. Can we lay out what would be the sort of the recipe of all the things that would be going to post-training? And you mentioned our RLVR was a really exciting, effective
Starting point is 02:02:04 thing, maybe we should elaborate. RLHF still has it in a really important component to play. What kind of other ideas are there on post-training? I think you can kind of take this in order. I think you could view it as what made 01, which is this first reasoning model possible, or what will the latest model be? And they actually have, you're going to have similar interventions at these, where you start with mid-training and the thing that is rumored to enable 01 and similar models is really, careful data curation where you're providing a broad set of like what is called reasoning traces, which is just the model generating words in a forward process that is reflecting,
Starting point is 02:02:46 like breaking down a problem into intermediate steps and trying to solve them. So at mid-training, you need to have data that is similar to this to make it so that when you move into post-training primarily with this verifiable rewards, it can learn. And then what is happening today is you're figuring out which, problems to give the model and how long you can train it for and how much inference you can enable the model to use when solving these verifiable problems. So as models get better, certain problems are no longer, like the model will solve them 100% of the time and therefore there's very little signal in this. If we look at the GRPO equation, this one is famous for this because essentially
Starting point is 02:03:28 the reward given to the agent is based on how good a given action, action is a completion is relative to the other answers to that same problem. So if all the problems get the same answer, there's no signal in these types of algorithms. So what they're doing is they're finding harder problems, which is why you hear about things like scientific domains, which is like that's so hard, like getting anything right there. If you have a lab or something, it just generated so many tokens or much harder software problems. So the frontier models are all pushing into these harder domains and they can train on more problems and the model will learn more skills at once. the RLHF link to this is kind of like
Starting point is 02:04:05 RLHF has been and still is kind of like the finishing touch on the models where it makes the models more useful. By improving the organization or style or tone, there's different things that resonates to different audiences. Like some people like a really quirky model and RLHF could be good at enabling that personality. And some people hate this like markdown
Starting point is 02:04:24 bulleted list thing that the models do. But it's actually really good for quickly parsing information. In RLHF, this human feedback stage is really great. for just putting this into the model at the end of the day. So it's what made ChatGBTGBT is so magical for people. And that use has actually remained fairly stable. This formatting can also help the models get better at math problems, for example. So it's like the border between style and formatting and like the method that you use to answer a problem is actually,
Starting point is 02:04:58 they're all very closely linked in terms of when you're training these models, which is why RLHF can still say make a model better at math, but these verifiable domains are a much more direct process to doing this, because this kind of makes more sense with the problem formulation, which is why it kind of ends up all forming together. But to summarize, it's like mid-training is give the model the skills it needs to then learn. RL and verifiable rewards is let the model try a lot of time. So put a lot of compute into trial and error learning across hard problems.
Starting point is 02:05:28 And then RLHF would be like, finish the model, make it easy to use, and kind of just round the model out. Can you comment on the amount of compute required for RLVR? It's only gone up and up, so I think GROC4 was famous for saying they used a similar amount of compute for pre-training and post-training. Back to the scaling discussion, they involve very different hardware for scaling.
Starting point is 02:05:50 Pre-training is very compute-bound, which is like this flop's discussion, which is just how many matrix multiplications can you get through one time. And because RL, you're generating these answers, you're trying the model in the real world environments, it ends up being much more memory-bound because you're generating long sequences, and the attention mechanisms have this behavior
Starting point is 02:06:09 where you get a quadratic increase in memory as you're getting to longer sequences. So the compute becomes very different. So when in pre-training, we would talk about a model, I think if we go back to like the Biden administration executive order, it's like 10 to the 25th flops to train a model. If you're using flops in post-training, it's a lot weirder because the reality is just like
Starting point is 02:06:28 how many hours are you? you allocating how many GPUs for. And I think in terms of time, the RL compute is getting much closer because you just can't put it all into one system. Like pre-training is so computationally dense where all the GPUs are talking to each other and it's extremely efficient, where RL has all these moving parts and it can just take a long time to generate a sequence of 100,000 tokens. Like if you think about GBT 5.2 Pro taking an hour, it's like what if your training run has a sample for an hour and you have to make it so that's handled efficiently? So I think in GPTBT, hours or just like wall clock hours, the RL runs are probably approaching the number of days
Starting point is 02:07:06 as pre-training, but they probably aren't using as many GPUs at the same time. There's rules of thumb where in labs, it's like you don't want your pre-training runs to last more than like a month because they fail catastrophically. And if you were planning a huge cluster to be held for two months and then it fails on day 50, the opportunity costs are just so big. So you kind of don't want to just, people don't want to put all their eggs in one basket, which is like GBT4 was like the ultimate Yolo run and nobody ever wanted to do it before
Starting point is 02:07:33 where it took like three months to train and everybody was shocked that it worked where I think people are a little bit more cautious and incremental now. So RLVR is more, let's say, unlimited how much you can train and get still benefit where RLHF, because it's a preference tuning, you reach a certain point where it doesn't really make sense
Starting point is 02:07:51 to spend more RL budget on that. So just a step back with preference tuning. So there are multiple people that can give multiple, let's say explanations for the same thing and they can both be correct but at some point you learn a certain style and it doesn't make sense to you know iterate on it my favorite example is like if relatives ask me what laptop they should buy i give them an explanation or ask them like yeah what is your um use case like they for example prioritize battery life and storage other people like us for example we would prioritize RAM and compute and so but both both answers are correct but different people
Starting point is 02:08:27 require different answers and with preference tuning well, you're trying to average somehow. Like you are asking data label us to give you the right, or not the right, the preferred answer. And then you train on that. But at some point, yeah, you learned that average preferred answer. And there's no, I think, reason to keep training longer on it because, you know, it's just a style where with our LVR, you literally give the model,
Starting point is 02:08:49 well, you let the model solve more and more complex, difficult problems. And so I think that it makes more sense to allocate more budget long term to LRVR. And also that right now we are in LRVR 1.0 plant where it's still like that simple thing where we have a question and answer, but we don't do anything with the one stuff in between. So there was also, I mean, multiple research papers also by Google, for example, on process reward models that also give scores for the explanation, how correct is the explanation. And I think that will be the next thing, let's say our LVR 2.0 for this year, focusing in between questions. and answer like how to leverage that information, the explanation to improve the explanation and help it to get better accuracy. But then, so that's one angle.
Starting point is 02:09:37 And there was a deep seek math version two paper where they also had interesting inference scaling there. First, they had developed models that grade themselves, a separate model. And I think that that will be one aspect and the other like Nathan mentioned. It will be for LR branching into other domains. the place where people are excited are value functions, which is pretty similar. So process reward models are kind of like process reward models assign how good something is to each kind of intermediate step in a reasoning process where value functions apply value
Starting point is 02:10:12 to every token the language model generates. Both of these have been largely unproven in the language modeling and this reasoning model era. People are more optimistic about value functions forever. For whatever reason now, I think process. reward models were tried a lot more in this pre-01 pre-reasoning model era and a lot of people had a lot of headaches with them. So I think a lot of it is the human nature of like value models have a very deep history in reinforcement learning. They're one of the first things that were
Starting point is 02:10:41 core to like deep reinforced learning existing is like training value models in this. So right now the literature, people are excited about trying value models, but there's very little proof in it. And there are negative examples in trying to scale up process reward models. these things don't always hold in the future. I think we came to this discussion by talking about scaling and a simple way to summarize what you're saying with like you don't want to do too much RLHF, which is eventually the signal scales,
Starting point is 02:11:06 is people have worked on RLHF for language models for years, especially in intense interest after ChatGBTBT. And the first release of a reasoning model trained with RLVR, opening eyes 01, had a scaling plot where if you increase the training compute logarithmically, you get a linear increase in evaluations. And this has been reproduced, multiple times. I think deep seek how to plot like this. But there's no scaling law for RLHF
Starting point is 02:11:29 where if you log increase the compute, you get some performance. In fact, the seminal scaling paper for RLHF is scaling laws for reward model over optimization. So it's like that's a big line to draw with RLVR and the methods we have now and in the future, like they will follow the scaling paradigm, which is like the best runs you can let to run for an extra 10x and you get a few X performance, but you can't do this with RLHF. And that is. just going to be field-defining and how people approach them, where I'm a shill for people academically to do RLHF. And that's a good way to describe it. It's like, to do the best RLHF, you might not need the extra 10 or 100X of compute, but to do the best RLVR you do,
Starting point is 02:12:11 so I think there's a, what I say is a seminal paper from what was a meta-internhip. It's like the art of scaling, reinforcer learning with language models. They're, what they describe as a framework is scale RL. And their incremental experiment was like 10,000 B 200 hours, which is like thousands or tens of thousands of dollars per experiment. And they do a lot of them, which is just like this cost is not accessible to the average academic, which is a hard equilibrium where it's trying to figure out how to learn from each community. I was wondering if it could take at this point a bit of a tangent and talk about education and learning. If you're somebody listening to this, who's a smart person interested in programming, interested in AI. So I presume
Starting point is 02:12:57 building something from scratch is a good beginning. So can you just take me through what you would recommend people do? So I would personally start, like you said, implementing a simple model from scratch that you can run on your computer. The goal is not if you build a model from scratch to have like something you use every day for your personal projects. Like it's not going to be your personal assistant replacing an existing open weight model. Chachapidi, it's to see what exactly goes into the LLM, what exactly comes out of the LLM, how the pre-training works in that sense, on your own computer preferably. And then you learn about the pre-training, the supervised fine-tuning, the attention mechanism,
Starting point is 02:13:36 you get a solid understanding of how things work. But at some point you will reach a limit because small models can only do so much. And the problem with learning about LLMs at scale is, I would say it's exponentially more complex to make a larger model because it's not that the model just becomes larger. You have to now think about sharding your parameters across multiple GPUs. Even for the KV cache, there are multiple ways you can implement it. One is just to understand how it works, just to grow the cache. It's like a cache you grow step by step by, let's say, concatening lists growing it.
Starting point is 02:14:09 But then that wouldn't be optimal in GPUs. You wouldn't do that. You would pre-allocate a tensor and then fill it in. But that adds, again, another 20, 30 lines of code. and for each thing, you add so much code. And I think the trick with the book is basically to understand how the LLM works. It's not going to be your production level LLM. But once you have that, you can understand the production level LLLLLLL.
Starting point is 02:14:29 So you're trying to always build an LN that's going to fit on one GPU. Yes, most of them I have, I have some bonus materials on some MEO models. I think one or two of them, they may require multiple GPUs, but the goal is to have it on one GPU. And the beautiful thing is also you can self-verify. It's almost like RL. the VR, when you code these from scratch, you can take an existing model from the Hugging Face Transformer Library. So the Hanging Phase Transformer Library is great, but if you want to learn about
Starting point is 02:14:58 LLMs, I think that's not the best place to start because the code is so complex, because it has to fit so many use cases. Also, some people use it in production. It has to be really sophisticated, and it's really intertwined and really hard. It's not linear to read. It was started as a fine-tuning library. And then it grew to be like the standard representation of every model architecture and the way this loaded. So HuggingFace is like the default place to get a model and Transformers is the software that enables it so people can easily load a model and do something basic with it. And all frontier labs that have open weight models have a Hanging Phase Transformers version of it like from Deepseek to GPTOSS. That's like the canonical weight that you can load there. But again,
Starting point is 02:15:40 also even Transformers, the library is not used in production. people use than SG Lang or VLM, and it adds another layer of complexity. We should say that the Transformers Library has like 400 models. So it's a one library that tries to implement a lot of LLMs. And so you have a huge code base, basically. It's like huge. It's like it's, I don't know, maybe millions of thousands of lines of code. And it's like understanding the part that you want to understand is finding the needle in the haystack.
Starting point is 02:16:09 But what's beautiful about it is you have a working implementation. And so you can work backwards from it. What I recommend doing, but I also do is if I want to understand, for example, how almost three is implemented, I would look at the weights in the model hub, the config file, and then you can see, they use so many layers.
Starting point is 02:16:26 They use, let's say, group query attention or multi-head attention in that case. Then you see all the components in like a human readable, I don't know, 100 lines of config file. And then you start, let's say, with your GPT2 model and add these things, you know. And the cool thing here is you can then load the pre-trained weights and see if they work in your model.
Starting point is 02:16:45 And you want to match the same output that you get with a transformer model. And then you can use it as a basically as a verifiable reward to make your architecture correct. And then it's kind of sometimes it takes me a day to with almost three. The challenge was rope for the position embeddings. They had a yarn extension and there was some custom scaling there. And I couldn't quite match these things. And in this struggle, you kind of understand things. But the cool thing is, at the end, you know you have it correct because you can unit tested.
Starting point is 02:17:15 You can check against the reference implementation. And I think that's maybe one of the best ways to learn, really, like to basically reverse engineer something. I think that that is something that everybody that's interested in getting to AI today should do. And I think that's why I liked your book is like I came to language models from this RL and robotics field. Like I never had taken the time to just like learn all the fundamentals. and this transformer architecture I described as being so fundamental as deep learning was a thing that I had to learn in the past and people need to do this. I think that where a lot of people kind of get overwhelmed is how do I apply this to have impact or find a career path? Because AI and language models make this fundamental stuff so accessible and people with motivation will learn it.
Starting point is 02:18:02 And then it's like how do I get the cycles on goal to contribute to research? And I think that I'm actually fairly optimistic in this because the field moves so fast that a lot of times the best people like don't fully solve a problem because there's a bigger lower, like a bigger problem to solve that's very low hanging fruit. So they move on. And I think that a lot of what I was trying to do in this RLHF book is like take post trading techniques and just describe how people think about them influencing the model and what people are doing. And then it's remarkable how many things I just think are just like people stop studying. them or don't. So I think people trying to get narrow after doing the fundamentals is good. And then reading the relevant papers and being engaged in the ecosystem, it's like you actually, the proximity that random people have online from the leading researchers, like no one knows who all the anonymous account on X and ML is very popular for whatever reason. And no one knows who all these people are. Like it could just be random people that study the stuff deeply, especially with the AI tools and just be like, I don't understand this. Keep.
Starting point is 02:19:07 digging into it, I think is a very useful thing. But there's a lot of research areas that are maybe three papers that you need to read. And then one of the authors will probably email you back, but you have to put a lot of effort into these emails to understand the field. Like, I think it would be for a newcomer, easily weeks of work to feel like they can truly grasp, like what is a very narrow area. But I think going narrow after you have the fundamentals be very useful to people because it's like I became very interested in character training, which is like how you make the model funny or sarcastic or serious. And like, what do you do to the data to do this? And it's like, a student at Oxford reached out to me. It's like,
Starting point is 02:19:49 hey, I'm interested in this and I advised him. And I was like, that paper now exists. And it's like, I don't know, there's like two or three people in the world that were very interested in this. He's a PhD student, which gives you an advantage. But like, for me, that was a topic. I was waiting for someone be like, hey, I have time to spend cycles on this. And I'm sure there's a lot more very narrow things or you're just like, oh, it doesn't make sense that there was no answer to this. And I think that it's just like,
Starting point is 02:20:13 there's so much information coming that people are like, I can't grab onto any of these. But if you just actually stick in an area, I think there's a lot of interesting things to learn. Yeah, I think you can't try to do it all because it would be very overwhelming and you would burn out if you try to keep up with everything. For me, for example, I haven't kept up with computer vision a long time,
Starting point is 02:20:30 just focused on LMs. But coming back to your book, for example, think this is also a really great book and a really good bang for the buck because you want to learn about RLHF. I wouldn't go out there and read RLHF papers because you would be spending two years. I contradict. I just edited the book and I was like there's a chapter where I had to be like X papers say one thing and X papers say another thing and we'll see what comes out to be true. What are some of the, just to go through some of the table of context, some of the ideas we might have missed in the bigger picture of the post training. So first of all, you do the problem. Set up,
Starting point is 02:21:01 Training overview, what a preference is, preferences, preferences data in the optimization tools, reward modeling, regularization, instruction tuning, rejection sampling, reinforcement learning, i.e. policy gradients, direct alignment algorithms. Then constitutional AI and AI feedback, reasoning and inference time scaling, to use and function calling,
Starting point is 02:21:22 synthetic data and distillation, evaluation, and then open question section, over optimization, style, and information. and then product, UX, character, and post training. So what are some ideas worth mentioning that connect both the educational component and the research component? You mentioned the character training.
Starting point is 02:21:40 This is pretty interesting. Character training is interesting because there's so little out of it, but we talk about how people engage with these models and like we feel good using them because they're positive, but that can go too far. It could be too positive.
Starting point is 02:21:51 And it's like, essentially it's, how do you change your data or decision-making to make it exactly what you want? And Open AI has this thing called a model spec, which is essentially their internal guideline for what they want to model to do. And they publish this to developers. So essentially, you can know what is a failure of Open AI's training, which is like they have the intentions and they haven't met it yet, versus what is something that they actually wanted to do and that you don't like.
Starting point is 02:22:18 And that transparency is very nice. But all the methods for curating these documents and how easy it is to follow them is not very well known. I think the way the book is designed is that the reinforce learning chapter is obviously. obviously what people want because everybody hears about it with RLVR. And it's the same algorithms and the same map. But it's just like you can use it in very different documents. So I think the core of RLHF is like how messy preferences are is essentially rehash of a paper I wrote years ago. But this is essentially the chapter that will tell you why RLHF is never,
Starting point is 02:22:50 ever fully solvable. Because like the way that even RL is set up is that as soon, that preferences can be quantified and that multiple preferences can be reduced to single values. And I think it relates in the economics literature to the von Neumann-Morgensen utility theorem. And like that is the chapter where all of that philosophical, economic, and like psychological context, it tells you what gets compressed into doing RLHF. So it's like you have all of this. And then later in the book, it's like you use this RL math to make the number go up.
Starting point is 02:23:25 And I think that that's why I think it would be very rewarding for people to do research on. is because it's like quantifying preferences is something that is just like, humans have designed the problem in order to make preferences studyable. But there's kind of fundamental debates on like, an example is in a language model response. You have different things you care about whether it's accuracy or in style. And when you're collecting the data, they all get compressed into like,
Starting point is 02:23:49 I like this more than another. And that is happening, and there's a lot of philosophical work, a lot of research in other areas of the world, that goes into how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences. And I went to a workshop that published a white paper on how you can think about using social choice theory for RLHF.
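As a sketch of the compression being described here: the standard way pairwise preference data enters RLHF reward modeling is the Bradley–Terry formulation. This is textbook background, not something stated in the conversation:

```latex
% Probability that response y_w is preferred over y_l for prompt x,
% under a scalar reward model r_theta:
P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)

% Reward-model training minimizes the negative log-likelihood over a
% dataset D of pairwise comparisons:
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```

Everything the rater cared about, whether accuracy, style, or tone, is collapsed into the single scalar difference inside the sigmoid, which is exactly the lossy compression being discussed.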
Starting point is 02:24:14 So I mostly would want people that get excited about the math to come and have things they can stumble into and learn this kind of broader context. I think there's a fun thing. I just keep a list of all the tech reports I like of reasoning models. So in chapter 14, which is kind of a short summary of RLVR, there's just a gigantic table where I list every single reasoning model that I like. I think in education, a lot of it needs to be like that at this point, what I like, because the language models are so good at the math. There's a famous paper, direct preference optimization, which is a much simpler way of solving the problem than RL. The derivations in the appendix skip steps of math. And it's like I tried for this book,
Starting point is 02:24:55 like I redid the derivations and I'm like, what the heck is this log trick that they use to change the math? But doing it with language models, they're like, this is the log trick. And I'm like, I don't know if I like this, that the math is so commoditized. I think some of the struggle in reading this appendix
Starting point is 02:25:11 and following the math, I think is good for learning. And I, yeah, so we're actually returning to this often just on the topic of education. You both have brought up the word struggle quite a bit. So there is value. If you're not struggling as part of this process,
Starting point is 02:25:29 you're not fully following the proper process for learning, I suppose. Some of the providers are starting to work on models for education. Actually, I haven't used them, but I would guess they're designed to not give all the information at once and make people work to get it. So I think you could train models to do this, and it would be a wonderful contribution where,
Starting point is 02:25:50 for all of this stuff in the book, you have to reevaluate every decision for it, which is such a great example. I think there's a chance you work on it at Ai2, which I was like, oh, I think this would be so fun. It makes sense. I do something like that. I did that the other day for video games, for example.
Starting point is 02:26:04 I sometimes play video games in my pastime. I like video games with puzzles, so, you know, like Zelda and Metroid. And there's this new game where I got stuck. And I really got stuck and was like, okay, you know, I don't want to struggle for two days. And so I used an LLM. But then you say, hey, please don't add any spoilers.
Starting point is 02:26:22 Just, you know, I'm here and there, what do we have to do next? The same thing you can do, I guess, for math, where you say, okay, I'm at this point, I'm getting stuck. Don't give me the full solution, but what is something I could try? You know, where you kind of carefully probe it. But the problem here is, I think it requires discipline. I mean, there are a lot of people who enjoy math,
Starting point is 02:26:42 but there are also a lot of people who need to do it for their homework. And then it's like the shortcut. And yeah, we can develop an educational LLM, but the other LLM is still there and there's still a temptation to use the other LLMs. I think a lot of people, especially in college, understand the stuff they're passionate about, they're self-aware about it, and they understand it shouldn't be easy.
Starting point is 02:27:00 I think we just have to develop good taste. We've talked about research taste; it's like school taste, about stuff that you should be struggling on and stuff you shouldn't be struggling on, which is tricky to know because sometimes you don't have good long-term vision about what would be actually useful to you in your career. But you have to develop that taste. I've been talking to maybe my fiancée or friends about this.
Starting point is 02:27:27 And it's like, there's this brief 10-year window where all of the homework and all the exams could be digital. But before that, everybody had to do all the exams in blue books because there was no other way. And now after AI, everybody's going to need to be in blue books and oral exams because everybody could cheat so easily. It's this brief generation that had a different education system where everything could be digital, but you still couldn't cheat. And now it's just going to go back. It's just very funny. You mentioned character training. Just zooming out to a more general topic: for that topic, how much compute was required? And in general, are there places where not too much compute is required, where you can actually contribute as an individual researcher?
Starting point is 02:28:12 On the character training thing, this research is built on fine-tuning about 7-billion-parameter models with LoRA, which essentially means you only fine-tune a small subset of the weights of the model. I don't know exactly how many GPU hours that would take, but it's doable. Not doable for every academic. The situation for some academics is so dire that the only work you can do is inference, where you have closed models or open models and you get completions from them and you can look at them and understand the models. And that's very well suited to evaluation, where you want to be the best at creating representative problems that the models fail on or that show certain abilities, which I think you can break through with. So I think that the top-end goal for a researcher working on evaluation,
Starting point is 02:28:58 if you want to have career momentum, is that the frontier labs pick up your evaluation. So it's like, you don't need to have every project do this. But if you go from a small university with no compute and you figure out something that Claude struggles with, and then the next Claude model has it in the blog post, there's your career rocket ship. I think that's hard, but if you want to scope the maximum possible impact with minimum compute, it's something like that, which is just: get very narrow. And it takes learning of where the models are going. So you need to build a tool that tests not where Claude 4.5 will fail. If I'm going to start a research project, I need to think where the models in eight
Starting point is 02:29:36 months are going to be struggling. But what about developing totally novel ideas? This is a tradeoff. I think that if you're doing a PhD, you could also say it's too risky to work in language models, I'm going way longer term: what is the thing that's going to define language model development in 10 years? But I end up being a person that's pretty practical. I mean, I went into my PhD like: I got into Berkeley; worst case, I get a master's and I go work in tech. So I'm very practical about it. And the life afforded to people who work at these AI companies, the amount of, like,
Starting point is 02:30:11 OpenAI's average compensation is over a million dollars in stock a year per employee. For any normal person in the U.S., getting into this AI lab is transformative for your life. So I'm pretty practical: there's still a lot of upward mobility working in language models if you're focused. And the outcomes are like, look at these jobs. But from a research perspective, the transformative impact and the academic awards
Starting point is 02:30:34 that'll crown the next Yann LeCun come from not caring about language model development very much. It's a big financial sacrifice in that case. So I get to work with some awesome students, and they're like, should I go work in an AI lab? And I'm like, you're getting a PhD at a top school, or you're going to leave to go to a lab? I don't know.
Starting point is 02:30:52 Like, if you go work at a top lab, I don't blame you. Don't go work at some random startup that might go to zero. But if you're going to OpenAI, it could be worth leaving a PhD for. Let's more rigorously think through this. So where would you give a recommendation for people to do a research contribution? So the options are academia.
Starting point is 02:31:11 So get a PhD, spend five years publishing; compute resources are constrained. There are research labs that are more focused on open-weight models, and so working there. Or closed frontier research labs: OpenAI, Anthropic, xAI, and so on. The two gradients are: the more closed, the more money you tend to get,
Starting point is 02:31:40 but also you get less credit. So in terms of building a portfolio of things that you've done, it's very clear what you have done as an academic. Versus if you are going to trade this fairly reasonable progression for being a cog in the machine, which could also be very fun. So they're very different career paths. But the opportunity cost for being a researcher is very high because PhD students are paid essentially nothing. So I think it ends up rewarding people that have a fairly stable
Starting point is 02:32:14 safety net and realize that they can operate in the long term, where they want to do very interesting work and get a very interesting job. So it is a privileged position to say, I'm going to see out my PhD and figure it out after, because I want to do this. And at the same time, the academic ecosystem is getting bombarded by funding cuts and stuff. So there are so many different tradeoffs, and I understand plenty of people that are like, oh, I can't deal with this funding search. I mean, grants got cut for no reason by the government, or I don't know what's going to happen. So I think there's a lot of uncertainty and tradeoffs that, in my opinion, favor just taking the well-paying job with meaningful impact.
Starting point is 02:32:57 It's also not like you're getting paid to sit around at OpenAI. You're building the cutting edge of things that are changing millions of people's relationship to tech. But publication-wise, they're being more secretive, increasingly so. So you're publishing less and less. And so you are having a positive impact at scale, but you're a cog in the machine. Honestly, I think it hasn't changed that much. I have been in academia. I'm not in academia anymore.
Starting point is 02:33:27 At the same time, I wouldn't want to miss my time in academia. But what I wanted to say before I get to that part: I think it hasn't changed that much. I was using AI or machine learning methods for applications in computational biology with collaborators, and a lot of people went from academia directly to Google. And I think it's the same thing. Back then, professors were, you know, sad that their students went into industry because they couldn't carry on their legacy in that sense. And I think it's the same thing; it hasn't changed, I think, that much. But, you know, cool stuff was always developed in industry that was closed.
Starting point is 02:34:08 You couldn't talk about it. And I think the difference now is, well, your preference: do you like to talk about your work and publish? Or, you know, are you more in a closed lab? That's one difference. The compensation, of course, but it's always been like that, I think. So it really depends on, you know, where you feel comfortable. And also, nothing is forever. The only thing right now is there's a third option, which is starting a startup.
Starting point is 02:34:35 A lot of people are doing startups. Very risky move. But it's a high-risk, high-reward type of situation, where joining an industry lab, I think, is pretty safe. You know, also upward mobility. Honestly, I think once you have been at an industry lab, it will be easier to find future jobs.
Starting point is 02:34:54 But then again, you know, it's like, yeah, how much do you enjoy the team and working on proprietary things versus how much do you like the publishing work? I mean, publishing is stressful. Acceptance rates at conferences can be arbitrary, which can be very frustrating, but also high reward: if you have a paper published, you feel good because your name is on there. It's a big accomplishment.
Starting point is 02:35:20 And, you know, I feel like my friends who are professors seem on average happier than my friends who work at a frontier lab, to be totally honest. That's just grounding, and the frontier labs definitely do this 996, which essentially is shorthand for working all the time. Can you describe 996? It's a culture that, I believe you could say, was invented in China and adopted in Silicon Valley. What's 996? It's 9 a.m. to 9 p.m.
Starting point is 02:35:46 6 days a week. 6 days a week. What is that? 72 hours. Okay. So is this basically the standard in AI companies in Silicon Valley? More and more of this kind of grind mindset? Yeah, I mean, not maybe not exactly like that,
Starting point is 02:36:01 Yeah, I mean, maybe not exactly like that, but I think there is a trend towards it. And it's interesting. I think it almost flipped, because when I was in academia, I felt like that: as a professor, you had to write grants, you had to teach, and you had to do research. It's like three jobs in one, and it is more than a full-time job if you want to be successful. And I feel like now, like Nathan just said, the professors, in comparison, I think have maybe even less pressure or workload than at a frontier lab. They work a lot. They're just so fulfilled,
Starting point is 02:36:31 but working with students and having a constant runway of mentorship and a mission that is very people-oriented, in an era when things are moving very fast and very chaotic, is very rewarding to people. Yeah, and at a startup, I think it's pressure. It's like you have to make it, and it is really important that people put in the time. But it is really hard because you have to deliver constantly. And I've been at a startup. I had a good time, but I don't know if I could do it forever.
Starting point is 02:37:00 It's an interesting pace. And exactly like we talked about in the beginning, these models are leapfrogging each other, and they are just constantly trying to take the next step compared to the competitors. It's just ruthless, I think, right now. I think this leapfrogging nature of having multiple players is actually an underrated driver of language modeling progress,
Starting point is 02:37:20 where competition is so deeply ingrained in people, and these companies have intentionally created very strong cultures. Anthropic is known to be culturally deeply committed and organized. I mean, we hear so little from them, and everybody at Anthropic seems very aligned. And being at a culture that is super tight and having this competitive dynamic, talk about a thing that's going to make you work hard and create things that are better. But that comes at the cost of human capital, where you can only do this for so long, and people are definitely burning out.
Starting point is 02:38:00 I wrote a post on burnout. I've tread in and out of this myself, especially trying to, like, be a manager of model training. It's a crazy job doing this. In the book Apple in China by Patrick McGee, he talked about how hard the Apple engineers worked to set up the supply chains in China.
Starting point is 02:38:17 And he was like, they had saving-marriage programs. And he said in a podcast that people died from this level of working hard. So I think it's just a perfect environment for creating progress based on human expense. And there's going to be a lot of,
Starting point is 02:38:35 the human expense is the 996 that we started this with, where people do really grind. I also read this book. I think they had a code word for when someone had to go home to spend time with their family to save the marriage. And it's crazy. Then colleagues say, okay, this is a red-alert situation.
Starting point is 02:38:52 We have to let that person go home this weekend. But at the same time, I don't think they were forced to work. It's really that they were so passionate about the product, I guess, that you get into that mindset. And I had that sometimes as an academic, but also as an independent person I have that sometimes. I overwork and it's unhealthy. I had back issues, I had neck issues, because I did not take the breaks that I maybe should have taken. But no one forced me to.
Starting point is 02:39:17 It's because I wanted to work, because it's exciting stuff. That's how it is at OpenAI and Anthropic. They want to do this work. Yeah, but there's also a feeling, a fervor that's building, especially in Silicon Valley, aligned with the scaling laws idea, where there's this hype that the world will be transformed on a scale of weeks and you want to be at the center of it. And then, you know, I have this great fortune of having conversations with a wide variety of human beings. And from there, I get to see all these bubbles and echo chambers
Starting point is 02:39:48 across the world. And it's fascinating to see how we humans form them. And I think it's fair to say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble. I think bubbles are actually really useful and effective. It's not necessarily a negative thing because you could be ultra-productive. It could be the Steve Jobs' reality distortion field because you just convince each other,
Starting point is 02:40:14 the breakthroughs are imminent, and by convincing each other of that, you make the breakthroughs imminent. Byrne Hobart wrote a book classifying bubbles; essentially, one of them is financial bubbles, which is speculation, which is bad. And the other one, I don't know the term, is effectively for buildouts, because it pushes people to build these things. And I do think AI is in this, but I worry about it transitioning to a financial bubble. Yeah, but also in the space of ideas,
Starting point is 02:40:40 that bubble, you are doing a reality distortion field, and that means you are deviating from reality. And if you go too far from reality, while also working, you know, 996, you might miss some fundamental aspects of the human experience. This is a common problem in Silicon Valley. It's a very specific geographic area. You might not understand the Midwest perspective, the full experience of all the other different humans
Starting point is 02:41:11 in the United States and across the world. And you speak a certain way to each other, you convince each other of a certain thing, and that can get you into real trouble. Whether AI is a big success and becomes a powerful technology, or it's not, in either trajectory you can get yourself into trouble.
Starting point is 02:41:29 So you have to consider all of that. Here you are, a young person trying to decide what you want to do with your life. The thing is, I don't even really understand this, but the SF AI memes have gotten to the point where permanent underclass was one of them, which was the idea that the last six months of 2025
Starting point is 02:41:47 was the only time to build durable value in an AI startup or model; otherwise all the value will be captured by existing companies and you will therefore be poor. That's an example of the SF thing that goes so far. I still think, for young people wanting to be able to tap into it, if you are really passionate about wanting to have impact in AI, being physically in SF is the most likely place where you're going to do this, but it has tradeoffs.
Starting point is 02:42:14 I think SF is an incredible place, but there is a bit of a bubble. And if you go into that bubble, which is extremely valuable, just get out also. Read history books, read literature, visit other places in the world. Twitter is not, and Substack is not, the entire world. I would say one of the people I worked with is moving to SF, and it's like, I need to get him a copy of Season of the Witch, which is a history of SF from about 1960 to 1985, which goes through the hippie revolution, gays kind of taking over the city and that culture emerging, and then the HIV/AIDS crisis and other
Starting point is 02:42:54 things. And it's just like, that is so recent and so much turmoil and hurt, but also love, in SF. And no one knows about this. It's a great book, Season of the Witch. I recommend it. A bunch of my SF friends who do get out recommended it to me. And it's just like, living there, I lived there and I didn't appreciate this context. And it's just so recent. Yeah. Okay, we talked a lot about a lot of things,
Starting point is 02:43:24 certainly about the things that were exciting last year. But this year, one of the things you guys mentioned is exciting is the scaling of text diffusion models, and it's just a different exploration, text diffusion. Can you talk about what that is and what possibility it holds, sort of different kinds of approaches
Starting point is 02:43:43 than the current LLMs? Yeah, so we talked a lot about the transformer architecture, and the auto-regressive transformer architecture specifically, like GPT. But it doesn't mean no one is working on anything else. So people are always on the, let's say, lookout for the next big thing, because I think it would be almost, yeah, stupid not to. Because sure, right now the transformer architecture is the thing and it works best, and there's right now nothing else out there. But, you know, it's always a good idea to not put all your eggs into one basket.
Starting point is 02:44:13 So people are developing other things, alternatives to the auto-regressive transformer. One of them would be, for example, text diffusion models. And listeners may know diffusion models from image generation; Stable Diffusion popularized it. There was a paper on generating images. Back then, people used GANs, generative adversarial networks, and then there was this diffusion process where you iteratively denoise an image, and that resulted in really good quality images over time.
Starting point is 02:44:41 Stable Diffusion came from a company; other companies built their own diffusion models, and then people are now like, okay, can we try this also for text? It doesn't, you know, make intuitive sense at first, because it's not something continuous like a pixel that we can differentiate; it's discrete text. So how do we
Starting point is 02:44:56 implement that denoising process? But it's kind of similar to the BERT models by Google. When you go back to the original transformer, you had the encoder and the decoder. The decoder is what we are using right now in GPT and so forth.
Starting point is 02:45:11 The encoder is more like a parallel, let's say, technique, where you have multiple tokens that you fill in in parallel. So GPT models do auto-regression, one token at a time: you complete the sentence one token at a time. And in BERT models, you have a text, let's say a sentence, that has gaps. You mask them out, and then one iteration is filling in these gaps. And text diffusion is kind of like that, where you are starting with, let's say, some random text, and then you are filling in the missing parts, or you are refining them iteratively, and you have multiple iterations.
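A toy sketch of the masked-infilling loop just described. Everything here is illustrative: a real text diffusion model would use a trained network to predict all masked tokens in parallel, whereas the stand-in `toy_denoiser` below just fills blanks from a fixed vocabulary.

```python
import random

MASK = "_"

def toy_denoiser(tokens):
    # Hypothetical stand-in for a trained network: propose a word for
    # every masked position (in parallel), with a confidence score.
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (vocab[i % len(vocab)], random.random())
            for i, t in enumerate(tokens) if t == MASK}

def generate(length=5, steps=3):
    tokens = [MASK] * length          # start fully masked ("pure noise")
    for _ in range(steps):
        proposals = toy_denoiser(tokens)
        if not proposals:
            break
        # Commit only the most confident half each step; the rest stay
        # masked and are refined in later iterations.
        keep = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for i, (word, _score) in keep[: max(1, len(keep) // 2)]:
            tokens[i] = word
    # Final pass: fill any remaining masks.
    for i, (word, _score) in toy_denoiser(tokens).items():
        tokens[i] = word
    return tokens

print(generate())  # → ['the', 'cat', 'sat', 'on', 'mat']
```

The key property is that each iteration commits several tokens at once, so the number of model calls scales with the number of denoising steps, not with the sequence length.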
Starting point is 02:45:46 And the cool thing here is that this can do multiple tokens at the same time. So it's kind of like the promise of having it be more efficient. Now the trade-off is, of course, well, how good is the quality? It might be faster. And then now you have this dimension of the denoising process: the more steps you do, the better the text becomes. And, you know, you can scale in different ways. People try to see if that is maybe a valid alternative to the auto-regressive model in terms of giving you
Starting point is 02:46:16 the same quality for less compute. Right now, I think there are papers that suggest, okay, if you want to get the same quality, you have to crank up the denoising steps, and then you end up spending the same compute you would spend on an auto-regressive model. The other downside is, well, it's parallel, which sounds appealing, but some tasks are not parallelizable, like, you know, reasoning tasks, or tool use maybe, where you have to ask a code interpreter to give you an intermediate result, and that is kind of tricky with diffusion models. So there are some hybrids, but the main idea is: can we parallelize it? So, an interesting avenue.
Starting point is 02:46:51 I think right now there are mostly research models out there, like LLaDA and some other ones. I saw some by startups, some deployed models. There is no big diffusion model at scale yet, like, you know, Gemini or ChatGPT scale, at that level. There was an announcement by Google, or a site where they said they are launching Gemini Diffusion, and they put it into the context of their, I think, Nano 2 model, and they said basically, for the same quality on most benchmarks, we can generate things much faster. So you mentioned what's next. I don't think the text diffusion model is going to replace auto-regressive LLMs, but it will be
Starting point is 02:47:29 something maybe for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that. I think there are a couple of examples where I've heard that it's actually started to be used. To paint an example of why this is so much better: when GPT-5 is taking 30 minutes to respond, it's generating one token at a time. And this diffusion idea is essentially to generate all of the tokens in the completion in one batch, which is why it could be way faster. And I think it can be suited. The startups I'm hearing about are code startups, where you have a code base and you have somebody that's effectively vibe coding, and they say, make this change.
Starting point is 02:48:08 And a code diff is essentially a huge reply from the model, but it doesn't have to have that much external context, and you can get it really fast by using these diffusion models. So that's one example I've heard of: they use text diffusion to generate really long diffs, because doing it with an auto-regressive model would take minutes, and that latency for a user-facing product causes a lot of churn. Every second, you lose a lot of users.
Starting point is 02:48:32 So I think that's going to be this thing where it's going to grow and have some applications, but I actually thought that different types of models were going to be used for different things sooner than they have been. So there's a tradeoff. I think that the tool use point is the one that's stopping them from being the most general purpose, because with Claude Code and ChatGPT with search, the auto-regressive chain is interrupted with some external tool, and I don't know how to do that with the diffusion setup. So what's the future of tool use this year and then in the coming years? Do you think there's going to be a lot of development there, in how that's integrated into the entire stack? I do think right now, I mean, it's mostly on the proprietary LLM side,
Starting point is 02:49:14 but I think we will see more of that in the open source tooling. And I think, I mean, it is a huge unlock, because then you can really outsource certain tasks from just memorization to actual tools. You know, instead of having the LLM memorize what is 23 plus 5, just use a calculator. So you think that can help solve hallucination? Not solve it, but reduce it.
Starting point is 02:49:36 So the LLM still needs to know when to ask for a tool call. And the second one is, well, it doesn't mean the internet is always correct. You can do a web search, but let's say I asked who won the World Cup in, let's say, 1998. It still needs to find the right
Starting point is 02:49:53 website and get the right information. So you can still go to the incorrect website and give me incorrect information. So I don't think it will fully solve that, but it is improving it in that sense. And another cool paper earlier this year, I think it was December 31st, so it's not technically 2026, but close: the recursive language model.
Starting point is 02:50:16 That's a cool idea to take this even a bit further. So just to explain: Nathan, you also mentioned earlier it's harder to do cool research in academia because of the compute budget. If I recall correctly, they did everything with GPT-5, so they didn't even use local models. But the idea is, let's say you have a long-context task: instead of having the LLM solve all of it in one shot, or even in a chain, you break it down into subtasks. You have the LLM decide what is a good, let's say, subtask, and then recursively call an LLM to solve that. And I think something like that, also then adding tools: you know, maybe you have a huge Q&A task, each part goes to the web and gathers information, and then you pull it together at the end and stitch it back together.
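A minimal sketch of the recursive decomposition just described. The actual paper's prompting strategy is more sophisticated; `call_llm` here is a hypothetical stand-in for a real model call (e.g., to GPT-5), implemented as simple truncation so the example runs.

```python
MAX_CHUNK = 200  # context budget per call, in characters (toy value)

def call_llm(prompt: str, context: str) -> str:
    # Placeholder: a real implementation would query a language model
    # with the prompt and the (small) context.
    return context[:50]

def recursive_answer(question: str, context: str) -> str:
    if len(context) <= MAX_CHUNK:
        return call_llm(question, context)   # base case: fits in one call
    mid = len(context) // 2
    left = recursive_answer(question, context[:mid])    # recurse on halves
    right = recursive_answer(question, context[mid:])
    # The combining call sees partial answers, never the raw long context.
    return call_llm(question, left + " " + right)

answer = recursive_answer("Summarize.", "lorem ipsum " * 100)
print(answer)
```

The point is that no single call ever sees more than `MAX_CHUNK` characters; the long context is handled by the recursion structure rather than by the model's context window.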
Starting point is 02:50:56 Where I think there's going to be a lot of unlock is using things like that, where you don't necessarily improve the LLM itself; you improve how the LLM is used and what the LLM can use. One downside right now with tool use is you have to give the LLM permission to use tools. And that will take some trust, especially if you want to unlock things like having an LLM answer emails for you, not even answer, but just sort them for you or select them for you or something like that.
Starting point is 02:51:30 I don't know if I would today give an LLM access to my emails. I mean, it's a huge risk. One last point on the tool use thing, a cool one. I think you hinted at this, and we've both come at this in our own ways: the open versus closed models use tools in very different ways. With open models, people go to Hugging Face and download the model, and then the person's going to be like, oh, what tool do I want? And, I don't know, Exa is my preferred search provider,
Starting point is 02:51:56 but somebody else might care for a different search startup. When you release a model, it needs to be useful for multiple tools, for multiple use cases, which is really hard because you're making a general reasoning-engine model, which is actually what GPT-OSS is good for. But with the closed models, you're deeply integrating the specific tool into your experience. And I think that open models will struggle to replicate some of the things
Starting point is 02:52:18 that I like to do with closed models, which will be like, I don't know, you can reference a mix of public and private information. And something that I keep trying every three to six months: I try Codex on the web, which is just prompting a model to make an update to some GitHub repository that I have. And that kind of secure cloud environment is just so nice for, like, send it off to do this thing and then come back to me. And these will probably help define some of the local, open, and closed niches. But I think initially, because there was
Starting point is 02:52:53 such a rush to get this tool use working, the open models were on the back foot, which is kind of inevitable. There's so much research, there are so many resources in these frontier labs. But it'll be fun when the open models solve this, because it's going to necessitate a bit more flexible and potentially interesting model that might work with this recursive idea, to be an orchestrator
Starting point is 02:53:11 and a tool-use model. So hopefully the necessity drives some interesting innovation there. So, continual learning. This is a long-standing topic, an important problem, and I think it increases in importance as the cost of training
Starting point is 02:53:27 of the models goes up. So can you explain what continual learning is and how important it might be this year and in the coming years to make progress? This relates a lot to the kind of SF zeitgeist of what is AGI, which is artificial general intelligence, and what is ASI, artificial superintelligence, and what are the language models that we have today capable of doing? I think the language models can solve a lot of tasks, but a key milestone among the AI community is essentially when AI could replace any remote worker, taking in information and solving digital tasks. And the limitation that's highlighted by people is that a language model will not learn from feedback the same way that an employee does. So if you hire an editor, the editor will mess up,
Starting point is 02:54:13 but you will tell them. And if you hired a good editor, they don't do it again. But language models don't have this ability to modify themselves and learn very quickly. So the idea is, if we're going to actually get to something that is a true, general, adaptable intelligence that can go into any remote-work scenario, it needs to be able to learn quickly from feedback, on-the-job learning. I'm personally more bullish on just providing language models with very good context. You maybe said offline that you can write extensive documents for models where you say, I have all this information, here are all the blog posts I've ever written, I like this type of writing, my voice is based on this,
Starting point is 02:54:51 but a lot of people don't provide this to models, and the models weren't designed to take this amount of context previously; the agentic models are just starting to. So it's this kind of trade-off: do we need to update the weights of this model, with this continual learning thing, to make them learn fast? Or the counter-argument is, we just need to provide them with more context and information,
Starting point is 02:55:11 and they will have the appearance of learning fast by just having a lot of context and being very smart. So we should mention the terminology here. Continual learning refers to changing the weights continuously so that the model adapts and adjusts based on the new incoming information, and does so continually and rapidly and frequently. And then the thing you mentioned on the other side of it is generally referred to as in-context learning. As you learn stuff, there's a huge context window; you can just keep loading it with extra information every
Starting point is 02:55:47 time you prompt the system. I think both can legitimately be seen as learning; it's just a different place where you're doing the learning. To be honest with you, continual learning, the updating of weights, we already have that in different flavors. I think the distinction here is, do you do that on a personalized custom model for each person, or do you do it at a global model scale? And I think we have that already with going from GPT-5 to 5.1 and 5.2.
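The in-context side of that trade-off can be sketched minimally: no weights change, and the "learning" is just accumulated feedback text prepended to every prompt (the class and method names here are illustrative, not a real API):

```python
class InContextMemory:
    """'Learning' without weight updates: feedback accumulates as context."""
    def __init__(self) -> None:
        self.notes: list[str] = []

    def add_feedback(self, note: str) -> None:
        # An employee would internalize this; the model just stores text.
        self.notes.append(note)

    def build_prompt(self, task: str) -> str:
        # Every prompt carries the full accumulated feedback along with it.
        context = "\n".join(f"- {n}" for n in self.notes)
        return f"Previous feedback:\n{context}\n\nTask: {task}"

mem = InContextMemory()
mem.add_feedback("Prefer short sentences.")
mem.add_feedback("Keep my usual voice; see my past blog posts.")
print(mem.build_prompt("Edit this paragraph."))
```

The cost shows up at inference time: the growing context is re-sent (or cached) on every call, instead of being baked into the weights once.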
Starting point is 02:56:20 It's maybe not immediate, but it is like a curated update, a quick curated update, where there was feedback on the things it couldn't do, feedback from the community, and they updated the weights in the next model, and so forth. So it is kind of a flavor of that. Another, even finer-grained example is RLVR. You run it, it updates.
Starting point is 02:56:41 The problem is you can't just do that for each person, because it would be too expensive to update the weights for each person. And I think that's the problem. Even at OpenAI scale, building data centers, it would be too expensive, I think. That is only feasible once you have something on the device, where the cost is on the consumer, like what Apple tried to do with the Apple Foundation models, putting them on the phone, and then they learn from the experience. A bit of a related topic, and maybe an anthropomorphizing term, but: memory.
Starting point is 02:57:14 What are the different ideas for the mechanism of how to add memory to these systems, as you're increasingly seeing? Personalized memory, especially. So right now it's mostly context: basically stuffing things into the context and then just recalling that. But again, it's expensive, because, I mean, you can cache it, but you still spend tokens on that. And the second one is you can only do so much.
Starting point is 02:57:41 I think it's more like a preference or a style. I mean, a lot of people do that when they solve math problems: basically, you can add previous knowledge and stuff, but you also give it certain preference prompts, do what I preferred last time, something like that. But it doesn't unlock new capabilities. So for that,
Starting point is 02:58:01 one thing people do still use is LoRA, LoRA adapters. Basically, instead of updating the whole weight matrix, there are two smaller weight matrices that you kind of have in parallel, or overlay; it's like the delta. You can do that to some extent, but then again, it is economics. So there were also papers, for example,
Starting point is 02:58:23 LoRA Learns Less and Forgets Less. It's, you know, no free lunch: if you want to learn more, you need to use more weights, but it gets more expensive. And then again, if you learn more, you forget more. You have to find that Goldilocks zone, basically. We haven't really mentioned it much, but implied in this discussion is context length also.
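Going back to the LoRA adapters mentioned a moment ago, here is a minimal NumPy sketch with toy dimensions: the frozen weight W gets a low-rank delta B @ A overlaid in parallel, and only the two thin matrices would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4   # rank r is much smaller than the full dims

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # small trainable matrix
B = np.zeros((d_out, r))               # zero-init, so the delta starts at zero

x = rng.normal(size=(d_in,))

# LoRA forward pass: the low-rank delta B @ A is overlaid on the frozen W.
y = W @ x + B @ (A @ x)

# Parameter count: two thin matrices instead of one full d_out x d_in matrix.
full_params = d_out * d_in        # 4096
lora_params = r * (d_in + d_out)  # 512
print(full_params, lora_params)
```

With rank 4 on a 64x64 layer, that's 512 trainable parameters instead of 4,096, which is where the economics argument comes from.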
Starting point is 02:58:43 Is there a lot of innovation possible there? I think the colloquially accepted thing is that it's a compute and data problem, plus sometimes small architecture things, like attention variants. So we talked about hybrid attention models, which is essentially when you have what looks like a state-space model within your transformer. And those are better suited because you have to spend less compute to model the furthest-along token. But those aren't free, because they have to be accompanied by a lot of compute or the right data. So how many sequences of 100,000 tokens do you have in the world? And where do you get these?
Starting point is 02:59:25 And I think it just ends up being pretty expensive to scale them. So we've gotten pretty quickly to like a million tokens of input context length, and I would expect it to keep increasing and get to like two million or five million this year. But I don't expect it to go to like a hundred million. That would be a true breakthrough. And I think those breakthroughs are possible. Like the continual learning thing, I think of it as a research problem where there could be a breakthrough that just makes transformers work way better at this, and it's cheap.
Starting point is 02:59:52 Like, these things could happen with so much scientific attention, but turning the crank, it'll be consistent increases over time. I think, also, looking at the extremes, there's again no free lunch. At one extreme, to make it cheap, you have, let's say, an RNN that has a single state where you save everything from the previous tokens. It's a specific fixed-size thing, so you never really grow the memory, because you are stuffing everything into one state. But then the longer the context gets, the more information you forget,
Starting point is 03:00:23 because you can't compress everything into one state. Then on the other end, you have the transformers, which try to remember every token, which is great sometimes, when we want to look up specific information, but very expensive, because you have the KV cache that grows, the dot product that grows.
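Some rough arithmetic on that growth, using assumed toy dimensions rather than any specific model: the KV cache scales linearly with context length, which is exactly the cost a fixed-size recurrent state avoids:

```python
# Assumed toy dimensions, not any specific model.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(context_len: int) -> int:
    # Keys and values (factor 2), per layer, per head, per cached token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB of KV cache")
```

Under these assumptions a million-token context needs on the order of a hundred gigabytes of cache per sequence, while an RNN-style state stays constant no matter the length.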
Starting point is 03:00:38 But then, yeah, like you said, the Mamba layers, they kind of have the same problem as an RNN, I would say: you try to compress everything into one state. You're a bit more selective there. But then I think it's this Goldilocks zone again. With Nemotron 3, they found a good ratio of how many attention layers you need for the global information, where everything is accessible, compared to having these compressed states. And I think that's how we will scale more: by finding better ratios, a Goldilocks zone, between compute, making it cheap enough to run,
Starting point is 03:01:13 but then also making it powerful enough to be useful. And one more plug here: the recursive language model paper is one of the papers that tries to address the long-context thing. What they found is essentially that instead of stuffing everything into this long context, if you break it up into multiple smaller tasks, so you save memory by having multiple smaller calls, you can actually get better accuracy than having the LLM try everything all at once.
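A rough sketch of that break-it-up pattern, with `llm` stubbed out rather than calling a real model; this illustrates the general idea, not the paper's exact method:

```python
def llm(prompt: str) -> str:
    # Stub for a model call; a real version would hit an API.
    return f"summary({len(prompt)} chars)"

def recursive_answer(question: str, document: str, chunk_size: int = 1000) -> str:
    # Break the long context into smaller pieces...
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # ...make one smaller call per piece...
    partials = [llm(f"{question}\n---\n{chunk}") for chunk in chunks]
    # ...then one final call stitches the partial answers together.
    return llm(question + "\n" + "\n".join(partials))

print(recursive_answer("What changed?", "x" * 3500))
```

Each individual call sees a short prompt, so no single context window ever has to hold the whole document.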
Starting point is 03:01:41 I mean, it's a new paradigm. We will see; there might be other flavors of that. So I think with that, we will still make improvements on long context. But then also, like Nathan said, I think the problem is that for pre-training itself, we don't have as many long-context documents as other documents. So it's harder to study, basically, how LLMs behave at that level. There are some rules of thumb where essentially you pre-train the language model at, like, 8K context length and then extend it to 32K with training. And there's a rule of thumb where essentially doubling the training context length takes like 2x compute, and then you can normally 2 to 4x the context length again.
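Taking that rule of thumb at face value, a quick back-of-the-envelope calculation of what repeatedly doubling the training context length would cost:

```python
# Rule of thumb from the conversation: each doubling of the training
# context length costs roughly 2x compute for the extension phase.
base_ctx = 8_192
ctx, cost = base_ctx, 1.0
while ctx < 1_048_576:
    ctx *= 2
    cost *= 2
print(ctx, cost)  # 1048576 128.0
```

Seven doublings take 8K to roughly a million tokens at about 128x the extension compute, which is one way to see why the top labs' compute increases should show up as longer context windows.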
Starting point is 03:02:23 So I think a lot of it ends up being kind of compute-bound at pre-training, which links to what we talked about: everyone talks about this big increase in compute for the top labs this year, and that should reflect in some longer context windows. But I think on the post-training side, there are some more interesting things, which is that as we have agents, the agents are going to manage this context on their own. Right now, people that use Claude Code a lot dread the compaction, which is when Claude takes its entire full 100,000 tokens of work and compacts it into a bulleted list. But what
Starting point is 03:02:51 the next models will do, and this is not novel, I'm sure people are already working on this, is essentially that the model can control when it compacts and how. So you can essentially train your RL algorithm where compaction is an action that shortens the history. And then the problem formulation will be: I want to keep the maximum evaluation score I would have gotten while the model compacts its history to the minimum length, because then you have the minimum number of tokens that you need to do this kind of compounding autoregressive prediction. So there are actually some pretty nice problem setups in this, where these agentic models learn to use their context in a different way than just plowing forward.
Starting point is 03:03:29 One interesting, also recent, example would be DeepSeek V3.2, where they had the sparse attention mechanism: they have essentially a very efficient, small, lightweight indexer, and instead of attending to all the tokens, it selects, okay, what tokens do I actually need? It almost comes back to the original idea of attention, where you are selective. But regular attention is always on: you have maybe near-zero weight on some tokens, but you use them all, whereas this is even more like, okay, let's just mask that out, or not even compute it.
Starting point is 03:03:58 And even sliding window attention is also kind of like that idea. You have that rolling window that you keep fixed, because you don't need everything all the time. Occasionally, in some layers, you might, but otherwise it's wasteful. But right now, I think, yeah, if you use everything you're on the safe side; it gives you the best bang for the buck
Starting point is 03:04:16 because you never miss information. And I think this year will also be the year of figuring out, like you said, how to be smarter about that. Right now, people want to have the next state of the art, and the state of the art happens to be the brute-force expensive thing.
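To make the sliding-window idea from a moment ago concrete, here is a minimal NumPy sketch of the attention mask; real implementations fuse this into the attention kernel rather than materializing a matrix:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where a query position may attend to a key position:
    # causal, and only within the `window` most recent tokens.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k > q - window)

print(sliding_window_mask(6, 3).astype(int))
```

Each row has at most `window` allowed positions, so the per-token attention cost stays constant instead of growing with the full sequence length.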
Starting point is 03:04:32 And then once you have that, like you said, keep that accuracy, but let's see how we can do it cheaper now, with tricks, you know. Yeah, all this scaling stuff. Like, the reason we get the Claude 4.5 Sonnet model first is because you can train it faster and you're not hitting these compute walls as soon. And they can just try a lot more things and get the model out faster, even though the bigger model is actually better.
Starting point is 03:04:54 I think we should say that there's a lot of exciting stuff going on in the AI space. My mind has recently been really focused on robotics, and today we almost entirely didn't talk about robotics. There's a lot of stuff on image generation, video generation. I think it's fair to say that the most exciting research work, in terms of the amount, intensity, fervor, is in the LLM space, which is why I think it's justified for us
Starting point is 03:05:23 to really focus on the LLMs as we're discussing. But it would be nice to bring in certain things that might be useful. For example, world models; there's growing excitement there. Do you think there will be any use this coming year for world models in the LLM space? Yes, I do think so. Also with LLMs,
Starting point is 03:05:43 what's interesting here is, I think if we unlock more LLM capabilities, it also automatically unlocks, or not unlocks, but makes progress faster in, all the other fields. Because a lot of researchers and engineers use LLMs, like we said, for coding. So even if they work on robotics, if you optimize these LLMs that help with coding, it pays off. But then, yes, world models are interesting. It's basically where you have the model run a simulation of the world, in a sense,
Starting point is 03:06:12 like a little toy version of the real thing, which can again unlock capabilities that the LLM is not aware of; it can simulate things. And I think, see, LLMs just happen to work well by pre-training and then doing next-token prediction. But we could make this a bit more sophisticated, in a sense. So what I'm saying is, there's, I think it was by Meta, a paper called Code World Model,
Starting point is 03:06:42 where they basically apply the concept of world models to LLMs: instead of just having next-token prediction and verifiable rewards checking the answer's correctness, they also make sure the intermediate variables are correct. You know, it's kind of like the model is basically learning a code environment, in a sense.
Starting point is 03:07:01 And I think this makes a lot of sense. It's just expensive to do. But it is making things more sophisticated, like modeling the whole thing, not just the result. So it can add more value. I remember when I was a grad student,
Starting point is 03:07:22 there's a competition called CASP, I think, where they do protein structure prediction. They predict the structure of a protein that has not been solved yet
Starting point is 03:07:42 at that point. So in a sense, this is actually great, and I think we need something like that for LLMs too, where you do the benchmark, but no one knows the solution; you hand in the results, and then after the fact it's revealed. But AlphaFold, when it came out, crushed, you know, this benchmark. I mean, there were also multiple iterations, but I remember the first one. I'm not an expert in that subfield,
Starting point is 03:08:00 but the first one explicitly modeled the physical interactions, the physics of the molecule, also things like the angles, impossible angles. And then in the next version, I think they got rid of this and just brute-force scaled it up. And I think with LLMs, we are currently in this brute-force scaling phase because it just happens
Starting point is 03:08:17 to work. But I do think at some point it might make sense to bring back this thing. And with world models, I think that is where it might actually be quite cool. I mean, yeah. And of course, also for robotics. That is completely
Starting point is 03:08:33 different from LLMs. Yeah, yeah, in robotics it's very explicit. So there's the problem of locomotion or manipulation. Locomotion is much more solved, especially in the learning domain. But there's a lot of value, just like with the initial protein-folding systems, in bringing in the traditional model-based methods.
Starting point is 03:08:51 So you don't, it's unlikely that you can just learn the manipulation or the whole body local manipulation problem end to end. That's the dream. but then you realize when you look at the magic of the human hand and the complexity of the real world, you realize it's really hard to learn this
Starting point is 03:09:09 all the way through, the way I guess AlphaFold 2 did. I'm excited about the robot learning space. I think it's collectively getting supercharged by all the excitement and investment in language models generally, where the infrastructure for training transformers, which is a general
Starting point is 03:09:25 modeling thing, is becoming world-class industrial tooling. So wherever that was a limitation for robotics, it's just way better. There's way more compute. And then on top of that, they take these language models and use them as kind of central units where you can do interesting explorative work around something that kind of already works. And then I see it emerging kind of like what we talked about with Hugging Face Transformers and Hugging Face. When I was at Hugging Face, I was trying to get this to happen, but it was too early: these open robotics models on Hugging Face, having people be able to contribute data and
Starting point is 03:09:59 be having people be able to contribute data. fine-tune them. I think we're much closer now that the investment in robotics and I think self-driving cars is related and enables this, where it's like once you get to the point where you can have this sort of ecosystem where somebody can download a robotics model and maybe fine-tune it to their robot or share data sets across the world, and there's some data, there's some work in this area like RTX, I think, is a few years ago where people are starting to do that. But I think once they have this ecosystem, it'll look very different. And then this whole post-chatGATGBTBT boom is putting more resources into that, which I think is a very good area for doing research.
Starting point is 03:10:34 This is also resulting in much better, more accurate, more realistic simulators being built, closing the sim-to-real gap in the robotics space. But, you know, you mentioned a lot of excitement in the robotics space and a lot of investment. The downside of that, which happens in hype cycles: I personally believe, and most robotics people believe, that robotics is not going to be solved on the time scale that's being kind of implicitly promised. And so what happens when there are all these robotics companies that spring up and then they don't have a product that works? Then there's going to be this kind of crash of excitement, which is nerve-wracking.
Starting point is 03:11:19 There's hopefully something else will come in and keep swooping in so that the continued development of some of these ideas keeps going on. It's also related to the continual learning issue, essentially, where the real world is so complex, where with L&Ms, yeah, you don't need to really have something learn for the user because there are a lot of things everyone has to do. Everyone maybe wants to, I don't know, fix their grammar in their email or code or something like that. It's more constrained, so you can kind of prepare the model for that. But preparing the robot for the real world, that's harder. I mean, you have the foundation models, the robotic foundation models. but you can learn certain things like grasping things.
Starting point is 03:11:58 But then again, I think everyone's house is different. You know, like it's so different. And that is, I think, where the robot would have to learn on the job, essentially. And I think that, I guess, is the bottleneck right now, like how to, you know, customizing it on the fly, essentially. I don't think I can possibly understand the importance of the thing that doesn't get talked about almost at all by robotics folks or anyone is safety. All the interesting complexities we talk about learning, all the failure modes and failure cases, everything we've been talking about LLM, sometimes it fails in this interesting ways. All of that is fun and games in the LLM space. In the robotic space, in people's homes, across millions of minutes, billions of interactions, you really are almost allowed to fail never.
Starting point is 03:12:50 When you have embodied systems that are put out there in the real world, you just have to solve so many problems you never thought you'd have to solve when you're just thinking about the general robot learning problem. And so bearish on in-home learned robots for consumer purchase. I'm very bullish on self-driving cars, and I'm very bullish for robotic automation, e.g. Amazon distribution, where Amazon has built whole new distribution centers, designed for robots first rather than humans.
Starting point is 03:13:22 There's a lot of excitement in AI circles about AI enabling automation and, like, mass-scale manufacturing. And I do think that the path to robots doing that is more reasonable, where it's like a thing that is designed and optimized to do a repetitive task that a human could conceivably do, but doesn't want to. And then I'm much, but it's also going to take a lot longer than people probably predict. I think the leap from AI singularity to we can now scale up mass manufacturing in the U.S. because we have a massive AI advantage is one that is troubled by a lot of political and other challenging problems.
Starting point is 03:14:04 Let's talk about timelines, specifically timelines to AGI or ASI. Is it fair, like, as a starting point, to say that nobody really agrees on the definition. of EGA and ESA? I kind of think there's a lot of disagreement, but among, I've been getting pushback where a lot of people kind of say the same thing, which is like a thing that could reproduce most digital economic work. So like the remote worker is a fairly reasonable example. And I think Open AI's definition is somewhat related to that, which is like an AI that
Starting point is 03:14:39 can do a lot of economic, like a certain number of economically valuable tasks, which I don't really love as a definition. but I think it could be a grounding point because language models today, while immensely powerful, are not this remote worker drop-in. And there are things that you could think of that could be done by an AI that are way harder than remote work, which are like solving a, finding an unexpected scientific discovery that you couldn't even pause it, which would be an example of something that somebody says is like an artificial superintelligence problem.
Starting point is 03:15:10 or like taking in all medical records and finding linkages across certain illnesses that people didn't know or figuring out that some common drug can treat some niche cancer. Like they would say that that is like a super intelligence thing. So these are kind of natural tears. My problem with it is that it becomes deeply entwined with like the quest for meaning of AI and this religious aspects to it. So there's kind of different, there's different paths you can take it. And I don't even know if the remote work is a good definition.
Starting point is 03:15:43 What exactly is that? It's like perfect tool use. I actually, I mean, I like, I don't know if you like the originally titled AI 27 report. They focus more on code and research taste. So the target there is the superhuman coder. So they have several, several milestone systems. Superhuman coders, superhuman AI researcher, then super intelligent AI researcher, than the full ASI, artificial superintelligence.
Starting point is 03:16:11 But after you develop the superhuman coder, everything else falls quickly. There, the task is to have a fully autonomous, like automate coding. So any kind of coding you need to do in order to perform research is fully automated. And from there, humans would be doing AI research together with that system,
Starting point is 03:16:35 and they would quickly be able to develop, a system that actually can do the research for you. That's the idea. And initially their prediction was 2027-28, and now they've pushed it back by three to four years to 2031, mean prediction. Probably my prediction is even beyond 2031, but at least you can in a concrete way
Starting point is 03:16:59 think about how difficult it is to fully automate programming. Yeah, I does agree with some of their presumptions and dynamics on how it would play out. But I think they did good, they did good work in the scenario, defining milestones that are concrete and to tell a useful story, which is why the reach for this AI 2027 document
Starting point is 03:17:18 well-transcended Silicon Valley is because they told a good story and they did a lot of rigorous work to do this. I think the camp that I fall into is that, like, AI is like so-called jagged, which will be excellent at some things and really bad at something. So I think that when they're close to this, automated software engineer, what it will be good at is that
Starting point is 03:17:39 traditional ML systems in front end, the model is excellent at, but the distributed ML, the models are actually really quite bad at because there's so little training data on doing large-scale, distributed learning, and things. And this is something that we already see, and I think this is just getting amplified.
Starting point is 03:17:54 And then it's kind of messier in these tradeoffs, and then there's, like, how do you think AI research works and so on? So you think basically superhuman coder is almost unachievable, meaning like because of the jagged nature of the thing, you're just always going to have gaps in capabilities. I think it's assigning completeness to something where the models are kind of superhuman at some types of code, and I think that will continue.
Starting point is 03:18:18 And people are creative, so they'll utilize this incredible abilities and to fill in the weaknesses of the models and move really fast. There'll always kind of be this, I've received for a long time, this dance between the humans are enabling this thing that the model can't do
Starting point is 03:18:32 and the best AI researchers are the ones that kind of enable this superpower. And I think this aligns like to what we already see. I think like Claude for building a website, you can stand up a beautiful website in a few hours or do data analysis. And I don't think it's going to keep getting better at these things and it'll pick up some new code skills and stuff
Starting point is 03:18:49 that it'll get along the way and kind of linking to what's happening in big tech is like this AI 2027 report is like, it leans into the singularity idea where I think research is messy, and social and largely in the data in ways that AI models can't process. But what we do have today is really powerful. And these tech companies are all collectively buying into this
Starting point is 03:19:14 with tens of billions of dollars of investment. So, like, we are going to get some much better version of ChatGPT, a much better version of Cloud Code than we already have. I think that it's just, like, hard to predict where that is going. But the, like, bright clarity of that future is why some of the most powerful people in the world are putting so much money into this. And I think it's just kind of small differences between, like, we don't actually know what a better version of chatubt is, but also like, can it automate AI research? I would say probably not, at least in this time frame. Like, big tech is going to
Starting point is 03:19:48 spend $100 billion much faster than we get a automated AI researcher that enables a AI research singularity. So you think your prediction would be what, like, if this is even a useful milestone, were more than 10 years out. I would say less than that on the software side, but I think longer than that on the things like research. It's just like, for fun, try to imagine a world where all software writing is fully automated. Can you imagine that world?
Starting point is 03:20:19 By the end of this year, the amount of software that will be automated will be so high. But it's like, it'll be the things of like you're trying to train a model with RL and you need to have multiple bunches of GPUs, communicating with each other, that'll still be hard, but I think it'll be much easier. One of the ways to think about this, so the full automation of programming, is just think of, lines of useful code written, the fraction of that to the number of humans in the loop. So presumably there'll be, for a long time, humans in the loop of software writing is just
Starting point is 03:20:52 be fewer and fewer relative to the amount of code written, right? And the SC superhuman code, I think the presumption there is it goes to zero, the number of humans in the loop. What does that world look like when the number of humans in the loop is in the hundreds, not in the hundreds of thousands? I think software engineering will be driven more to system design and goals of outcomes, where I do think software is largely going to be. I think this has been happening over the last few weeks where people have gone from a month ago of like, Oh, AI agents are kind of slop, which is a famous carpety quote to like the, what is a little bit of a meme of like the industrialization of software, when anyone can just create software at their fingerprints.
Starting point is 03:21:37 Like I do think we are closer to that side of things. And it takes direction and like understanding how the systems work to extract that best from the language models. And I think it's hard to like accept the gravity of how much is going to change with software development and how many more people can do things without ever looking at it. I think what's interesting is to think about whether these systems will be independent, like completely independent in the sense that, well, I have no doubt that LODAMS will kind of at some point solve coding in a sense, like calculators solve calculating, right? So at some point humans develop a tool that, you know, you never need a human to calculate that number. You just type it in and it's an algorithm.
Starting point is 03:22:16 You can do it in that sense. And I think that's the same, probably, for coding. But the question is, I think what will happen is, yeah, you will just say, build that website, it will make a really good website, and then you maybe refine it. But will it do things independently? So will you still have humans asking the AI to do something? Like, will there be a person saying, build that website,
Starting point is 03:22:40 or will there be AI that just builds websites or something, or whatever? I think talking about building websites is the... Too simple. It's just, like, there's the problem with websites and the problem with the web, you know, HTML and all that kind of stuff. It's very resilient to slop. It will show you slop just as readily as it shows you anything good. I would rather think of safety-critical systems,
Starting point is 03:23:04 like asking AI to end-to-end generate something that manages logistics, or manages cars, a fleet of cars, all that kind of stuff. So end-to-end generates that for you. I think a more intermediate example is take something like Slack or Microsoft Word. I think, if the organization allows it, AI could very easily implement features end-to-end and do a fairly good job for things that you want to try. You want to add the new tab in Slack that you want to use, and I think AI will be able to do that pretty well. Actually, that's a really great example. How far away are we
Starting point is 03:23:41 from that? Like this year. See, I don't know. I don't know. I guess I don't know how bad production code bases are. But I think that within, like, on the order of low years, a lot of people are going to be pushed to be more of like a designer and product manager, where you have multiple of these agents that can try things for you, and they might take one to two days to implement a feature or attempt to fix a bug, and you have these dashboards, which I think Slack is actually a good dashboard, where your agents will talk to you,
Starting point is 03:24:10 and you'll then give feedback. But things like, like, I make a website, it's like, do you want to make a logo that's passable? I think these cohesive design things, and the style, are going to be very hard for models, and deciding on what to add the next time. I just, okay, so I hang out with a lot of programmers, and some of them are a little bit on the skeptical side in general.
Starting point is 03:24:34 That's just vibe-wise, they're like that. I just think there's a lot of complexity involved in adding features to complex systems. Like if you look at the browser, Chrome, if I wanted to add a feature, if I wanted to have tabs, as opposed to up top, I want them on the left side of the interface, right?
Starting point is 03:24:55 I think we're not, it's not a next year thing. One of the Claude releases this year, one of their tests was we give it a piece of software and leave Claude to run to recreate it entirely. And it could almost rebuild Slack from scratch, just given the parameters of the software and left in a sandbox environment.
Starting point is 03:25:13 So the from-scratch part, I almost like better. So it might be that the smaller, newer companies are advantaged. And they're like, we don't have to have the bloat and complexity, and therefore this future exists. And I think this gets to the point that you mentioned, that some people you talk to are skeptical. And I think that's not because the LLM can't do X, Y, Z.
Starting point is 03:25:35 It's because people don't want it to do it this way. Some of that could be a skill issue on the human side, unfortunately, if we're honest with ourselves. And some of that could be an under-specification issue. So programming, like, you're just assuming, this is like in relationships and friendships, a communication type of issue. You're assuming the LLM somehow
Starting point is 03:25:57 is supposed to read your mind. I think this is where spec design is really important. Like, you're just using natural language to specify what you want. I think that's like, if you talk to people at the labs, they use these in their training and production code.
Starting point is 03:26:10 Like, Claude Code is built with Claude Code. And they all use these things extensively. And Dario talks about how much of Claude's code is written by Claude. And it's like, these people are slightly ahead in terms of the capabilities they have, and they probably spend on inference, they could spend 10 to 100-plus-x as much as we're spending,
Starting point is 03:26:29 like we're on a lowly $100 or $200 a month plan. Like, they truly let it rip. And I think that, with the pace of progress that we have, it seems like a year ago we didn't have Claude Code and we didn't really have reasoning models. And it's like, the difference between sitting here today and what we can do with these models. And it seems like there's a lot of low-hanging fruit
Starting point is 03:26:52 to improve them. The failure modes are pretty dumb. It's like, Claude, you tried to use a CLI command that you don't have installed 14 times, and then I sent you the command to run. It's like,
Starting point is 03:27:02 that thing, from a modeling perspective, is pretty fixable. So I agree with you. I've been becoming more and more bullish in general. Speaking to what you're articulating, I think it is a human skill issue. So Anthropic, or other companies, is leading the way
Starting point is 03:27:21 in understanding how to best use the models for programming, therefore they're effectively using them. I think there's a lot of programmers on the outskirts. They're like, I mean, there's not a really good guide on how to use them. People are trying to figure it out. It might be very expensive. Like, it might be that the entry point for that is $2,000 a month,
Starting point is 03:27:41 which is only tech companies and rich people. Just like, that could be it. But it might be worth it. I mean, if the final result is a working software system, maybe it's worth it. But by the way, it's funny how we converged from the discussion of timeline to AGI to something more pragmatic and useful. Is there anything concrete and interesting and useful and profound to be said about timeline to AGI and ASI? Or are these discussions a bit too detached from the day-to-day? There's interesting bets. So there's a lot of people trying to do
Starting point is 03:28:14 reinforcement learning with verifiable rewards, but in real scientific domains, where there's startups that are spending, like, they have hundreds of millions of dollars of funding, and they have wet labs where they're having language models propose hypotheses that are tested in the real world. And I would say that I think they're very early, or they're early, but with the pace of progress, it's like, maybe they're early by six months and they make it because they were there first, or maybe they're early by eight years. You don't really know. So I think that that type of moonshot, to branch this momentum into other sciences, is like, okay, that would be very transformative if AlphaFold moments happen in all sorts of other scientific domains, by a startup solving this.
Starting point is 03:28:57 I think there are startups. I think maybe Harmonic is one, where they're going all in on language models plus Lean for math. I think you had another podcast guest who we talked about this with recently. And it's like, we don't know exactly what's going to fall out of spending $100 million on that model. And most of them will fail, but a couple of them might be big breakthroughs that are very different than ChatGPT or Claude Code type software experiences, like a tool that's only good for a PhD mathematician, but makes them 100x more effective.
Starting point is 03:29:30 Okay, I agree. I think this will happen in a lot of domains, especially also like domains that have a lot of, you know, resources like finance and legal and pharmaceutical companies. But then again, is it really AGI again, because we are now specializing it again. And then again, is it really that much different from back in the day how we had specialized algorithms? I think it's just the same thing, way more sophisticated.
Starting point is 03:29:57 But I don't know, is there a threshold when we call it AGI, I guess? I think the real cool thing here is that we have these foundation models that we can specialize. I think that's the breakthrough, at some point. Right now, I think we are not there yet because, well, first, it's too expensive, but also, you know, OpenAI doesn't just give away ChatGPT for you to customize it. I think once that's going to be true in some way, and I can imagine this as a business model, that OpenAI may say at some point, like, hey, you know, Bank of America, for $100 million,
Starting point is 03:30:26 we will do your custom model or something like that. And I think that will be the huge economic value add. The other thing, though, is also companies, I mean, right now, what is the differentiating factor? I mean, if everyone uses the same LLM, if everyone uses ChatGPT, they will all do the same thing again. I mean, then, well, everyone is moving in lockstep, but usually companies, they want to have a competitive advantage.
Starting point is 03:30:51 And I think there's no way around using some of their private data and experimenting and maybe specializing. It's going to be interesting, yeah. Seeing the pace of progress, it does just feel like things are coming. I don't think the AGI and ASI thresholds are particularly useful. I guess the real question, and this takes us to the remote worker thing, is when are we going to see a big, obvious leap in economic impact, because currently there's not been an obvious leap in economic impact
Starting point is 03:31:25 of LLMs, for example. And that's, you know, aside from AGI or ASI or all that kind of stuff, there's a real question of, like, when are we going to see a GDP, like, jump? Yeah, it's like, what is the GDP made up of? Like, a lot of it's financial services. So, like, I don't
Starting point is 03:31:45 know what this is. It's just hard for me to think about the GDP bump. But I'd say that software development becomes valuable in a different way when you no longer have to look at the code anymore. So when it is like, Claude will make you a small business, which is essentially, Claude can set up your website, your bank account, your email, and your whatever else. And you just have to express what you're trying to put into the world. Like, that's not just an enterprise market. But it is a hard, like, I don't know how you get people to try doing that. I guess if ChatGPT can do it, like, people are trying ChatGPT.
Starting point is 03:32:21 I think it boils down to the scientific question of how hard is tool use to solve. A lot of the stuff you're applying, the remote work stuff, is tool use. It's like, computer use, how you have an LLM that goes out there, this agentic system, and does something in the world and only screws up one percent of the time. Computer use is a good example of what labs care about and we haven't seen a lot of progress on. We saw multiple demos in 2025 of, like, Claude can use your computer, or OpenAI had CUA, and they all suck. So, like, they're also investing money in this.
Starting point is 03:33:00 And I think that will be a good example where, that's actually something where it just seems, like, taking over the whole screen seems a lot harder than having an API that they can call in the back end. And some of that is you have to then set up a different environment for the model to work in. They're not working on your MacBook. They are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way than humans do. So some of those might be structural blockers. Also, specification-wise, I think the problem is also, for, you know, arbitrary tasks.
Starting point is 03:33:34 Well, you still have to specify what you want your LLM to do. And how do you do that? What is the environment, how do you specify? You can say what the end goal is, but what if it can't solve the end goal? With LLMs, if you ask it for text, you can always clarify, do sub-steps. How do you put that information into a system that, let's say, books a trip for you? You can say, well, you screwed up my credit card information,
Starting point is 03:33:57 but even to get it to that point, like, how do you, as a user, guide the model before it can even attempt that? I think the interface is really hard. Yeah, it has to learn a lot about you specifically, and this goes to continual learning,
Starting point is 03:34:15 about the general mistakes that are made throughout, and the mistakes that are made with you. All the AI interfaces are getting set up to ask humans for input.
Starting point is 03:34:25 I think Claude Code, we talk about a lot. It asks questions for feedback if it doesn't have enough specification on your plan or your desires. It starts to ask questions, like, would you rather?
Starting point is 03:34:33 We talked about memory, which saves across chats, which, its first implementation is kind of odd, where it'll mention my dog's name or something in a chat. I'm like, you didn't need to be subtle about this. Like, I don't care. But the things that are emerging are, ChatGPT has the Pulse feature, which is like a curated couple paragraphs with links, something to look at or to talk about, and people talk about how the language models
Starting point is 03:34:59 are going to ask you questions, which I think is a very, it's probably going to work. The language model is like it knows you had a doctor appointment or something. It's like, hey, how are you feeling after that? Which is like, again, goes into the territory of humans are very susceptible to this. And there's a lot of social change to come. But also, like, they're experimenting with having the models engage. Some people really like this pulse feature, which is it processes your chats and automatically searches for information and puts it in the chat GPT app. So there's a lot of things coming.
Starting point is 03:35:31 I used that feature before and I always feel bad because it does that every day and I rarely check it out. It's like, how much money, like, I mean, compute is burned on something I don't even look at, you know? It's kind of like... There's also a lot of idle compute in the world, so don't feel too bad. Okay. Do you think new ideas might be needed? Is it possible that the path to AGI, whatever that is, however we define that, to solve computer use more generally, to solve biology and chemistry and physics, sort of the Dario definition of AGI, or powerful AI, do you think it's possible that totally new ideas are needed? Non-LLM, non-RL ideas. What might they look like? Now we're going into philosophy land a little bit.
Starting point is 03:36:23 For something like a singularity to happen, I would say yes. And the new ideas can be architectures or training algorithms, which is like fundamental deep learning things. But those are by their nature pretty hard to predict. But I think we will get very far even without those advances. Like, we might get this software solution, but it might stop at software and not do computer use without more innovation. So I think a lot of progress will be coming, but if you zoom out, there are still ideas in the next 30 years that are going to look like, that was a major scientific innovation that enabled the next chapter of this. And I don't know if it comes in one year or in 15 years.
Starting point is 03:37:05 Yeah, I wonder if the bitter lesson holds, if it's true for the next 100 years, what that looks like. If scaling laws are fundamental in deep learning, I think the bitter lesson will always apply, which is, compute will become more abundant, but even within abundant compute, the ones that have a steeper scaling law slope or a better offset, like, this is a 2D plot of performance and compute,
Starting point is 03:37:28 and even if there's more compute available, the ones that get 100x out of it will win. It might be something like literally computer clusters, orbiting Earth with solar panels. The problem with that is heat dissipation. So you get all the radiation from the sun and you don't have any air to dissipate heat. But there is a lot of space to put clusters.
Starting point is 03:37:50 There's a lot of solar energy there, and there probably could be the engineering will to figure out the heat dissipation problem. So there could be. Is it possible? And we should say that it definitely is possible.
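The slope-versus-offset point above can be made concrete with a toy sketch. The saturating power-law shape, loss falling as compute^(-b) toward an irreducible floor, is the standard form from the scaling-law literature, but every coefficient below is invented for illustration and not fit to any real model family:

```python
def loss(compute, a, b, offset):
    # Saturating power law: loss falls as compute^-b toward an irreducible offset.
    return a * compute ** (-b) + offset

# Hypothetical coefficients for two model families (illustrative only).
shallow = dict(a=10.0, b=0.05, offset=1.5)  # flatter slope, better at tiny compute
steep = dict(a=20.0, b=0.15, offset=1.5)    # steeper slope, worse at tiny compute

for compute in [1e3, 1e6, 1e9, 1e12]:
    print(f"C={compute:.0e}  shallow={loss(compute, **shallow):.3f}  "
          f"steep={loss(compute, **steep):.3f}")
```

With these made-up numbers the steeper family starts out slightly worse at C=1e3 but wins everywhere past the crossover, which is the sense in which a better exponent beats merely abundant compute.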
Starting point is 03:38:03 How likely it is, is the question, that we're basically going to be plateauing this year. Not in terms of the system capabilities, but in terms of what the system capabilities actually mean for human civilization. So on the coding front, really nice websites will be built.
Starting point is 03:38:23 Very nice auto-complete. Very nice way to understand code bases and maybe help debug, but really just a very nice helper on the coding front. It can help research mathematicians do some math. It can help you with shopping. It can help you.
Starting point is 03:38:43 It's a nice helper. It's Clippy on steroids. What else? It may be a good education tool and all that kind of stuff. But computer use turns out extremely difficult to solve. So I'm trying to frame the cynical case in all these domains, where there's not a really huge economic impact. We realize how costly it
Starting point is 03:39:07 is to train these systems at every level, both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Like, is that possible? And how likely is that, do you think? When you look at the models, there's so much obvious stuff to improve. And it takes a long time to train these models and to do this art. And it'll take us, with the ideas that we have, multiple years to actually saturate, in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches. Like, the average of ChatGPT's 800 million users might not get a lot of benefit out of this, but it is going to serve different populations by getting better at different things.
Starting point is 03:39:51 But I think what everybody's chasing now is a general system that's useful to everybody. So, okay, so if that's not, that can plateau, right? I think that dream is actually kind of dying. As you talked about, the specialized models, where it's like, and multimodal is often, like, video generation is a totally different thing. That dream is kind of dying is a big statement. Because I don't know if it's dying. I don't know, if you ask the actual frontier lab people,
Starting point is 03:40:19 I mean, they're still chasing it, right? I do think they are still rushing to get the next model out, which will be much better than the, much is a relative term, but will be better than the previous one. And I can't see them slowing down. I just think the gains will be made or felt more through not only scaling the model. So I feel like there's a lot of tech debt. It's like, well, let's just put the better model in there, and better model, and better model.
Starting point is 03:40:47 And now people are, okay, let's also, at the same time, improve everything around it, like, you know, the engineering of the context and inference scaling. And the big labs will still keep doing that. And now also the smaller labs will catch up to that, because now, it's just like, they are hiring more, there will be more people, LLMs, it's kind of like, you know, like a circle, they also make them more productive,
Starting point is 03:41:09 and it's just, it's like, amplified. I think what we can expect is amplification, but not, like, a paradigm change. I don't think that will happen, but everything will be just amplified
Starting point is 03:41:20 and amplified and amplified. And I can see that continuing for a long time, yeah. Yeah, I guess my statement with the dream is dying depends on exactly what you think it's going to be doing.
Starting point is 03:41:30 Like, Claude Code is a general model that can do a lot of things, but it's not necessarily, like, it depends a lot on integrations and other things. Like, I bet Claude Code can do a fairly good job with doing your email, and the hardest part is figuring out how to give the information to it and how to get it to be able to send your emails and stuff like this. But that's just kind of, like, I think it goes back to, what is the one model to rule
Starting point is 03:41:54 everything ethos, which is just, like, a thing in the cloud that handles your entire digital life and is way smarter than everybody. And so it's an interesting leap of faith to go from, Claude Code becomes that, which, like, in some ways,
Starting point is 03:42:14 there's some avenues for that, but I do think that the rhetoric of the industry is a little bit different. I think the immediate thing we will feel next, as a normal person using LLMs, will probably be related to something also trivial, like making
Starting point is 03:42:30 figures. Right now, LLMs are terrible at making figures. Is it because we are getting served the cheap models, with less inference compute behind the scenes? Maybe some, like, there are some cranks we can turn to already get better figures. But if you ask it today to draw a flow chart of XYZ, it's most of the time terrible. And it is kind of like a very simple task for a human. I think it's almost easier sometimes to draw something than to write something. Yeah. The multimodal understanding does feel like something that
Starting point is 03:43:06 that we're not actually realizing that's a gigantic thing that's hard to measure, which is making all of human knowledge accessible to the entire world. Like one of the things I think is hard to articulate, but there's just a huge difference between Google search and an LLM.
Starting point is 03:43:26 I feel like I can basically ask an LLM anything and get an answer. And it's doing less and less hallucination. And that means understanding my own life, figuring out a career trajectory,
Starting point is 03:43:43 figuring out how to solve the problems all around me, learn about anything through human history. That, like, I feel like nobody's really talking about that because they just immediately take it for granted that it's just, this is awesome.
Starting point is 03:43:59 That's why everybody's, using it. It's because you get answers for stuff. And like the impact of that across time, like think about, this is not just in the United States is all across the world. Like kids throughout the world being able to learn these ideas, like the impact that has across time is probably, that's where the real like, talking about GDP, it won't be like a leap. It'll be, that's how we get to Mars. That's how we build these things. That's how we have a million new open. AI's all the kind of innovation that happens from there. And that's just this quiet force that permeates everything, right?
Starting point is 03:44:37 Human knowledge. I do agree with you. And in a sense, it makes knowledge more accessible. But it also, I think, depends on what the topic is. For something like math, in a sense, you can ask it questions, it answers. But if you want to learn a topic from scratch, I think, again, like we talked about earlier, the sweet spot is, I mean, there are really good math textbooks where someone laid it out linearly, and that is, let's say, a proven strategy to learn this topic. And it does make sense, if you start from zero, to ramp up with an information-dense text, to soak it up.
Starting point is 03:45:17 But then you use the LLM to make infinite exercises. Like, you have problems in a certain area, or have questions, or you are uncertain about certain things, and you ask it to generate example problems. You solve them, and you have questions, and then maybe you need more background knowledge, and you ask it to generate that. But then, I think, it won't give you anything, let's say, that is not in the textbook. It's just packaging it differently, if that makes sense. But then there are things, I feel, where it also adds value in a more, I mean, timely sense, where there is no good alternative besides a human doing it on the fly.
Starting point is 03:45:56 For example, I don't know, let's say you're planning to go to Disneyland and you try to figure out which tickets to buy for which park when. Well, there is no textbook on that. There is no information-dense resource on that. There's only the sparse internet. And then there is a lot of value in the LLM. You just ask it, here are the constraints, I'm traveling these and these days, I want to go there and there, please figure out what I need, when, and from where, and what it costs, and stuff like that. And it is very customized, packaged on the fly.
Starting point is 03:46:28 And then this is like one of a thousand examples. In essence, personalized. Personalization is essentially pulling information from the sparse internet, the non-information-dense thing, where there is no better version that exists. It just doesn't exist. You make it from scratch, almost. And if it does exist, it's full of, speaking of Disney World, full of, what would you call it, ad slop? Like, it's just impossible there.
Starting point is 03:46:55 You go to any city in the world, what are the top 10 things to do? An LLM is just way better to ask than anything on the internet. Well, for now. That's because they're massively subsidized, and they're going to be paid for by ads. It's coming. No. Oh.
Starting point is 03:47:13 I hope there, I mean, I'm hoping there's a very clear indication of what's an ad and what's not an ad in that context. That's something I mentioned a few years ago. It's like, I don't know, if you are looking for a new running shoe, well, is it a coincidence that Nike maybe comes up first? Maybe, maybe not. But I think there are clear laws around this. You have to be clear about that. But I think that's what everyone fears. It's like the subtle, you know, subtle message in there or something like that. But it also brings us to the topic of, I guess, ads, where I think this was the thing OpenAI tried to launch in 2025. And just to, you know, just to,
Starting point is 03:47:50 because I think it's still not making money in that other way right now. So, like, having really, like, ad spots in there. And then the thing, though, is they couldn't, because, well, there are alternatives without ads, and people would just flock to the other products. And it also is just, like, crazy how, yeah, they're one-upping each other, spending so much money to just get the users. I think so. Take, like, Instagram ads.
Starting point is 03:48:16 I don't use Instagram, but I understand the appeal of paying a platform to find users who will genuinely like your product. And that is the best case of things like Instagram ads. But there are also plenty of cases where advertising is very awful for incentives. And I think a world where the power of AI can integrate with that positive view of, like, I am a person and I have a small business and I want to make the best, I don't know, damn steak knives in the world, and I want to sell them to somebody who needs them. If AI can make that sort of advertising work even better, that's very good for the world, especially with, like, digital infrastructure, because
Starting point is 03:48:57 that's how, like, the modern web has been built. But that's not to say, like, addicting feeds, so that you can show people more content, is a good thing. So it's like, I think that's even what OpenAI would say, is they want to find a way that can make the monetization upside of ads while still giving their users agency. And I personally would think that Google is probably going to be better at figuring out how to do this, because they already have ad supply, and if they figure out how to turn this demand in their Gemini app into useful ads, then they can
Starting point is 03:49:31 turn it on. And somebody will figure it out. I don't know if I think it's this year, but there will be experiments with it. I do think what holds companies back right now is really just that the competition is not doing it. It's more like a reputation thing. I think people are just afraid right now of ruining or losing the reputation, losing users, because it would make headlines if someone launched these ads. Unless they were great. But the first ads won't be great, because it's a hard problem that we don't know how to solve. Yeah, I think also the first version of that will likely be something like on X, like the timeline, where you have, like, a promoted post sometimes in between.
Starting point is 03:50:08 It would be something like that, where it will say, like, promoted, or something small, and then there will be an image or something. I think right now the problem is, who makes the first move? If we go 10 years out, the proposition for ads is that you will make so much money on ads, by having so many users, that you can use this to fund better R&D and make better models, which is why YouTube is dominating the market. Like, Netflix is scared of YouTube. They have the ads, like, I pay $28 a month for Premium. They make at least $28 a month off of me and many other people, and they're just, like, creating such a dominant position in video. So I think that's the proposition, which is that ads can make you have a sustained advantage
Starting point is 03:50:50 in what you are spending per user. But there's so much money in it right now that it's like, somebody starting that flywheel is scary, because it's a long-term bet. Do you think there'll be some, like, crazy big moves this year, business-wise? Like somebody, like, Google or Apple acquiring Anthropic or something like this? Dario will never sell, but we are starting to see some types of consolidation, with, like, Groq for $20 billion and Scale AI for almost $30 billion, and countless other deals like this, that are structured in a way that is actually detrimental to the Silicon Valley ecosystem, which is this sort of licensing deal where not everybody gets brought along, rather than a full acquisition that benefits the rank-and-file employee by getting their stock vested.
Starting point is 03:51:40 Like, that's a big issue for Silicon Valley culture to address, because the startup ecosystem is the lifeblood, where if you join a startup, even if it's not that successful, your startup very well might get acquired at a cheap premium and you'll get paid out for this equity. And these licensing deals are essentially taking the top talent a lot of the times. I think the deal for Groq to NVIDIA is rumored to be better to the employees, but it is still this antitrust-avoiding thing. But I think that this trend of consolidation will continue. I, and many smart people I respect, have been expecting consolidation to have happened sooner. But it seems like some of these things are starting to turn, but at the same time, you have companies raising ridiculous amounts of money for reasons that, I'm like, I don't know why you're taking that money.
Starting point is 03:52:29 I'm like, I don't know why you're taking that money. So it's maybe mixed this year, but some consolidation pressure is starting. What kind of surprise in consolidation do you think we'll see? So you're saying Anthropic is a never. I mean, Groq is a big one. Groq with a Q, by the way. Yeah. There's just a lot of startups, and there's a very high premium on AI startups.
Starting point is 03:52:49 So there could be a lot of $10 billion-range acquisitions, which is a really big acquisition for a startup that was maybe founded a year ago. I think Manus AI, the company based in Singapore that Meta acquired, was founded eight months ago and then had a $2 billion exit. And I think that there will be some other big, many-billion-dollar acquisitions. Like Perplexity. Yeah, people rumored them to Apple. I think there's a lot of pressure and liquidity in AI. There's pressure on big companies to have outcomes.
Starting point is 03:53:22 And I would guess that a big acquisition gives people leeway to then tell the next chapter of that story. I mean, yeah, I guess Cursor. We've been talking about code, and somebody acquires Cursor. They're in such a good position by having so much user data. Yeah. And we talked about continual learning and stuff. They had one of the most interesting two sentences in a blog post,
Starting point is 03:53:43 which is that they had their new Composer model, which was a fine-tune of one of these large mixture-of-experts models from China. You can know that through gossip, or because the model sometimes responds in Chinese, which none of the American models do. And they had a blog post where they're like, we're updating the model weights every 90 minutes based on real-world feedback from people using it,
Starting point is 03:54:04 which is like the closest thing to real-world RL happening on a model. And it's just in one of their blog posts, which is super cool. And by the way, I use Composer a lot, because one of the benefits it has is that it's fast. I need to try it, because everybody says this. And there will be some IPOs, potentially.
Starting point is 03:54:20 You think Anthropic, OpenAI, xAI? They can all raise so much money so easily that they don't feel the need to. So long as fundraising is easy, they're not going to IPO, because public markets apply pressure. I think we're seeing in China that the ecosystem's a little different, with both MiniMax and Z.ai filing IPO paperwork, which will be interesting to see
Starting point is 03:54:42 how the Chinese market reacts. I actually would guess that it's going to be similarly hypey to the U.S. so long as all this is going, and not based on the realities that they're both losing a ton of money. I wish more of the American gigantic AI startups were public, because it would be very interesting to see how they're spending their money and have more insight. And also just to give people access to investing in these, because these are some of the most formative companies of the era, and the tradition is now for so many of the big startups in the U.S. to not go public. We're still waiting for Stripe and the IPO, and Databricks definitely didn't; they raised like a Series G or something. And I just say,
Starting point is 03:55:23 like, it's a kind of a weird equilibrium for the market where it's like, I would like to see these companies go public and evolve in that way that a company can. You think 10 years from now, some of the frontier model companies are still around, Anthropic, Open AI. I definitely don't see it to be a winner-takes-all unless there truly is so algorithmic secret that one of them finds. Like, let's this flywheel. Because the development path is so similar for all of them. Google and OpenAI have, like, all the same products. And then, like, Anthropics more focused.
Starting point is 03:55:57 But when you talk to people, it sounds like they're solving a lot of the same problems. So I think, and there's offerings that will spread out. There's a lot of, it's a very big cake that's being made that people are going to take money out of. I don't want to trivialize it, but, so OpenAI and Anthropica, primarily LLM service providers. And some of the other companies like Google and XAI, linked to X, does other stuff too. And so it's very possible if AI becomes more commodified that the companies that are just providing LLM will die. I think they will, the advantage they have, they have a lot of users, and I think they will just pivot. I think then if they figure out, it's like anthropic, I think pivoted.
Starting point is 03:56:43 I don't think they originally planned to work on code, but it happened that they found, okay, this is like a nice niche, and now we are comfortable in this niche, and we push on this niche, and I can see the same thing once. Maybe, let's say, hypothetically speaking, I'm not sure it will be true, but let's say Google takes all the market share of the general chatbot. maybe open I will be then focus on some other topic like the, they have too many users
Starting point is 03:57:06 to go away in foreseeable future, I think. I think Google is always ready to say, hold my beer with AI mode. I think that the question is if the companies can support the valuations. I think I'd see the AI companies being looked at in some ways like AWS Azure and GCP
Starting point is 03:57:23 are all competing in the same space in all very successful businesses. There's a chance that the API market is so unprofitable that they go up and down the stack to products and hardware. They have so much cash that they can build power plants and build data centers, which is a durable advantage now. But there's also just a reasonable outcome that these APIs are so valuable and so flexible for developers that they become the likes of something like AWS. But AWS and Azure also can have these APIs. So there's some, like that's a like five or six people competing in
Starting point is 03:57:56 the API market is hard. So maybe like that's why they get squeezed out. You mentioned RIP Lama. Is there a path to winning for meta? I think nobody knows they're moving a lot. So they're signing licensing deals with Black Forest Labs, which is an image generation or mid-journey or applying mainness. So I think it's some ways it's on the product and like consumer-facing AI front. It's too early to tell.
Starting point is 03:58:24 I think they have some people that are excellent and very motivated, being close to Zuckerberg. So I think that there's still a story to unfold there. Lama is a bit different where Lama was the most focused expression of the organization. And I don't see Lama being supported to that extent. I think it was a very successful brand for them. So they still might do some part of participation in the open ecosystem or continue the Lama brand into a different surface.
Starting point is 03:58:52 The people know what Lama is. You think there's a Lama 5? Not an open weight one. it's interesting I think also just to recap a bit I think I mean Lama was the I would say pioneering open weight model
Starting point is 03:59:06 and then Lama 1 2 3 a lot of love but I think then I think what happened just hypothesizing or speculating I think the leaders at META like the upper executives they I think they got really excited about Lama
Starting point is 03:59:20 because they saw how popular it was in the community and then I think the problem was trying to let's say monetize the open source but like kind of use the open source to make a bigger splash in a sense, like to kind of force it almost, it felt forced like developing these very big Lama 4 models to have like the best, like to be on the top of the benchmarks.
Starting point is 03:59:41 But I don't think the goal of Lama models is to be on top of the benchmarks, beating, let's say, Chachapida or other models. I think the goal was to have a model that people can use, trust, modify, understand that. So that includes having smaller models. They don't have to be the best models. And what happened was just these models. were, of course, like the benchmarks
Starting point is 04:00:00 suggest that they were better than they were by it because I think they had like specific models trained on preferences that they perform well on the benchmarks. It's kind of like this overfitting thing to kind of force it to be the best, but then at the same time, they didn't do the small models that people could use. And I think that no one could run these big models
Starting point is 04:00:18 then. And it was kind of like a weird thing. And I think it's just because people got too excited about headlines pushing the frontier. I think. And too much like on the bench maxing side. Yeah, I think it imploded under political, like internal political fighting and misaligned incentives. So I think the researchers want to build the best models, but there's a layer of organization and manager that is trying to demonstrate that they do these things. And then there's lots of, there's a lot of pieces and rumors where how like some horrible technical decision was made
Starting point is 04:00:50 and how that comes in. And it just seems like it kind of got too bad where it all just crashed out. But we should also give huge props to Mark Zuckerberg. I think it comes from Mark, actually. Mark Zuckerberg from the top of the leadership saying open source is important. I think that's like that, the fact that that exists means there could be a Lama 5, where they learn the lessons from the benchmarking and say we're going to be GPTOSS and provide really awesome library of open source. What people say is that there's a debate between Mark and Alexander Wong, who is very bright, but much more against open source. And to the extent that he has a lot of influence over the AI org, it seems much less likely.
Starting point is 04:01:36 Because it seems like Mark brought him in for like a fresh leadership aid in directing AI. And if the like open or closed is no longer the defining nature of the model, I don't expect that to be a defining argument between Mark and Alex. So, like, they're both very bright. But I just, like, I have a hard time understanding all of it because Mark wrote this piece in July of 2024, maybe, which was, like, probably the best blog post at the time saying the case for open source AI. And then July 2025 came around,
Starting point is 04:02:10 and it was like, we're reevaluating a relationship with open source. So it's just kind of like... But I think also the problem, not the problem, but I think, well, we may have been a bit also too harsh, I think and that caused some of that because I think, I mean, we as open source developers or the open source community because I think even though the model was maybe not what everyone hoped for, it got a lot of backlash and I think that was a bit unfortunate because I can see that as a company now they were hoping for positive headlines and instead of just getting no headlines or not these positive headlines, in turn they got negative headlines. And then it kind of reflected bad on the company. and I think that is also something like where you, it's maybe a spite reaction,
Starting point is 04:02:53 almost like, okay, we have, we try to do something nice, we try to give you something cool, like an open source model, and now you are like, you know, kind of like being negative about us, even like for the company. So in that sense, it looks like, well, maybe then we'll change our mind, I guess, I don't know.
Starting point is 04:03:11 Yeah, that's where the, the dynamics of discourse on X can lead us as a community astray. Because sometimes it feels random. People pick the thing they like they don't like. And you can see the same thing with GROC 4-1 and GROC code Fast 1. I don't think vibe-wise people love it publicly. But a lot of people use it.
Starting point is 04:03:42 So if you look to Reddit and X, they don't really give it praise from the programming community. but like they use it. And the same thing with probably with the Lama. I don't understand. I don't understand the dynamics of either positive hype or negative hype. I don't understand it. I mean,
Starting point is 04:03:58 the story of one of the stories of 2025 is the U.S. feeling the gap of Lama, which is like all the rise of these Chinese open weight models to the point where I was like, that was the single issue I've spent a lot of energy on the last five months is like trying to do policy work to get the U.S. to invest in this. Tell me the story of Adam.
Starting point is 04:04:16 Adam Project is, It started as me calling it the American Deep Seek Project, which doesn't really work for DC audiences, but it's the story of like, what is the most impactful thing I can do with my career, which is that Chinese open weight models are cultivating a lot of power. And there is a lot of demand for building on these open models, especially in enterprises in the U.S. that are very cagey about these Chinese models. Going to perplexity, the Adam Project, American truly open models is a U.S.-based initiative to build and host high quality, genuinely. open-weight AI models and supporting infrastructure explicitly aimed at competing with and catching up to China's rapidly advancing open-source AI ecosystem. I think the one sentence summary would be that, or two sentences. One is a proposition that open models are going to be an engine for AI research because that is what people start with. Therefore, it's important to own them. And the second one is, therefore, the U.S. should be
Starting point is 04:05:14 building the best models so that the best researcher happens in the U.S. and the U.S. companies take the value from being the home of where AI research is happening. And without more investment in open models, we have all the plots on the website where it's like, Quinn, Quinn, Quinn, and it's all these models that are excellent from these Chinese companies that are cultivating influence in the U.S. in China and internationally. And I think the U.S. is spending way more on AI and the ability to, to create open models that are half a generation or a generation behind what the cutting edge of a closed labs is, costs orders of like $100 million, which is a lot of money, but not a lot of the money to these companies. So therefore, we need a centralizing force of people who want to do this.
Starting point is 04:05:59 And I think we got signed engagement from people pretty much across the full stack, whether it's policy. So there has been support from the administration? I don't think anyone in that, like, technically in government has, like, signed it. publicly, but I know that people that have worked in AI policy, both in Biden and Trump administration, are very supportive of trying to promote open source models in the U.S. I think, for example, AI2 got a grant from the NSF for $100 million over four years, which is, like, the biggest CS grant the NSF has ever awarded. And it's for AI2 to attempt to this, and I think it's a starting point.
Starting point is 04:06:37 But the best thing happens when there are multiple organizations building models, because they can cross-pollinate ideas and kind of build this. ecosystem. Like, I don't think if it just works if it's just Lama releasing models to the world, because then you can see Lama can go away. The same thing applies for AI2, where it's like, I can't be the only one building models. And I think that it's like, that it becomes a lot of time spent on talking to people, whether they're in policy. I know NVIDIA is very excited about this. I think Jensen Wong has been specifically talking about the urgency for this, and they've chain, they've done a lot more in 2025, where the nematron models are more of a focus. They've
Starting point is 04:07:16 started releasing some data along with Nvidia's open models. And like, very few companies do this, especially of invidia's size. So like, there is, there is signs of progress. And we hear about reflection AI where they say their $2 billion fundraise is dedicated to building US open models. And I feel like their announcement tweet is like it reads like a blog post out right. And I I think that that cultural tide is starting to turn. I think in July was when we had like four or five deep-seek caliber Chinese open weight models in zero from the U.S. And that's the moment where I was released this. And I was like, oh, I guess I have to spend energy on this because nobody else is going to do it.
Starting point is 04:07:56 So it takes a lot of people contributing together. And I don't say that like the atom project isn't like the thing that's helping to move the ecosystem. But it's people like me doing this sort of thing to get the word out. do you like the 2025 America's AI Action Plan that includes open source stuff? The White House AI Action Plan
Starting point is 04:08:15 includes a dedicated section titled to encourage open source and open web AI defining such models and arguing they have unique value for innovation and startups. Yeah, I mean, like the AI action plan is a plan, but
Starting point is 04:08:28 largely I think it's like maybe the most coherent policy document that has come out of the administration and I hope that it largely succeeds. And I know people that have worked on the AI action plan. And the challenge is taking policy and making it real. And I have no idea how to do this as an AI researcher. But like, largely a lot of things on that are very real. And there's a huge build out of AI in the country. And it's like there are a lot of issues that people are hearing about from water use to whatever. And
Starting point is 04:08:57 like, we should be able to build things in this country. But also we need to not ruin places in our country in the process of building it. And it's a worthwhile to spend energy on. I think that's a role at the federal government plays it's like they set the agenda and with AI setting the agenda that open weight should be a first consideration
Starting point is 04:09:16 is like that's a large part of what they can do and then people think about it. Also for education and talent for these companies it's I think very important because otherwise if they're only closed models
Starting point is 04:09:30 how do you get the next generation of people contributing at some point because otherwise you will at some point only be able to learn after you joined a company, but then at that point, like, how do you hire talented people, how do you identify talented people? And I think open source is, that's even for a lot of things, but also even just for educating the population and training the next generation of researchers. It's the way or the only way. The way that I could have gotten this to go more viral was to tell a story of Chinese AI integrating with an authoritarian state and being ASI and taking
Starting point is 04:10:05 over the world and therefore we need our own American models, but it's very intentional for why I talk about innovation and science in the U.S. because I think it's both more realistic as an outcome, but just like, it's like, that's a world that is, I would like to manifest. I would say, though, also even like, let's say any open weight model I do think is a valuable model. Yeah. My argument is that we should be in a leading position, but I think that it's worth saying it so simply because there are still voices in the AI ecosystem that say we should consider banning releasing open models due to the safety risks. And I think it's worth adding that I think effectively that's impossible without making the U.S. have its own great firewall, which is also known to not
Starting point is 04:10:51 work that well because the cost for training these models, whether it's one to $100 million, is attainable to a huge amount of people in the world that want to have influence. So these models will be getting trained all over the world. And these, we want the models, especially when, like, I mean, there are safety concerns, but we want these information and tools to flow freely across the world and into the U.S. so that we people can use them and learn from them. And we, like, stopping that would be such a restructuring of our internet that it seems impossible. Do you think maybe in that case, the big open weight models from China are actually a good thing in a sense, like for the U.S. companies, because maybe the U.S. companies, you mentioned,
Starting point is 04:11:33 earlier they are usually one generation behind in terms of what they release open source versus what they are using. For example, GPTOS might not be the cutting edge model, Gemma 3 might not be. But they do that because they know this is safe to release. But then when they see, these companies see, for example, there is Deepseek version 3.2, which is really awesome. And it gets used and there is no backlash. There is no security risk that could then again encourage them to release better models.
Starting point is 04:12:00 Maybe that in a sense is a very positive thing. 100%. These Chinese companies have set things into motion that I think would potentially not have happened if they were not all releasing models. So I think it's like, I'm almost sure that those discussions have been had by leadership. Is there a possible future where the dominant models, AI models in the world, they're all open source? Depends on the trajectory of progress that you predict. If you think saturation and progress is even coming within a few years, so essentially within the time where financial support is still very good, then open models will be so optimized and so much cheaper to run that they will win out. Essentially, this goes back to open source ideas where
Starting point is 04:12:41 so many more people will be putting money into optimizing the serving of these open weight common architectures that they will become standards. And then you could have chips dedicated to them and it'll be way cheaper than the offerings from these closed companies that are custom. We should say that AI 27 report kind of predicts one of the things it does from a narrative perspective is that there will be a lot of centralization. As the AI system gets smarter and smarter, the national security concerns will come to be and you'll centralize the labs and you become super secretive and there'll be this whole race from a military perspective of how do you, between China and the United States.
Starting point is 04:13:22 And so all of this fun conversations we're having about LMs, the generals, the soldiers will come into the room and be like, all right, we're now in the Manhattan Project stage of this whole thing. I think 2025, 6, 7, 27, I don't think something like that is even remotely possible. I mean, you can make the same argument for computers, right? You can say, okay, computers are capable and we don't want the general public to get them or chips, even AI chips, but you see how, like, you know, Huawei makes chips now, you know, took a few years.
Starting point is 04:13:56 But, and I think, I don't think there is a way you can contain something like that, like knowledge like that. I think in this day and age, it is impossible. Like the internet, I don't think this is a possibility. On the Manhattan Project thing, one of my funny things making out of them is I think that, like a Manhattan project-like thing for open models would actually be pretty reasonable because it wouldn't cost that much. But I think that that will come.
Starting point is 04:14:21 It seems like culturally the companies are changing. But I agree with Sebastian and all the stuff that you just said. It's just like I don't see it happening nor being helpful. Yeah, I mean, the motivating force behind the Manhattan Projects is there are a civilizational risk. It's harder to motivate that for open source models. There's not civilizational risk. You think on the hardware side, we mentioned Nvidia a bunch of times. Do you think Jensen and Vindia are going to keep winning?
Starting point is 04:14:51 I think they have the downside that they have to iterate a lot and manufacture a lot. And I think they probably, what they're doing, they do innovate. But I think there's always the chance that there is something who does something fundamentally different, who gets very lucky and then does something. But the problem is, I think, adoption. You know, like the mode of Nvidia is probably not just the GPU. It's more like the Kuda. ecosystem and that has evolved over so many, I think, I mean, even back when I was a grad student,
Starting point is 04:15:23 I was in a lab, we did biophysical simulations, molecular dynamics, and we had a Tesla GPU back then just for the computation that was 15 years ago now. And just they built this up for a long time and that's like that's the mode. I think it's not the chip itself, although they have now the money to iterate and build and scale. But then it's really on the compatibility. It's like, Well, if you're at that scale as a company, why would you go with something risky, where it's only a few chips that they can make per year? You go with the big one.
Starting point is 04:15:55 But then I do think with LLMs now also, it will be easier to design something like Kuda. So it took 15 years because it's hard. But then now we have LLMs. We can maybe replicate Kuda. And I wonder if there will be a separation of the training and the inference compute, as we kind of stabilize a bit more
Starting point is 04:16:15 and more and more, computer is needed for inference. That's supposed to be the point of the GROC acquisition. And that's why part of what Verra Rubin is, where they have a new chip with no high band with memory, or very little, which is one of the most expensive pieces. It's designed for pre-fill, which is the part of inference where you essentially do a lot of matrix multiplications, and then you only need the memory when you're doing this auto-regressive generation, and you have the KV cache swaps.
Starting point is 04:16:44 So they have this new GPU that's designed for that specific use case, and then the cost of ownership per flop or whatever is actually way lower. But I think that Nvidia's fate lies in the diffusion of AI still. Their biggest clients are still these hyperscale companies, whether it's like Google obviously can make TPUs. Amazon is making Traneum. Microsoft will try to do its own things. And like so long as the pace of AI progress is high, Nvidia's platform. is the most flexible, and people will want that. But if there's stagnation, then creating bespoke chips, there's more time to do it.
Starting point is 04:17:22 It's interesting that Nvidia is quite active in trying to develop all kinds of different products. They tried to create areas of commercial value that will use a lot of GPUs. But they keep innovating, and they're doing a lot of incredible research, so. Everyone says the company is super oriented around Jensen and how operational. operationally plugged in he is, and it sounds so unlike many other big companies that I've heard about. And so long as that's the culture, I think that I will expect them to keep progress happening. And it's like he's still in the Steve Jobs era of Apple. So long as that is how it operates, I'm pretty optimistic for their situation because it's like, it is their top order problem. And I don't know if making these chips for the whole ecosystem is the top goal of all these other companies. They will do a good job, but it might not be. as good of a job. Since you mentioned Jensen,
Starting point is 04:18:17 I've been reading a lot about history and about singular figures in history. What do you guys think about the single man, woman view of history? How important are individuals for steering the direction of history in the tech sector? So, you know, what's in video without Jensen?
Starting point is 04:18:34 You mentioned Steve Jobs. What's Apple without Steve Jobs? What's XAI without Elon? or deep mind with Audemus. People make things earlier and faster, where scientifically, many great scientists credit to being the right place at the right time and still making the innovation,
Starting point is 04:18:55 where eventually someone else will still have the idea. So I think that in that way, Jensen is helping manifest this GPU revolution much faster and much more focused than without having a person there it would do. And this is making the whole AI buildout faster. But I do still think that eventually something like chat GPT would have happened and a buildout like this would have happened,
Starting point is 04:19:20 but it probably would not have been as fast. Or like I think that's the sort of flavor that is applied. People, these individual people are people who are placing bets on something. Some get lucky. Some don't. But if you don't have these people at the helm, it would be more diffused. It's almost like investing in an ETF versus individual stocks. individual stocks might go up, might go down more heavily than an ETF, which is more balanced.
Starting point is 04:19:44 It will eventually go up over time. We'll get there. But it's just like, you know, the focus, I think is the thing, passionate focus. Isn't there a real case to be made that without Jensen, there's not a reinvigoration of the deep learning revolution? It could have been 20 years later is the thing that it would say. Yeah, yeah, 20 years. Or like another AI, like a deep learning winter could have come if GPUs weren't around. That could change history completely because you could think of all the other technologies that could have come in the meantime and the focus of human civilization
Starting point is 04:20:16 the silicon value would be captured by a different hype. But I do think of this, I mean, there's certainly an aspect where it was all planned the GPU trajectory, but on the other end, it's also a lot of lucky coincidences. For example, or good intuition, like the investment into the, let's say, biophysical simulations, or like,
Starting point is 04:20:35 I mean, I think it started with video games and then it just happened to be good at linear algebra because video games require a lot of linear algebra, and then you have the biophysical simulations. But still, I don't think the plan, the master plan was AI. I think there was just, it happened to be Alex Krashefsky. So someone took these GPUs and like, hey, let's try to train a new network on that
Starting point is 04:20:57 and happen to work really well. And I think it only happened because you could purchase those GPUs. Gaming would have created a demand for faster processors if NVIDIA had got a, out of business in the early days. That's what I would think. Like, I think that the GPUs would have been different for the AlexNet. But I think, like, GPs would still exist at the time of AlexNet and at the time of
Starting point is 04:21:20 the transformer. It was just hard to know if it would be one company as successful or multiple smaller companies with worse chips. But I don't think that's, like, a 100-year delay. It might be a decade delay. Well, it could be one, two, three, four, five decade delay. I mean, I just can't see Intel or AMD doing what Nvidia did. I don't think it would be a company that exists.
Starting point is 04:21:44 I think it would be a different company. Like Silicon Graphics or something. So, yeah, some company that has died would have done it. But it does, like, just looking at it, it seems like these singular figures, these leaders, have a huge impact on the trajectory of the world. Obviously, incredible teams behind them. But, you know, having that kind of. very singular, almost dogmatic focus is necessary to make progress.
Starting point is 04:22:13 Yeah, I mean, even GPT wouldn't exist if there wasn't a person, Ilya, who pushed for this scaling, right? I mean, yeah, Dario was also deeply involved in that. You read some of the histories of OpenAI. It almost seems wild thinking about how early these people were like, we need to hook up 10,000 GPUs and take all of OpenAI's compute
Starting point is 04:22:31 and train one model. There were a lot of people there that didn't want to do that. Which is an insane thing to believe. To believe in scaling before scaling showed any indication that it was going to materialize. Again, singular figures. Speaking of which, 100 years from now, this is presumably post-Singularity,
Starting point is 04:22:52 whatever singularity is, when historians look back at our time now, what technological breakthroughs would they really emphasize as the breakthroughs that led to the singularity? So far we have Turing to today, 80 years. I think it would still be computing, like the umbrella term computing. I don't necessarily think it's even like 100 years,
Starting point is 04:23:17 200 years from now that it would be AI. It could still well be computers, you know. We are now taking better advantage of computers, but it's the fact of computing. It's basically a Moore's Law kind of discussion. Even the details of CUDA and GPUs won't be remembered. And it won't be all the
Starting point is 04:23:35 software turmoil. It'll be just, obviously, compute. I generally agree, but are the connectivity of the internet and compute going to be merged, or remembered as both? I think the internet will probably be
Starting point is 04:23:52 related to, yeah, I mean, communication. It could be the phone, the internet, satellites, that stuff, and compute is more like the scaling aspect of it. It's possible that the internet is completely forgotten, that the internet is wrapped into the phone networks, like communication networks.
Starting point is 04:24:11 This is just another manifestation of that. And the real breakthrough comes from just the increased compute, Moore's Law broadly defined. Well, I think that connection of people is very fundamental to it. So it's like you can talk to anyone. You want to find the best person in the world
Starting point is 04:24:28 at something, somewhere in the world. And being able to have that flow of information, the AIs will also rely on this. I've been fixating on, like, when I said the dream was dead about the one central model, the thing that is evolving is that people have many agents for different tasks. People are starting to do this with different Claudes for different tasks,
Starting point is 04:24:47 and it's described as many AGIs in the data center, where each one manages things and they talk to each other. And, like, that is so reliant on networking and the free flow of information on top of compute. But networking, especially with GPUs, is such a part of scaling up compute. Like, the GPUs in the data centers need to talk to each other. Will anything about neural networks be remembered?
Starting point is 04:25:12 Do you think there's something very specific and singular to the fact that it's neural networks that's seen as a breakthrough, like a stroke of genius, that you're basically replicating, in a very crude way, the structure of the human brain, the human mind? I think without the human mind, we probably wouldn't have neural networks, because it was just an inspiration for that.
Starting point is 04:25:34 But on the other end, I think it's just so different. I mean, it's digital versus, you know, biological, so I do think it will probably be grouped more as an algorithm that's massively parallelizable on this particular kind of compute. It could have been like genetic computing, like genetic algorithms, just as parallelizable. It just happens that this is more efficient, works better, you know. And it very well could be that the LLM, you know, the neural networks, the way we architect them now, is just a small component of the system that leads to singularity. I think, if you think of it in 100 years,
Starting point is 04:26:08 society, I think, can be changed more with more compute and intelligence because of autonomy. But it's like looking at, what are the things from the industrial revolution that we remember? We remember the engine, which is probably the equivalent of the computer in this. But there's a lot of other physical transformations that people are aware of, like the cotton gin and all these things,
Starting point is 04:26:32 these machines that are still known, air conditioning, refrigerators. Some of these things from AI will still be known. Like, the word transformer could still very well be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from
Starting point is 04:26:49 in 100 years, with ASI, with AI researchers everywhere. But I think deep learning is likely to be a term that is remembered. And I wonder what the air conditioning and the refrigeration of the future is that AI brings. If we travel forward 100 years from now, we transport there right now, what do you think is different? How do you think the world looks different?
Starting point is 04:27:15 First of all, do you think there's humans, do you think there's robots everywhere, walking around? I do think specialized robots for sure, for certain tasks. Humanoid form? On that, I'm maybe fifty-fifty on humanoid. We'll see. I think for certain things, yes, there will be humanoid robots, because the environment is just amenable to it, but for certain tasks a different form might make sense.
Starting point is 04:27:35 What's harder to imagine is how we interact with devices and what humans do with devices. I mean, I'm pretty sure it will probably not be the cell phone, it will probably not be the laptop. Will it be implants? I mean, it has to be brain-computer interfaces, right? I mean, 100 years from now, it has to, like, given the progress we're seeing now,
Starting point is 04:27:55 there has to be, unless there's legitimately a complete alteration of how we interact with reality. On the other hand, if you think of cars, cars are older than 100 years, right? And it's still the same interface. We haven't replaced cars with something else. We just made the cars better, but it's still a steering wheel, it's still wheels, you know. I think we'll still carry around a physical brick of compute, because people want some ability to have something private. Like, you might not engage with it as much as a phone, but having something where you can have private information that is yours
Starting point is 04:28:30 as an interface between you and the rest of the internet, I think is something that will still exist. It might not look like an iPhone, and it might be used a lot less, but I still expect people to carry things around. Why do you think the smartphone is the embodiment of privacy? There's a camera on it. Private for you, like encrypted messages, encrypted photos.
Starting point is 04:28:51 You know what your life is. Like, I guess this is the question of how optimistic you are on brain-machine interfaces. Is all of that just going to be stored in the cloud, your whole calendar? Like, it's hard to think about processing all the information that we can process visually through brain-machine interfaces presenting something like a calendar to you. Like, it's hard to just think about knowing your email inbox without looking. Like, you signal to a computer and then you just know your email inbox. Is that something that the human brain can handle being
Starting point is 04:29:29 piped into it non-visually? Like, I don't know exactly how those transformations happen. Because humans aren't changing in 100 years. I think agency and community are things that people actually want. Local community. So people you are close to, being able to do things with them, and being able to ascribe meaning to your life, and to be able to do things.
Starting point is 04:29:53 If not in 100 years, I don't think that human biology is changing away from those on any time scale that we can discuss. And I think that, like, UBI does not solve agency. I do expect mass wealth, and I hope that it is spread, so that the average life does look very different in 100 years. But that's still a lot to happen in 100 years. If you think about countries that are early in their development process of getting access to computing and the internet, to build all the infrastructure and to have policy that shares one nation's wealth with another, I think it's an optimistic view to see all that happening in 100 years while they are still independent entities and not just absorbed into some international order by force. But there could be just better, more elaborate, more effective social support systems that help alleviate some levels of basic suffering in the world.
Starting point is 04:30:57 You know, with the transformation of society where a lot of jobs are lost in the short term, I think we have to really remember that each individual job that's lost is a human being who's suffering. When jobs are lost at scale, it's a real tragedy. You can make all kinds of arguments about economics, or that it's all going to be okay, it's good for the GDP, there are going to be new jobs created. Fundamentally, at the individual level, for that human being, that's real suffering.
Starting point is 04:31:29 That's a real personal sort of tragedy, and we have to not forget that as the technologies are being developed. And also my hope, with all the AI slop we're seeing, is that there will be a greater and greater premium on the fundamental aspects of the human experience that are in person, the things that we all like: seeing each other, talking together in person. The next few years are definitely going to see increased value on physical goods and events
Starting point is 04:32:01 and even more pressure from slop. The slop is only starting. The next few years will bring more and more diverse versions of slop. We'll be drowning in slop. I'm hoping that society drowns in slop enough to snap out of it and be like, we can't. Like, none of it. Like, it just doesn't matter. We can't deal with it all.
Starting point is 04:32:22 And then, like, the physical has such a higher premium on it. Even, like, classic examples, I honestly think this is true. And I think we get tired of it. We are already kind of tired of it. Same with, I mean, even art. I don't think art will go away. I mean, you have paintings, physical paintings. There's more value, not just monetary value, but just
Starting point is 04:32:43 more appreciation for something that is the actual painting than a photocopy of that painting. It could be a perfect digital reprint of it. But there is something when you go to a museum and you look at that art and you see the real thing, and you think about, okay, a human, I don't know, it's like a craft. You have an appreciation for that. And I think the same is true for writing, for talking, for any type of experience. I do unfortunately think it will be like a dichotomy, like it will be like a fork
Starting point is 04:33:10 where some things will be automated. Like, you know, there are not as many paintings as there used to be 200 years ago. There are more photographs, more photocopies. But at the same time, it won't go away. There will be, you know, value in that. I think the difference will just be, you know, what's the proportion of that. But personally, I have a hard time reading things where I can see it's obviously AI generated. I'm like, sorry, there might be really good information there, but I have like a certain, nah, not for me.
Starting point is 04:33:41 I think eventually they'll fool you, and it'll be on platforms to give ways of verifying or building trust. So you will trust that Lex is not AI generated, having been here. So then you have trust in this channel, but it's harder for new people that don't have that trust. Well, that will get interesting, because I think it's fundamentally a solvable problem, by having, you know, trust in certain outlets that they won't do it. But it's all going to be kind of trust-based. There will be some systems to authorize,
Starting point is 04:34:15 okay, this is real, this is not real. There will be some tell-tale signs where you can obviously tell this is AI generated and this is not, but some will be so good that it's hard to tell, and then you have to trust. And that will get interesting and a bit problematic. The extreme case of this is to watermark all human content. So all photos that we take on our own have some watermark until they are edited or something like this.
Starting point is 04:34:36 And software can manage communications with the device manufacturer to maintain the human-edited status. Which is the opposite of the discussion of trying to watermark AI images, where you can make an image with a Google tool that has a watermark and then use a different Google tool
Starting point is 04:34:52 to remove the watermark. Yeah, it's going to be an arms race. And we've been mostly focusing on the positive aspects of AI. I mean, there's also... all the capabilities we've been talking about can be used to destabilize human civilization, with even just relatively dumb AI
Starting point is 04:35:09 applied at scale, and then further and further superintelligent AI systems. Of course, there's the sort of doomer take that's important to consider a little bit as we develop these technologies. What gives you hope about the future
Starting point is 04:35:25 of human civilization, everything we've been talking about? Are we going to be okay? I think we will. I'm definitely a worrier, both about AI and non-AI things, but humans do tend to find a way.
Starting point is 04:35:41 I think that's what humans are built for: to have community and find a way to figure out problems. And that's what has gotten us to this point. And I think that the AI opportunity and related technologies are really big. And I think there are big social and political problems in helping everybody understand that. And I think that's what we're staring at a lot of right now.
Starting point is 04:36:05 It's like the world is a scary place, and AI is a very uncertain thing. And it takes a lot of work that is not necessarily building things. It's telling people and understanding people, which the people building AI are historically not motivated to do or wanting to do. That is something that is probably doable. It just will take longer than people want. We have to go through that long period of hard, fraught AI discussions if we want to have the lasting benefits. Yeah, and through that process, I'm especially excited that we get a chance to better understand ourselves, at the individual level as humans and at the civilization level.
Starting point is 04:36:49 It answers some of the big mysteries. Like, what is this whole, like, consciousness thing going on here? It seems to be truly special. Like, there's a real miracle in our mind, and AI puts a mirror to ourselves, and we get to answer some of the big questions about what this whole thing going on here is. One thing about that, also, what I do think makes us very different from AI, and why I don't worry about AI taking over, is, like you said, consciousness. We humans, we decide what we want to do.
Starting point is 04:37:21 AI, in its current implementation, and I can't see it changing, you have to tell it what to do. And so you still have the agency. It doesn't take the agency from you, because it becomes a tool. You can think of it as a tool. You tell it what to do. It will be more automatic than other previous tools. It's certainly more powerful than a hammer.
Starting point is 04:37:41 It can figure things out, but it's still you in charge, right? So the AI is not in charge, you are in charge. You tell the AI what to do, and it's doing it for you. So in the post-Singularity post-apocalyptic war between humans and machines, you're saying humans are worth fighting for? 100%. I mean, this is the movie Terminator they made in the 80s, essentially. And I do think, well, the only thing I can see going wrong is, of course, if things are explicitly programmed to do the thing that is harmful, basically. I think actually in that, in a Terminator type of setup, I think humans win.
Starting point is 04:38:21 I think we're too clever. It's hard to explain how we figure it out, but we do. And we'll probably be using local LLMs, open source LLMs, to help fight the machines. I apologize for the ridiculousness. Like I said, Nathan already knows. I've been a big fan of his for a long time. I've been a big fan of yours, Sebastian, for a long time.
Starting point is 04:38:46 So it's an honor to finally meet you. Thank you for everything you put out into the world. Thank you for the excellent books you're writing. Thank you for teaching us. And thank you for talking today. This was fun. Thank you for inviting us here and having this human connection, which is extremely valuable human connection.
Starting point is 04:39:05 Thanks for listening to this conversation with Sebastian Raschka and Nathan Lambert. To support this podcast, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now, let me leave you with some words from Albert Einstein. It is not that I'm so smart, but I stay with the questions much longer. Thank you for listening and hope to see you next time.
