Latent Space: The AI Engineer Podcast - Information Theory for Language Models: Jack Morris

Starting point is 00:00:03 Hello, this is Latent Space, just swyx today, with our special guest, Jack Morris. I guess from Columbia. That's your affiliation right now? Cornell. It's actually confusing because I'm in the New York City outpost of Cornell. So you have the city right, but it's Cornell Tech, which is like a small Cornell campus in New York. I just, you're a student of Sasha Rush, who teaches at Cornell. So I actually didn't make that connection.

Starting point is 00:00:33 Okay, yeah, I'm sorry. Well, that's a horrible mistake to make right off the bat. But you're one of, look, there are not that many PhD students that make an impact with their research. The last time something like this happened was Shunyu Yao from Princeton. And he joined the OpenAI Operator team quite shortly after he graduated. So like, you're one of those like high-profile PhD students at least that's like coming out of the program. And I figured it was a good time to just talk about your work. And also the fact that you're looking for which lab you're going to join.

Starting point is 00:01:11 That's like a whole interesting meta discussion, especially with like the insane market for AI talent these days. What's it like to be an AI grad student these days? Yeah, and thanks for having me. I guess maybe we can go back to when things first started or like put yourself in my shoes in 2017, 2018, I really learned a lot about machine learning. And at my, I went to a state university. It's a good school, but they didn't have like a deep learning research department or anything. They had people doing it, but it was just not as big at that time. But I was getting really interested in those topics, especially as applied to language. And then in 2019, I kind of was starting to do research. And I think thinking about my career, I mean, at that point, I was 20, 21. I was thinking about, like, where do I want to be career-wise or, like, who's doing the coolest stuff right now?

Starting point is 00:02:06 Like, looking at, like, what kind of stuff is coming out at that time? I mean, I think AlphaGo. I thought AlphaGo was really good. At that time, I was playing a lot with, like, BERT and BERT-based models. So, like, you know, Google, DeepMind, they're doing great work. GPT-2, GPT-1 from OpenAI were, like, interesting. But I think most people were into BERT at that time. I still have a soft spot for, like, that parameter class of, like, 100 million

Starting point is 00:02:31 to one billion scale models. But this is all to say, I think at that time, I felt like the people doing a lot of the most impactful work were professors and PhD students, like just a ton of like interesting ideas being explored and cool opportunities in academia. So I ended up applying to grad school. Well, at first I did this Google AI residency program,

Starting point is 00:02:55 which was mostly during the pandemic, like 2020 and then 2021. And then I was also applying to grad school. Started grad school in 2021. That's still what was going on at that time. Like around when, I guess, GPT-3, 175 billion, had been released, but not InstructGPT. So, like, we had pre-training and sort of the science of pre-training was emerging, but that's where the models were. And I still think, like, I'm glad that I went to grad school.

Starting point is 00:03:25 And, like, I had a great experience, but the last five years have changed a lot. Like the whole meta has shifted, you know, like the kind of power dynamics are completely different. The ideas are coming from different places. Most stuff was open. Now most stuff is not open. The types of questions people are asking are different. And so, yeah, I mean, for better or for worse, I did go to do the full grad school thing and here I am. It's been really an interesting perspective watching the science kind of emerge with the products. Like, the biggest thing that happened by far was like ChatGPT coming out, which was right in the middle, like, what, 2022 before Christmas, like November? I remember that year like my grandma was asking

Starting point is 00:04:11 me about it. And that's when it hit me like, oh, this is actually becoming like a real area that people will know about and understand. Like I was trying to explain it to my parents. And that's when I think things really started to change in terms of the types of questions you wanted to ask can't always be answered with academic resources. So a lot of the fundamental kind of like boundary pushing in AI science moved into companies. That was the year when like, you know, just around NeurIPS as well, everyone in NLP and deep learning were like very confused at like, I think some people were like kind of expecting this already in a sense that they were obviously more clued into large language models. But I think that the sheer amount of consumer-level interest that was around

Starting point is 00:04:58 at the time in 2022, that completely changed the world. Like now we're just like in a different sphere. Did you have to pivot your research or were you already, you just went from BERT to like other stuff? You've done a lot of embeddings work. I mean, you're always heads down working on a problem. So I don't think most people in academia are the type to say, oh, look at this new product that came out.

Starting point is 00:05:22 I'm going to abandon everything I'm doing. That can be the right move, you know? Oh, it definitely can. And honestly, if I were to give advice to a younger grad student, I think the way to do it would be literally just like sit and wait until the next kind of paradigm shift and then just immediately start working as fast as you can to like re-implement it.

Starting point is 00:05:44 Like I don't think that's like maybe the best way to do science, but it's probably the best way to play the sort of academic game in the days of AI. Like you've seen that so many times. Most recently probably with the reasoning models, like, o1 came out of OpenAI, September 2024, last year. And then there's just been this explosion of like you build like abstraction ladders on top of that. Like first it was re-implementation. Like how do we even do this? And now it's like a lot about the data. What's the right data? What are the right

Starting point is 00:06:14 evals? What are the right training schemes? Like there are so many different axes you can test and publish research in. And like I think the easiest way to do that probably is just work in a field like that, like, has only existed for less than one year. And so no one has any, like, big advantage, I guess. That is mostly correct. I think anyone who jumped on reasoning and RL for LLMs is doing super well. I just saw this morning that one of the recent Stanford grad students who worked on RL, they just started their company and they're worth 500 million. It's, it's like absolutely bonkers right now. It's just like, no product, just three dudes, you know, in some basement somewhere, I mean, undoubtedly cracked, but like also not worth 500.

Starting point is 00:07:01 Yeah, but maybe it's not paying for the product, right? It's like the potential. The ideas behind it. Yeah. Yeah. There was this big shift, in scale, from working with 100 million parameter models. Really what happened is like, I think the companies invested a ton more into training and infra. And like, we all kind of had to catch up, like, you know, me, I go to Cornell, work with a professor there. He has to buy GPUs. Like, should he buy last year's GPUs or this year's GPUs? How many should he get?

Starting point is 00:07:29 We were kind of like trying to figure that out. And there was like a big lag, I think, where basically the seven and eight billion parameter scale, like, there's a huge difference between the BERT-size models, which are 125 million parameters to 200, and then like the eight billion parameters. I mean, obviously it's two orders of magnitude. But just like this idea of emergence, like if you're talking to a model that's 100 million parameters, no matter how well it's trained, it knows nothing. Like if you ask it, like, what's the capital of a state? Or like, if you ask it, who was president of the United States in 1990 or whatever, it'll just always say George Washington because it just associates the words

Starting point is 00:08:11 like president of United States with George Washington. And then when you get to the eight billion parameter scale, suddenly it knows every single president. It knows every single capital of every single country. And I really do think that changes the type of research. you can do. And so, like, it took us a while, I think, in academia to catch up, like, getting good 7 billion parameter models and then running them and getting GPUs to run them. Now I think things have stabilized a lot. Like, we have access to compute and we can kind of like fine tune and inference that scale of models and that's like kind of fine. But there was like kind of two years where everyone in academia was working on like smaller models and none of it really mattered. I think sort of branched

Starting point is 00:08:50 that discussion in two ways. And we should sort of go to your research at some point. But I'm enjoying this because I think like we don't get to talk about this on the podcast too often. One is there's often a bit of advice from the industry people to grad students, which is give up, don't work on models, just do benchmarks, right? Like a really good benchmark will get our attention and then we'll hire you and then you can switch to models later. You have for better or worse avoided that, which is cool. And we can talk about that as well.

Starting point is 00:09:17 But the other thing I think is that around about 7, 8B, maybe 4B is when you start switching from like a single GPU setup to like a distributed setup. And I'm wondering, like, do grad students get HPC training? How much do they teach you of like just how to work with like large clusters of stuff? Oh, to be clear, they don't teach you anything. Like anything, like if you see a paper coming out from even, you know, Stanford, they're probably the best school in AI if you had to choose. And it's not like they're learning how to do like multi-node distributed FSDP

Starting point is 00:09:53 training, like, with whatever, DeepSpeed. You have to learn that from the internet and from other people. And, like, there's no classes that really do that. I mean, that's hard to facilitate, like, as one person. I would say most grad students are doing stuff on single GPU. Some people are doing multi-GPU training. There's probably basically no grad students doing multi-node training. I mean, there's probably a few, especially if they have, like, company affiliations, but that's really unusual, I think. Okay. For grad students who are looking to get up to speed on that, I will recommend the GPU MODE Discord, where basically the PyTorch team is hanging out in there, just waiting to help you.

Starting point is 00:10:33 And then the other one would be the fast.ai team. If you have some kind of thing, Jeremy Howard will basically help you out, and they have some distributed training. Honestly, try to reach out to the DeepSpeed team at Microsoft. Like, actually, they're reasonably accessible. Nobody talks to them. Like, it's so funny. I, like, met them at NeurIPS, and, like, they had nobody at their, like, they were presenting DeepSpeed, I was the only one asking questions. Like, yeah.

Starting point is 00:11:00 Yeah, that's good advice. Listen to this guy. Yeah, I mean, just basically, like, people are there if you want to ask. This is very, very valuable experience. Once you're, like, a GPU God, like, you're basically, you know, in, like, a different tier as a researcher, because you don't rely on someone else helping you out. Like, you can just sort of be your own research engineer, you know? Yeah, I'll comment on that quickly because if someone has been listening to this and also following me online for a while, I think I've made a couple comments like saying something like you shouldn't learn about CUDA or things to that nature. And I'll give some more color to that. So it's definitely a great idea to learn CUDA if you can. I think my point was that if you're trying to enter this space, like learn about the models, learn about how they're trained, what the data looks like, what the compute looks like.

Starting point is 00:11:49 One axis of that is how to do more efficient training and inference. And one part of doing more efficient training and inference is studying the hardware, which is GPUs. So, like, I think that's a very small subset of all possible knowledge that you could acquire. And it's probably not the best place for a lot of people to start. That said, if you do it, you've got to be one of the most hireable people in the world. Like, if you, like, really deeply understand. the architecture of the new GPUs coming out

Starting point is 00:12:23 and how to control it, you're in a very small handful of people and everyone will want to hire you. Actually, the sweet spot is not even CUDA right now. I would say actually it is Mojo. I don't know if you've been paying attention to Modular's Mojo. Oh, I listen to your podcast, man.

Starting point is 00:12:39 You had that guy on the other day. The whole story is Chris Lattner, industry legend, LLVM, Swift, all these things. And now he's turned his attention to the Python-CUDA relationship, right? And he wants to basically create a viable CUDA replacement. It's basically Python married with Rust.

Starting point is 00:12:59 For the last two-and-a-half years, he was basically kind of stealth, not ready for production. When he came on our podcast, he was basically announcing to the world, like, we're open for business. Like, you can use us now for most models, and, like, we actually are faster than, like,

Starting point is 00:13:11 the native, like, sometimes the PTX implementation. I don't know how that works precisely, but he's a compiler and language god. I think there's a narrative, there's one of those windows now, like you said, like, you know, bet early on something when there's a shift. It's one of those windows now when you try to implement things, you basically, like, you know, Modular has 100 people.

Starting point is 00:13:29 If you run into issues, you'll get Chris's personal help on things. Like, I'm not promising it, but like, probably, you know, like, because he wants to work on improving the toolkit. And all you have to do is just, like, it's not really about becoming a CUDA god, because obviously, like, once you ramp up on the general concepts and principles, you can probably translate across ecosystems pretty effectively.

Starting point is 00:13:50 A lot of people switch from like JAX to CUDA. But like the thing is just like being able to experiment very quickly on a limited budget. Like efficiency is not just because you are trying to be an efficiency guru and that's your career and that's kind of boring. But it's really also just about

Starting point is 00:14:06 being able to experiment very quickly in finding these like these ideas. I also think vLLM and SGLang seem like really good and important and here to stay. Like they'll probably just get larger and more complex to accommodate future systems. But if I were like a starting-out grad student and working in that area, I'd probably like want to learn more about how they work.

Starting point is 00:14:31 Awesome. Let's go to your research. I like to mention that I first came across you because of CDE, the contextual document embeddings paper. You can tell me the story about that. But I just want to show you proof that, you know, I get one slot per day to highlight the number one AI story, and you were the slot of the day for October 5th. Oh, no way. Obviously, you were producing work before that, but I thought CDE was a really cool exploration of like, oh yeah, you know, embedding models are kind of like stuck in a rut.

Starting point is 00:15:02 Like, here's actually how to make them very efficient by just doing it in two stages. That seems like a, you know, relatively simple insight that was done very well. But you have a general maybe information theory thing that maybe we should start with, and then we can sort of gradient-descent our way there. Yeah, sure. That sounds good.

Starting point is 00:15:19 So we can circle back on that. That's really cool that you wrote about it. What was that? Almost coming up on two years ago. Yeah, this is the post that you wrote. I called it a new type of information theory. We don't need to go into it. There's this paper about a concept called V-information.

Starting point is 00:15:38 Maybe I'll give like the most simple explanation, which is, say you have two text files. One text file contains a paragraph of information about New York City. And then the other text file contains the same text, but encrypted with, like, whatever encryption algorithm. So it looks like random letters, but if you decrypt it, it has the same text as the first text file. From the perspective of like Shannon's information theory, these two files contain the same information content, like relative to everything, they have the same number of bits. But it's very clear to the observer that the first text file, which is plain English text,

Starting point is 00:16:23 is much easier to read and easier to process, even though they have the same information. And so there's this theoretical framework proposed in this paper, which is "A Theory of Usable Information Under Computational Constraints" from 2020. It really doesn't have that much press. There aren't as many citations as you would think. But I think it's a really, really neat idea. It's like we should measure information

Starting point is 00:16:46 with computational power as a constraint. So they have this idea they call V-information, of how much information is extractable from a given file or code. So in that case, we could say the left text file actually has more extractable information than the right text file. I think that's really good.
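The two-text-file example can be tried out in a few lines. This is only a rough sketch, not the formal V-information definition from the paper: it uses `zlib` as a stand-in for a computationally limited observer, and the XOR keystream cipher, the key, and the sample paragraph are all hypothetical choices made for illustration. Both files decode to the same text, yet only the plain one has structure the bounded observer can exploit.

```python
import hashlib
import zlib


def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (illustration only, not vetted crypto)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:n])


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data with the keystream; applying it twice with the same key restores the input."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


# Hypothetical sample paragraph standing in for the "text file about New York City".
plain = (
    "New York City is the most populous city in the United States. "
    "New York City has five boroughs: Manhattan, Brooklyn, Queens, "
    "the Bronx, and Staten Island. New York City sits on one of the "
    "world's largest natural harbors."
).encode("utf-8")

key = b"hypothetical-demo-key"
cipher = xor_cipher(plain, key)

# The mapping is invertible, so nothing is lost by encrypting.
assert xor_cipher(cipher, key) == plain

# A bounded observer (here: zlib) finds structure only in the plain file.
plain_zip = len(zlib.compress(plain, 9))
cipher_zip = len(zlib.compress(cipher, 9))
print("raw bytes:", len(plain))
print("plain compressed:", plain_zip)
print("cipher compressed:", cipher_zip)
```

The compressor plays the role of the limited model family: with the key, both files are equally informative, but without it the encrypted file looks like noise to anything with a fixed compute budget.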

Starting point is 00:17:08 That captures a lot of our ideas of how these deep learning systems work. Why does pre-training work? If you have two sets of weights and you want to train on some downstream data set, one set of weights is pre-trained, one set of weights is randomly initialized, why is the pre-trained model better at all,

Starting point is 00:17:26 even though it's never seen your data? Maybe one way of looking at that is that it has, it makes the information more extractable somehow. Like there's this concept of like computational processing that you can almost like store up. I like this as just like a lens to view problems with, like how much information is stored where. Like if you get a set of model weights or like an activation vector and you open it up,

Starting point is 00:17:54 like print some tensor or NumPy array, it looks like random numbers, right? Like there's nothing human intelligible about that. But really it's this complex combination of like the training data and the training algorithm which get compressed into model weights and then the actual computation that the model is doing, which involves like manipulating these numbers in ways that we don't understand. So it's like this really highly compressed nonlinear combination of all these information sources

Starting point is 00:18:23 mixed with like computation. And I just think we don't have like the right words for discussing this. I think I like the information theory analogy because back in the day, you know, we had phones and like telegraphs and people were just sort of like building the phone system with these crazy heuristics to like send information across the country or send telegraphs across the Atlantic, people were just

Starting point is 00:18:48 like trying stuff. And then we kind of found stuff that worked and we ran with it, but it wasn't really optimal. And it wasn't until someone came along and proposed this concept of like a bit, like a one or a zero that tells you something. And once we have a bit, we can do all these things. We can like count the amount of information in a signal. We can do really good error correction. We can measure properties of distributions of things, and we can build a really good system for phones, which eventually led to computers. I'm bringing this all up because I don't think we have,

Starting point is 00:19:23 I don't think we know what a bit is yet in terms of like deep learning models. I'm going to graduate from my PhD this year, but I didn't figure it out. So if you're listening to this, maybe you can like, I don't know, spend more time on it or you're smarter than me or you have a group of collaborators you can all get together and figure out what the right lens to look at this stuff is. But even by just asking these questions, I think I was able to

Starting point is 00:19:47 conduct this research agenda that I'm kind of still working on, actually. Yeah. What do you call this field? I don't know. I don't know. I called the post a new type of information theory. I don't think it exists yet, I guess. So maybe it'll get a name once someone actually comes up with the right set of definitions. I think V-information is a really good start. There's a couple of threads. So first of all, you don't know this, but I actually have been trying to accumulate data about Shannon. Like, this is just like a Shannon information theory view of language models.

Starting point is 00:20:23 I have a lot of notes. This is actually on my GitHub for people who are watching along. But, you know, at the limit, if a language model has 175 billion parameters using 16-bit, it will take up 350 gigabytes. You can compare that to Wikipedia. Wikipedia is about 150 gigabytes. You know, let's say GPT-3 can store two Wikipedias. But is that a relevant measure of information storage, right?
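The arithmetic behind that comparison can be checked directly. This is just a sketch of the back-of-the-envelope math using the figures quoted in the conversation (175 billion parameters, 16 bits each, roughly 150 gigabytes for Wikipedia); the Wikipedia size in particular is the host's rough number, not a measured value.

```python
# Figures quoted in the conversation; the Wikipedia size is a rough estimate.
params = 175_000_000_000      # GPT-3 parameter count
bytes_per_param = 2           # 16-bit weights = 2 bytes each
wikipedia_gb = 150            # assumed size of Wikipedia, in gigabytes

model_bytes = params * bytes_per_param
model_gb = model_bytes / 1e9  # decimal gigabytes

ratio = model_gb / wikipedia_gb

print(f"model size: {model_gb:.0f} GB")
print(f"Wikipedias that fit in the raw weights: {ratio:.2f}")
```

This only counts raw storage. It says nothing about how much of that capacity a trained model actually uses to memorize text, which is the open question in the discussion.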

Starting point is 00:20:48 It is not because you can compress Wikipedia a lot. There's a lot of repeated patterns. Tokenization is like the first form of compression. But I think there's a related talk from Ilya Sutskever about how deep learning, machine learning, it kind of is compression. You have a dataset, you compress it into a model that is smaller than a dataset, but generalizes and has like you know some amount of acceptable loss. I think that one of your commenters on the post made this direct comparison with Kolmogorov complexity,

Starting point is 00:21:21 which is how Ilya sees it. So I think like people have this information theory idea or approach to language models. It is just not precise because exactly what you say, like, we don't know what a bit really means. We don't know what like the most, legibility is a word that comes to mind in terms of like, it matters to us that it's human-readable. Like, even if it's SHA-1, SHA-256, I don't care. But like, that is less readable. And therefore, there's more, I guess, I don't know. Entropy is not the word because it's directly convertible. But it's just less useful. Yeah, yeah, yeah, yeah. Useful is a good word. I think maybe useful information or usable information is the right lens. And Kolmogorov is a really interesting connection, like Kolmogorov complexity. I think that's a really good concept for computer scientists. So I'm not sure exactly about this specific talk or like what he was trying to say, but I think that we have a very good understanding of language model pre-training, and there's a deep connection between language models and compression.

Starting point is 00:22:26 Actually, maybe let's start with the embeddings. We can come back to that. Okay. So is this, are we going to the first paper? Actually, let's go to your Wikipedia numbers if you still have access to that. So there's 150 gigabytes for text of Wikipedia. That sounds like pretty high to me. Is that uncompressed, like text files or something?

Starting point is 00:22:45 I grabbed it from NJ Wang, so I don't know. Okay, okay. No, no, I'm probably off. I just sort of have the sense that like when you store text, it's generally like very, very small, especially when you zip it. Maybe he's including all the languages, all the edits. Yeah, yeah, that can make sense. That can make sense because I guess what you say from, you know,

Starting point is 00:23:06 if you want to do apples to apples comparisons, GPT-3 can store two Wikipedias, is that right? 2.3 Wikipedias? 2.5 something, yeah. So I thought it would be a lot more. And this is actually an experiment that you could do. You could like just train a model on Wikipedia and keep training it until you can perfectly extract all of Wikipedia.

Starting point is 00:23:26 And that would be like a good way of knowing like how many Wikipedias can GPT store. I like that idea. But I think this type of like back of the envelope math is, it's really useful for thinking about problems and like grounding yourself in the real world, even if you can never quite answer the questions you want to answer, at least like in four years. If we think about embeddings, you know, vectors that people use for search, we can do the exact same kind of math. So if you use the OpenAI embeddings, which last time

Starting point is 00:23:54 I checked, I think have 1,536 dimensions. So if you say there's 16 bits per dimension, like half precision floating point, it's something like 20 kilobytes of information in a vector. And if you want to store 20 kilobytes of text, that's a lot of text, like many, many paragraphs that you can perfectly compress into 20 kilobytes. And so I think this is kind of like the idea we had, I'll give you the practical explanation, which was, well, first of all, I'm a second year grad student. I'm like going to these conferences, seeing all these other things people are working on and thinking, you know, like, what the heck? Like, how am I going to like have my own little area to do work in that no one else is working in already? And so I spent a lot of time

Starting point is 00:24:40 coming up with bad ideas. My advisor would say, no, like that's not a good idea to work on. Many times this happened. And like even my first year and a half of grad school was like a lot of exploration and a lot of like coming up with bad ideas. And then honestly, I'd be interested to see how he remembers it. But I think I wrote a sequence of proposals about different projects.
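The embedding math from a moment ago can be written out the same way. This is a sketch under the assumptions stated in the conversation (1,536 dimensions, 16 bits per dimension), plus one more assumption added here: plain ASCII text at one byte per character. With those numbers the raw capacity comes out to about 24.6 kilobits, or roughly 3 kilobytes, so the "something like 20 kilo" figure quoted aloud is closer to the bit count than the byte count. The point being made still holds: that is room for several paragraphs of text in a single vector.

```python
# Figures quoted in the conversation ("last time I checked").
dims = 1536          # embedding dimensions
bits_per_dim = 16    # half-precision floats

total_bits = dims * bits_per_dim
total_bytes = total_bits // 8

# Added assumption: plain ASCII text, one byte per character.
ascii_chars = total_bytes

print(f"{total_bits} bits = {total_bits / 1000:.1f} kilobits")
print(f"{total_bytes} bytes = {total_bytes / 1000:.1f} kilobytes")
print(f"room for about {ascii_chars} ASCII characters")
```

This is an upper bound on raw storage only. How much of it an embedding model actually uses to encode its input is exactly what the inversion work sets out to measure.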

Starting point is 00:25:05 And then I came up with this idea. I was like, oh, we should just try to do as well as. we can to reverse engineer the text that's in embeddings. And then we were talking about it. He was like, oh, yeah, you should just do that. And then that was the end of the proposals. And then that was just working on that problem for a long time, which at the time I was really motivated by that because I was like, cool.

Starting point is 00:25:23 Like my first, as a grad student, my first sort of like official like sign off on like coming up with a good research idea. And at the same time, there was this big rise of this startup business model called like a vector database. and there are all these companies popping up, raising money, raising money, getting like crazy funding. And then actual applications being built

Starting point is 00:25:47 that do something where instead of exchanging customer data, they exchange vectors. So we had this like very grounded question of like what data are they actually sending when they send the vectors. Like first of all, you have this information theoretic argument that when you send one vector, there should be a lot of text recoverable

Starting point is 00:26:05 just in terms of like a lot of these things represent very short documents, but they actually have many, many bits. So, like, the problem seems tractable. And then second of all, we had this justification of how the product is actually being used. Like, if someone hacks into a vector database,

Starting point is 00:26:21 what do they actually find, if that makes sense? So we were working on that for a while. I think I have the talk that you did that Sasha highlighted is this one. Oh, yeah, yeah. Maybe that has the graphic that would kind of... Oh, go one before, I think. One before? This is actually

Starting point is 00:26:38 Yeah, this one's good. Yeah. I like having visual aid. I like giving people breadcrumbs to follow up if they're interested in digging more. But yeah, I remember this is a pretty hot area of research at the time. And there's been some really interesting follow-ups. Like we ended up building a system that can do this quite well, like taking an embedding. And I think our highlight number is like at a certain length, like a long sentence length, we can get 90% of the text back exactly.

Starting point is 00:27:07 and a lot of people were able to do stuff with that. Like they can, for example, I know these people that work on a problem of like debiasing embeddings and like in one data set they do something. They have a procedure for like removing all latent features that correlate with gender so they can produce like useful embeddings that from some perspective have no like information or usable information about gender. And they'd been doing that for a while. They actually just used our tool.

Starting point is 00:27:39 And they, so, like, they would put in a sentence, like, this woman is a doctor at Weill Cornell Hospital in New York. Or I would say, this woman is a doctor. She works at Weill Cornell. And then they would run their procedure. And then they run our embedding-to-text model. And now it would say, like, this person is a doctor. They work at Weill Cornell, which is pretty cool.

Starting point is 00:28:00 So they have, like, sort of text-based evidence that their method is actually removing gender features. but let me talk for a second about the research phase here because I thought it would be, I mean, I know if you ever heard me talk about this, I probably told you about it, but just for a wider audience, I like thinking back on this because it was probably my, in some sense, like my greatest victory of grad school was like working on this embedding inversion problem for a while, for quite a while, and proposing a lot of approaches and like testing stuff. I think sometimes you do stuff and it's clear. It was a

Starting point is 00:28:36 idea. Sometimes you think you should have figured it out earlier. And then sometimes you do stuff and you kind of realize it's really complicated and probably not worth it. So I was testing different decoding algorithms for embeddings that are closer, or text that's closer to the text that's in embeddings. And I was testing these kind of like inference time adaptation models for samplers. I think we tried a lot of architecture and like kind of training tweaks. We should have tried RL, I think that would work. But finally we found something that ended up working. And I guess I'm just saying this all because I thought it was like so rewarding. Like we were just banging our heads against the wall. I would have biweekly meetings with my advisor who kind of suggest things.

Starting point is 00:29:21 Sometimes we would agree we were mutually stuck. Sometimes I don't get feedback one way or another and try something new or try a couple things. And we had this idea that it was possible from the information theory arguments and this other thing where we would we would kind of like take our best guess at what the text was and re-embed it and see that it was kind of far from the true embedding. So we had this proof that like a better method could like leverage this kind of information. And then when we finally solved it, it was, it was awesome. Like we had this number that was like 30 for months. I think at one point I got it to 35. And actually I think I was like, oh, I'm done. Like I got it to 35. And my advisor told me,

Starting point is 00:30:03 You can't really just propose a new problem and show you pushed a metric from 30 to 35. That's confusing and probably not that meaningful to people. And I think I was, you know, that was kind of like a local minimum for me where I was like bummed. But then we ended up getting the number to like 97 or something, which neither of us knew was possible. We were all just, we were just kind of staring at this graph like, oh my God, like, who knew you could get this much information from an embedding? And that was like so great, like just sort of this, it was so rewarding. It was invigorating, honestly, like that research process of like,

Starting point is 00:30:39 we picked a good problem and then we spent so long trying stuff that didn't work, which I'm probably forgetting how frustrating that was. I'm sure it was terrible, but then like actually solving, or at least like coming up with a much better way of solving the problem, I don't know if I'd say we solved it, but we definitely learned a lot from where we started, was like, was great. And it completely solidified for me the fact that I should have gone to grad school

Starting point is 00:31:04 to have this life experience and makes me want to do research forever. You're clearly sort of in love with the journey, which I think is important because this is what keeps you going through the tough parts. Is this a good time to talk about the universal geometry side then? Yeah, yeah, yeah. Let's do that next.

Starting point is 00:31:22 I think that's a good idea. So we have this more recent follow-up. So the first part I was talking about ended up in this paper called Text Embeddings Reveal (Almost) As Much As Text, which was published in 2023. And then we recently had a paper come out on arXiv, which will hopefully be published at some point. And it's called Harnessing the Universal Geometry of Embeddings, which was also, that was probably like the only other time I felt like we've made like, maybe there have been

Starting point is 00:31:52 two more times, but that was probably the second of three times where I felt like we made like a real discovery about like the unknown. And it was like very rewarding just for its own intrinsic kind of elusiveness. And I'll start from explaining it in terms of the prior paper. So we built a system that can do embeddings to text, and it works very well, and we're very pleased with ourselves. And then we went to a conference. We talked to people about it.

Starting point is 00:32:18 We talked to like the vector databases. I think some of them changed their privacy policies, which was like somewhat gratifying. And then we kept getting this perpetual question, which is like, well, you're just assuming we use the OpenAI model, or you're just assuming we use the most popular text embedding model. If they fine-tune their own model, or if they use a model that you're not training an adversary for, then you can't solve the problem, which is, like, true. Like, none of the vector-to-text stuff works unless you have this assumption of, like, knowing the encoder and also being able to make a lot of queries to it. But we had this kind of underlying theory that all of the models learn very

Starting point is 00:32:59 similar things. Like we have some preliminary evidence for that, like certain models that are fine-tuned from the same base. You can kind of swap their representations without doing much. Or if you look at the nearest neighbors, a lot of the models will give you the exact same nearest neighbors, even though they have completely different training bases. And then there's this paper that came out last year called the Platonic Representation Hypothesis from some folks at MIT, which is really, really compelling. And I think just like great intersection of philosophy, representation learning, deep learning, research. I love this paper and it's such a beautiful idea, which is something like all models are trained on data from the world. And there's only one world. And so as the models get better by

Starting point is 00:33:43 scaling data and scaling model size, they're sort of converging to learn the exact same thing. And in this paper, they have evidence based on correlations for doing this with vision and language models. It's very neat. And so we saw this. So basically think about, you know, you're us. You see this platonic representation hypothesis paper. A lot of people have this shared idea. Like, you know, Claude and GPT-4 probably do a lot of very similar internal computation because both of them are trained on trillions of tokens of human written text.
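The nearest-neighbor observation mentioned a moment ago (different models returning the same neighbors) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the embeddings are random toy data, "model B" is simulated as a rotated copy of "model A", and the function names `knn_indices` and `mutual_knn_overlap` are made up for this example rather than taken from any paper's code.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of the k nearest neighbors (by cosine similarity) for each row."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def mutual_knn_overlap(emb_a, emb_b, k=5):
    """Average fraction of k-nearest neighbors shared between two embedding spaces."""
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 32))

# Hypothetical "model B": the same geometry expressed in a different basis.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
emb_b = emb_a @ rotation

# Control: an unrelated embedding space for the same 200 items.
emb_random = rng.normal(size=(200, 32))

overlap_rotated = mutual_knn_overlap(emb_a, emb_b)
overlap_random = mutual_knn_overlap(emb_a, emb_random)
print(f"same geometry, different basis: {overlap_rotated:.2f}")
print(f"unrelated spaces:               {overlap_random:.2f}")
```

The raw numbers in `emb_a` and `emb_b` look nothing alike, yet the neighbor sets agree, which is the sense in which two models can be "computing the same thing" in different coordinates.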

Starting point is 00:34:16 Even if they have different architectures, like maybe, you know, the actual basis or like the numbers, if you look at them, look different. But in some way, they're like kind of computing the same thing. And I think it's even more true with these embedding models, which have really only one objective that works. And they're probably all trained on, like, MS MARCO, which is a really popular dataset, and pre-trained maybe on Wikipedia. But we wanted to basically combine this platonic representation hypothesis idea with the

Starting point is 00:34:46 vector-to-text thing and produce a system that can like align models so that we can do embedding inversion. But you know, it's valuable for more than just embedding inversion. But you can use this to kind of glue together models. That's what actually got me super excited. And by the way, I think there's a few related threads. I think we did an episode with Nicholas Carlini where he had

Starting point is 00:35:07 an extraction attack on one of the GPT models and they got it fixed. The other thing I want to, I just realized to spell out for people just in case they're not thinking it through. Being able to invert embeddings also means that you can back out secret prompts

Starting point is 00:35:23 or context that might leak customer information. That's potentially harmful and obviously an attack vector issue. I think one of the things I had a question about was whether or not positional embedding does affect it. And whether extension of positional embedding is affected, because obviously, like, contexts are going to get longer and longer. Your ability to invert will obviously decrease with longer context. What now? Maybe not that important. No, no, no, no. You're totally right. So we're operating in this space in our work where the sequences are relatively short and the

Starting point is 00:35:58 embeddings are relatively large. Like, I think we're kind of at a great advantage from that perspective. And you're definitely right. Like, if you embed an entire book to a 500-dimensional vector, there's just no way you could get the entire book back. Like, there must be this, these kind of collisions. Like, you know, in information theory, like, if you have lossy compression, two different inputs map to the same code, which means that you can never determine which

Starting point is 00:36:27 input formed the code. And I think that's probably what will start to happen. Like if you have two books and you swap just one word and you embed them, I don't know, someone can try this, you'll probably get like a perfect collision. And in that case, inversion is impossible. And even like when you don't take it to the limit, it probably just gets very, very hard. Like things get super compressed. So I don't know how well this scales. Like it's a great question, like exactly how much information you can sort of cram into one of these vectors and I don't have a sense of where the boundary is. It'd be interesting to talk to one of the linear algebra people from the math department on like how literally can we take inversion?
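The collision argument above is a counting argument, and a rough sketch shows the scale involved. Everything numeric here is an assumption for illustration: a 500-dimensional vector stored as 32-bit floats, a 32,000-token vocabulary, and token counts for a "short text" and a "book". Treating tokens as uniformly random overstates the information in real text, so the text-side figures are worst-case ceilings, not measurements.

```python
import math

def embedding_capacity_bits(dims, bits_per_dim=32):
    """Hard ceiling on what a float vector can hold: one machine word per dimension."""
    return dims * bits_per_dim

def uniform_text_bits(num_tokens, vocab_size):
    """Bits needed to single out one sequence if every token were uniformly random."""
    return num_tokens * math.log2(vocab_size)

VOCAB_SIZE = 32_000  # assumed tokenizer size

capacity = embedding_capacity_bits(500)          # 500 dims * 32 bits
short_text = uniform_text_bits(32, VOCAB_SIZE)   # a short passage
book = uniform_text_bits(100_000, VOCAB_SIZE)    # a book-length input

short_text_fits = short_text <= capacity
book_fits = book <= capacity

print(f"vector ceiling: {capacity} bits")
print(f"32 tokens:      {short_text:.0f} bits, fits: {short_text_fits}")
print(f"100k tokens:    {book:.0f} bits, fits: {book_fits}")
```

When the input needs more bits than the vector can hold, distinct inputs must share a code, which is the collision case where exact inversion cannot work. The real boundary is likely well below the 32-bits-per-dimension ceiling, which is the open question raised in the conversation.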

Starting point is 00:37:06 Like, you know, how, like, what measures of a matrix do they have where we can like kind of run that and like try to get some meaningful information out of that? This is like where information theory starts to collide with linear algebra and all the other stuff. Totally, yeah. There's always this detail where we're running these on computers, and so we don't actually have, like, real decimal numbers or real numbers. We have, like, floating point representations of numbers, which are, like, very, it kind of, like, throws a wrench into the mix. Do you have any consideration of, like, superposition when, like, sort of nonlinearity?

Starting point is 00:37:42 Like, you keep, like, stuff information in the lower bits, but I don't know if that matters. I really don't. It's just, like, a nice thing to think about. Yeah, yeah, it is a great question. And I get a sense that a lot of the less important bits are more useful for computation. And maybe the higher order bits are more important for storing data or something like that. But I'm not sure. These are the kinds of questions I'm actually hoping to explore over the next few years. Like I'll skip ahead for a second. So we have this result that's like maybe the third sort of like discovery I was alluding to,

Starting point is 00:38:16 which is like a way to measure the exact capacity of a language model. and we get this number, if you train a language model on a ton of random data and you measure its rate of memorization, yeah, can you open the right curve? This is sort of the discovery I'm talking about, no matter how you scale the training size,

Starting point is 00:38:33 you hit this perfect-perfect-ish plateau in auto-memorization, which we call the model capacity. And the question I've been stuck on in the back of my mind for a while is like, how is that actually implemented? So, like, this is a transformer that is trained for many, many data points,

Starting point is 00:38:51 and many, many training steps. And so, like, it's almost, like, if you have, okay, the 10 to the 6 point on the X axis, the capacity, we don't have to actually say the numbers, but it's basically perfectly dividing its computation between all of the data points. Like, every one of the 10 to the 6 data points gets, like, a tiny sliver of the model parameters

Starting point is 00:39:15 because they're completely independent random strings. So I don't really know if superposition is occurring here. Like, it seems possible to me that the model would learn, like, these completely independent columns of computation, one per data point. But it's also possible it's learning some kind of, like, combined thing where it's, maybe it learns like a load and a store, and it's, like, loading and storing bits using these generic operations. And then in the end, it reconstructs the random string. So even though, like, the data is completely independent, the kind of, like, compute is very similar in terms of like predicting random strings.
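The plateau described above can be pictured with an idealized model: memorized information grows with the amount of random data until it hits a fixed ceiling, after which each data point gets a smaller share. This is a sketch of that picture only, not the measurement procedure from the paper. The capacity value, string length, and vocabulary size are all hypothetical.

```python
import math

def dataset_bits(num_strings, tokens_per_string, vocab_size):
    """Information content of uniformly random token strings, in bits."""
    return num_strings * tokens_per_string * math.log2(vocab_size)

def idealized_memorized_bits(num_strings, tokens_per_string, vocab_size, capacity_bits):
    """Memorization tracks the data size until it reaches the capacity ceiling."""
    total = dataset_bits(num_strings, tokens_per_string, vocab_size)
    return min(total, capacity_bits)

CAPACITY_BITS = 1_000_000  # hypothetical ceiling, not a measured value
TOKENS_PER_STRING = 64     # assumed
VOCAB_SIZE = 2048          # assumed; log2(2048) is 11 bits per token

curve = {}
for num_strings in (10**2, 10**3, 10**4, 10**5, 10**6):
    total = idealized_memorized_bits(
        num_strings, TOKENS_PER_STRING, VOCAB_SIZE, CAPACITY_BITS
    )
    # Store (total bits memorized, average bits per string).
    curve[num_strings] = (total, total / num_strings)

for num_strings, (total, per_string) in curve.items():
    print(f"{num_strings:>9} strings: {total:>12.0f} bits total, {per_string:.2f} bits each")
```

Past the ceiling, the total stays flat while the per-string share shrinks, which matches the "tiny sliver per data point" description. How a transformer actually divides that budget internally is the part the speaker says is unknown.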

Starting point is 00:39:53 But yeah, I guess this is all to say like about superposition and everything. I have no idea how the mechanisms are actually implemented inside the models. And that's like one thing I'm hoping to learn about in the next couple years. It's a reasonable question whether it's meaningful to learn. I think there's a lot of things that is like nice to know, but maybe not that useful. Latent space alignment is very, very useful. data set efficiency in theory cool but like practically people are just going to go for the biggest data set they can

Starting point is 00:40:25 like the scaling laws are kind of worked out insofar as like the relationship of compute, data, amount of memorization. I don't know. I think maybe this is a good point to maybe also bring in the idea that Andrej has been pushing for the last like a year and a bit of the cognitive core, like what is the dumbest possible model that knows nothing, but it's smart enough for tool use to do everything else, right? So you can run it on

Starting point is 00:40:52 device and fast inference, it's open source, whatever. So Gemma 3n is like a really good candidate right now because it's like a 4B model that is like claimed to be better than Llama 4 and GPT-4.1 according to, you know, certain arenas that shall not be named. This is where things get complicated. It feels like language models kind of implement things and know things almost in the same way, and it's like really difficult to disentangle, like, whether they're memorizing facts from whether they're like learning useful ways to generalize about new stuff. But I agree this would be really nice. I don't think we have a lot of evidence that we can build a system like this that like is really, really good at reasoning, but really dumb about the world. Like, I don't know if we

Starting point is 00:41:40 have the tools. Yeah, maybe, maybe not. I think the existence proof is humans, right? People always lean on humans as like the existence proof. It's not a great existence proof because I think if you talk to people about the number of neurons that we have and you make a neuron roughly equivalent to a parameter, we have something like 100 trillion in our brains. So like, and like we consume like 20 watts of energy. Like that's nothing. Like we're so much better than language models. It's not even funny. And then the last feature of us is that we're self-pruning, which is not something that language models do as well. Oh, like we forget stuff? No, like we are not deeply, densely connected. Like we, like, connections will drop, therefore we're more efficient, you know?

Starting point is 00:42:21 See, I'm like a language model where everything is always connected all the time. Yeah, or you like, you preset the skip layers or whatever, and that's it. You know, like, it's not, it's not really actually anything we've evolved with learning. It's just like something you do based on ablations and like, guesstimates. Even if we did want that, I'm not sure if we have, like, the right frameworks or methods for actually building, like, what you're talking about yet. I think the world is much closer to where you're at than where Andrej is at. Andrej is, like, kind of wishing for an optimistic world.

Starting point is 00:42:53 Our conversation with Noam Brown was, like, yeah, reasoning is emergent. If you gave the o1 harness on top of GPT-2, you would get nothing because GPT-2 didn't know enough. You need a GPT-3 and GPT-4 in order to then get o1, like with GPT-4 as the base model, which is like, yeah, that's, I mean, that's reasonable. The way I put it is like in order to use tools, you need to, like, in order to search Google, you need to know at least search terms in order to like then search Google and then learn what you need. And if you don't know what to search, then like you might just be too dumb. I like the kind of ethos, like maybe you could do some kind of pre-training where whenever the model

Starting point is 00:43:34 doesn't know something, it can just Google for it, and that way you try to encourage it to learn words without, or like to guess words correctly without actually storing the information into its weights. Yeah. It seems like a nice, like, goal at least. Yeah, you need some kind of online learning, probably, or memory, and some combination of that. Yeah. It's exciting, you know, like, I think, like, if that is the direction of where this all ends up, that's great. But, like, people are not doing that. Instead, we're building, you know, $500 billion data centers in the middle of Texas.

Starting point is 00:44:11 And, like, you know, all hail the God cluster that just will, you know, eventually wrap around the sun and consume solar energy because that's what we need. Do we finish out the universal geometry thing? Let me finish the kind of methodological description. So we had this goal. So, yeah, back to the embedding universality.

Starting point is 00:44:32 We started with going from embeddings to text. We know about this platonic representation hypothesis. And maybe I'll skip over the details, but basically we took inspiration from computer vision, from this model from 2017 called CycleGAN, which is, among other things, it's a way to map between two different distributions without any underlying notion of which thing should be mapped where. It's just based on some kind of idea of closeness. So like the cool thing about this, if you look at the top left, so I guess the top left is Monet, so impressionist paintings, and this picture on the right is a photograph. So like, it's learning this kind of like semantic notion of what content goes where just by

Starting point is 00:45:21 mapping a distribution of Monet pictures to a distribution of photographs without actually telling it which Monet picture should map to which photograph. It's kind of a subtle point I'm making. It takes a little bit of time to wrap your head around, or maybe like go to the middle one, if you don't mind, the zebras and the horses. So like, it's clearly learning like what an animal is and what legs are and sort of like more abstract stuff like what the camera position should be and what grass is and stuff like that. And it's learning like what a horse that looks like a zebra is, which is actually like a complicated semantic concept. Like we don't have a data set that has a horse and then that horse

Starting point is 00:46:02 as a zebra. We just have separate horses and separate zebras, but somehow this GAN system is able to like elicit this sort of mapping property. It's like kind of a magical connection that it learns. And I'm still like in awe that it's possible at all. But we more or less like repurpose this system and like we've built our own but like this idea we took it and we applied it to model embeddings where instead of zebras and horses we have like BERT embeddings and GPT embeddings or like two completely different models with different architectures. So I think these are GTR, which is a T5-based retrieval model, and GTE, which is based on BERT. So they have different training data, different architectures, different downstream objectives, different embeddings. But yet when we do

Starting point is 00:46:50 this CycleGAN in the embedding space, they just perfectly sort of snap to the same place, which is amazing and has some pretty deep implications of like the platonic stuff, like maybe the models actually are learning a lot of the same functions or something, and in some semantic way they're like very close. And yeah, this is a diagram of how our system looks. It's weird to me how profound it seems. You seem like deeply impressed by it. And then the other thing is like when we talked to Emmanuel from Anthropic, who did the circuit tracing and mechanistic interpretability work, they were like excited that like the same thing in different languages maps to the same circuits. And I'm like, isn't that what you would expect? I don't know, like, why. Like, I don't

Starting point is 00:47:40 know. I think I feel like this feels more profound to you than it does to me. I'm like, yeah, obviously. No, that's so fair. Maybe it's just like self-congratulatory. I'm more happy that we're like the people that got it to work. Yeah, exactly. Yeah, it does seem obvious in retrospect. And I think that's like constant feedback I've gotten from research from, you know, people tell you that this seems obvious to them. But you have to realize that like you came from a perspective of no one ever having done this before and they're coming from a perspective of you telling them it's true. And like if someone had told you that this was true, it would be like maybe obvious to you too, if that makes sense.
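The unpaired-mapping idea borrowed from CycleGAN in the discussion above rests partly on a cycle-consistency term: translate from space A to space B and back, and you should land where you started. The sketch below shows only that one term on toy data. It is not the system from the paper: the maps are plain linear matrices, the "two models" are a random point cloud and a rotated copy of it, and the name `cycle_loss` is made up for this example. A full setup would also need an adversarial term, which is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy stand-ins for two embedding spaces: B is A in a rotated basis.
emb_a = rng.normal(size=(100, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
emb_b = emb_a @ true_rotation

# Shuffle B so row i of A no longer lines up with row i of B (unpaired data).
emb_b_unpaired = emb_b[rng.permutation(len(emb_b))]

def cycle_loss(a_to_b, b_to_a, batch_a, batch_b):
    """Mean squared error after the round trips A -> B -> A and B -> A -> B."""
    a_round_trip = batch_a @ a_to_b @ b_to_a
    b_round_trip = batch_b @ b_to_a @ a_to_b
    loss_a = np.mean((a_round_trip - batch_a) ** 2)
    loss_b = np.mean((b_round_trip - batch_b) ** 2)
    return float(loss_a + loss_b)

# A map and its true inverse give a near-zero round-trip error.
consistent = cycle_loss(true_rotation, true_rotation.T, emb_a, emb_b_unpaired)

# A map paired with an unrelated matrix does not.
inconsistent = cycle_loss(true_rotation, rng.normal(size=(dim, dim)), emb_a, emb_b_unpaired)

print(f"consistent pair:   {consistent:.2e}")
print(f"inconsistent pair: {inconsistent:.2e}")
```

Note that the loss never uses which row of A corresponds to which row of B, so it works on unpaired data. It also cannot pick the right map by itself, since any invertible map and its inverse score zero. In CycleGAN-style training, the adversarial term is what pushes translated points to look like real samples from the target space.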

Starting point is 00:48:18 The way I put it is that we have the intuition, but not the proof. You have to, you did the work and you have at least some evidence that it's true, whereas we just have intuitions, right? So part of research is just confirming intuitions. The applied part comes from like, okay, now that you know this for a fact, what do you do with it? Yeah, right.

Starting point is 00:48:38 I think the details can be really interesting, like the details of the proof, like which models are most similar to one another, and to what degree can you get them to align, and on which distributions does this property actually emerge? And, like, that's why reading papers can be fun sometimes is because they kind of answer all those little questions. Yeah, I would say, okay, I've pulled up something very current,

Starting point is 00:48:59 which is Gemma 3N, which launched, which sort of was generally available yesterday. I would say for me, and you can correct me if I'm wrong, the most immediate implication is mapping adapters to language models. So the dream is that you have a language model backbone. Let's say this one is like a 2B language model backbone. And then you offload your vision. So you only load in the vision.

Starting point is 00:49:22 and parameters or the vision adapter when you need vision. You only load in audio, you only load text to speech, whenever you need it. Because these are all separately trained. You're just sort of aligning latent spaces and you can sort of train them separately. And I think this helps to make us more confident in, one, it's more efficient. That's a given. Two, it makes us confident that we can just add capabilities without taking away or catastrophically forgetting others.

Starting point is 00:49:52 So they're just sort of like stacking more parameters. Just stackable. So that's very cool. Swappable, stackable. It's like a fatter version of LoRAs that is not really that model specific. I would say Apple and Google are pursuing this for their on-device stuff, is my sense. Is this open source? Gemma, yeah.

Starting point is 00:50:12 For a given definition of open source, which is like we released the weights on Hugging Face. Here you go. Oh, that sounds like open source to me. Oh, yeah. I guess it's open weights, but not the data. Not the data, not the code. Not the code, yeah, right. Yeah, I would say that this is quite SOTA in terms of efficient models.

Starting point is 00:50:29 Maybe SmolLM, also from Hugging Face, would be also in that kind of degree. There's not that many people working on very good, very efficient models. Yeah, this is a very deeply related question and something that really interests me, which is like, what is the limit of like a 100 million parameter model? Like if you imagine, you know, 100 years from now when we have maybe, our computers are gelatinous blobs and we all communicate through telepathy, will we have 100 million parameter models that are at the level of today's o3-pro or whatever? And if so, like, how would that even be the case, like based on scaling laws? Like, do we have special data?

Starting point is 00:51:12 Do we come up with like a brilliant new training scheme or some type of magical architecture? Like, I really don't know. Or maybe we really are at the plateau already. I don't know. It seems like when we are doing things like calling a 27B model small, like that's what Mistral's doing. You know, we've plateaued a little bit in terms of like what we can do to compress things. I have a fun theory that this is where we mix quantum computing with models. Like we have to change what a parameter means. We have to search through very high dimensional space and resolve them much quicker than we can with like conventional compute.

Starting point is 00:51:51 That would be my pie in the sky thing. I said 100 years, that's very reasonable to me. Throw quantum at it. Yeah, I probably have to get a second PhD to know what's going on there. I think that we should establish the definition of small model as being a model that a grad student can inference at reasonable time on a single GPU, which is probably like 7B. Maybe. I don't think 27 is small under any reasonable.

Starting point is 00:52:25 Is it, is it MOE? Mistral? I don't think so. I think I'll, I think their stuff is default dense. Don't call me on that. This is coming off of just a lot of pre-trained data that is potentially collided. Okay, there was one, there's two more papers that we wanted to cover and then we can sort of wrap it. You had an approximating, you had a language model training data. I think this is a little bit also newer. How does this rank in terms of your overall work? Yeah, let's return to the kind of information theory questions. So, yeah, maybe we'll skip over the contextual embeddings in the case of time,

Starting point is 00:53:02 but we'll group those papers. Great paper. Hopefully people start training with that technique. It's kind of a free lunch. Those questions are all about information and model activations. Like how much can we recover from this given vector? Or like what data does this vector represent or what computation does this vector represent. And there's really two types of, like if you want to taxonomize, there's two types of,

Starting point is 00:53:29 whatever you call it, dense information storage mechanisms. One of them is activations or embeddings, which we were discussing already. And then the other is weights, which are the things that are used to perform the computation, but not the computation itself. and so we have now two papers in this direction of what is stored in the weights. The first one is about language model capacity, which is called how much can language models memorize or how much do language models memorize? I never remember which one we settled on. And then the other one is called approximating language model training data from weights.

Starting point is 00:54:06 The first one is like, I think has a lot of deep messages about how language models store information and how they work in general. The second thing is like a proof of, of concept of maybe like a longer term research project. Let's start with the capacity stuff, if that's good with you. Do I have the paper for that? I don't know. You know, we can return to the question you asked me, which is something like, why do we care or like what is this useful for? And I don't know if I have a good answer for this. I think this is somewhat profound. Like it's kind of like in physics, you know, when they try to measure these constants, like gravity. People tried to measure the rate of acceleration of gravity for a long time, or like those Greek guys, like, back in the BC era,

Starting point is 00:54:50 when they were trying to approximate the radius of the earth based on shadows, we're trying to take the GPT architecture, like the main one, and just measure how much information it can store. And we did this through the lens of memorization, which I think we can skip over for the podcast, and we'll just talk about, like, information, storage, and weights. Like, these curves, to me are pretty crazy. Again, maybe it's like the sort of discoverer's fat folly or something where I'm like, oh, this didn't exist before, so it seems so cool. But then you're saying like it seems somewhat obvious.

Starting point is 00:55:27 No, no, no, don't let me take that away from you. Yeah, no. Again, like I independently was asking how come there's not enough people exploring LLMs from information theory. And then like you come along and your embeddings work becomes like an information theory exploration and I'm like suddenly like I'm very aligned to like exploring this, promoting this and encouraging more people to figure it out because like that's ultimately how we figure out this whole compression issue and you know what Andrej wants, which is like the cognitive core, right, the most

Starting point is 00:55:55 efficient model for the most capability like that is an information theory question totally agree with that we could start here like so so transformers that are trained in 32-bit precision we approximate can store about 3.6 bits of information to maybe 3.9 bits somewhere in there per parameter. And like, why is this? I mean, for some perspective, this is quite bad. Like, if you have 32 bits available and you can only use 3 to 4 of them,

Starting point is 00:56:26 like, just store 32, bro. Yeah, yeah. Then you'll like, you know, you could build your own AI lab if you can make these models that much more efficient. I don't know how they're implementing this mechanism or where the kind of bottlenecks come from or even now that we know this, what it's necessarily useful for. I guess the tools that would be interesting to me are knowing, like, given a data set, if you could predetermine the exact model size and maybe architectural properties required to get a certain level of performance, that would be

Starting point is 00:57:02 really neat. And like, we don't even know how to do that. We don't even know what the difference is between doing LoRA training, which trains less than 1% of the parameters, and full fine-tuning, which trains all the parameters. We don't even really understand the difference there. So I think this is maybe like a baby step sort of in that direction, but there's a lot of unknown ahead of us. Okay. Do you think this is a hard limit? Do you think someone can come up with a better algorithm or better architecture and then sort of just change the slope? There are two axes here. One is the ability of the model to store data. And I think we can definitely improve that. I think, like maybe even if we tested this with the Llama architecture, like there's sort of like a GPT plus plus

Starting point is 00:57:41 architecture. Like I would guess that can store better data just because the kind of numerical flow is a little bit better. The nonlinearities are maybe like a little bit more suitable to training. Like that will probably raise the bound a little bit. And then the second axis is that our measurement tools are just not that good. Like this is, you know, me, I'm a grad student. I'm running all these hyperparameter sweeps and sort of like where we draw conclusions from

Starting point is 00:58:03 them, but even that being said, there are probably ways to measure this better, but all that would do is push the number up. So it's possible, like, there is a way to store five bits per parameter if you have, like, a better optimization technique, or if you were a super genius and you could just perfectly set the weights to store the data, then maybe you can do better. And this is just sort of like what we can reach through optimization is this 3.6 bits per parameter. But I would be happy if someone came along with a much better measurement tool. Like, this is just sort of like the first measurement. I mean, I would guess in the future, like, you know, people will look back and say like,

Starting point is 00:58:44 this is like somewhat off in one direction or another for whatever reason. And that's just how science goes and I have no problem with it. What we do is we call this the Morris Constant, 3.6, right? And then we set like a, like a challenge, like a little leaderboard of like beat this, right? And like let people go. Yeah, that assumes we know the true constant ahead of time and we can measure the error rate. It's doable. You laid it out here. Yeah, yeah, yeah, yeah. That makes sense. One minor doubt I have is like the goal actually isn't memorization. It's generalization, right? The best memorizer model may not be the best generalizer model.

Starting point is 00:59:26 and this incentivizing people to max this number might actually just be fruitless in terms of actual intelligence. Like you just get the best actual compressor. Like you're just going to get GZIP. That's totally true. And there's this pattern in research, you know, time after time. It's like someone poses a question and then people answer it over and over and over again. But it's often much more fruitful to just ask a new question. Maybe it just doesn't matter how much GPT models can store. We should just like work on something else. We'll figure that out. Did you want to dwell on this site at all?
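For a sense of scale, the 3.6 to 3.9 bits-per-parameter estimate quoted earlier in the conversation can be turned into rough arithmetic. The bits-per-parameter range and the 32-bit weight width come from the discussion; the 100 million parameter model size is borrowed from the earlier hypothetical and is only an example, and the helper names are made up for this sketch.

```python
BITS_PER_PARAM_LOW = 3.6    # lower estimate quoted in the conversation
BITS_PER_PARAM_HIGH = 3.9   # upper estimate quoted in the conversation
STORED_BITS_PER_WEIGHT = 32 # 32-bit training precision

def capacity_megabytes(num_params, bits_per_param):
    """Estimated memorization capacity, converted from bits to megabytes."""
    return num_params * bits_per_param / 8 / 1e6

def utilization(bits_per_param, stored_bits=STORED_BITS_PER_WEIGHT):
    """Fraction of each weight's stored bits that ends up holding memorized data."""
    return bits_per_param / stored_bits

num_params = 100_000_000  # example size only

low_mb = capacity_megabytes(num_params, BITS_PER_PARAM_LOW)
high_mb = capacity_megabytes(num_params, BITS_PER_PARAM_HIGH)
weights_mb = num_params * STORED_BITS_PER_WEIGHT / 8 / 1e6

print(f"weights on disk:    {weights_mb:.0f} MB")
print(f"estimated capacity: {low_mb:.1f} to {high_mb:.1f} MB")
print(f"utilization:        {utilization(BITS_PER_PARAM_LOW):.1%} to {utilization(BITS_PER_PARAM_HIGH):.1%}")
```

Under these assumptions, a model that takes roughly 400 MB to store would hold on the order of 45 to 49 MB of memorized information, which is the gap behind the "just store 32" remark. As the speakers note, a better memorizer is not automatically a better generalizer, so this number is a measurement rather than a target.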

Starting point is 01:00:03 Yeah, let's just talk about it real quick. Definitely not the algorithm itself. By the way, what are your tools for doing these kinds of charts and these kinds of diagrams? I'm just kind of curious about the behind-the-scenes on the tools. I think, like, visualization has definitely been a side hobby of mine during grad school. This one actually, Oscar, my co-author, made this one. Maybe I gave some, like, prompting, but he made it. I think most of the last few papers have all been in diagrams, Google diagrams.

Starting point is 01:00:33 I was using Figma for a while and Illustrator. I think Illustrator actually is the best tool. Oh, did you know the Transformers diagram was in Adobe Illustrator? Oh, yeah, yeah, I didn't know that actually, yeah, because that's the only way you can get arrows that sort of, like, curve like that. And they have good shadows. Diagrams is, like, the least robust, but it's the most accessible, and honestly, if you're good, you can make pretty good stuff. Excalidraw is nice too if it's not going in a paper. Yeah, it's just too

Starting point is 01:01:06 rough for a paper, but you need something professional looking, you know. It helps, like, if you're going to publish your work, you need to make it look nice and professional and, like, official, right? So this is what it is. Yeah, yeah, and I think there's something worthwhile about saying, like, okay, if I'm going to put my name behind this, I want to spend time making all the references perfect and all the diagrams professional. All the captions are correct. And I think it's important to put that level of detail in. That's a little chugie. But let's finish this off. So, okay, we're talking about bits, information theory, what information is sort of embeddings.

Starting point is 01:01:41 We were talking about language model capacity. I think a much more practical question, and maybe this is more analogous to the vector database hacking, embedding threat model we discussed, is, like, if you have access to a set of model weights, what can you learn about the data? So, like you were just mentioning, Gemma 3B came out yesterday, and you can download it, and it takes up a certain amount of space on disk,

Starting point is 01:02:07 and it was trained on some data, but we have no insight into what the data was. I mean, it's probably English. There's probably some distribution of web text. I guess there's a lot of code. And we seem to have a lot of information about the model, right? You have this file, and there's, like, many ones and zeros, which mean something,

Starting point is 01:02:23 but it's kind of like a very highly compressed version of the training data. But I would be extremely surprised if they do any type of private training. Like, there are these mechanisms for doing, like, differentially private language model training or even just anonymization in the pre-training pipeline. I bet they don't do any of that. They just sort of, like, train on the data, and then they kind of know that we don't have the right tools to decrypt the model weights. And so that's, like, my dream: that we can come up with some way of translating model weights back into text data sets.

Starting point is 01:02:57 And so the most recent kind of drop, paper drop, is that paper, Approximating Language Model Training Data from Weights. And it turns out to be a really hard problem. Like, trying to go from model weights to text is really hard. And we do something a lot simpler, which is, like, well, there's two ways we make it simpler. The first thing is we assume access to two checkpoints, which I think is probably not the case in Gemma. But in the case of DeepSeek, if you download the 400 billion parameter model weights,

Starting point is 01:03:31 it's this giant file, and you can actually get two of them. You can get the base model weights and the fine-tuned model weights. So the way we put this, you have this kind of difference in parameter space, telling you what DeepSeek fine-tuned on. And it's very controversial. I mean, sort of at the geopolitical level, and definitely at the corporation level, people are really interested in the implications of, like, what did DeepSeek train on.

Starting point is 01:03:57 And they've released this kind of treasure trove of information of what they trained on, which is the actual model weights, but we have no tool for, like, interpreting or kind of decrypting this weight difference. And so we started with something really simple, which is, instead of even just trying to, like, regenerate the training data, we take just a web corpus and try to do selection of training data that kind of, like, looks like the true training data and gives us performance that's as close as possible to the true training data. So there's this complicated method, but it's something like you just sort of, like, look at the

Starting point is 01:04:33 data point gradient and see if it points in the direction in weight space of the fine-tune, and then you take, like, the top data set. There's some tricks to it, but it's basically just, like, gradient-based selection based on this weight difference. And it seems to be okay. Like, it can get us pretty good training data. So I guess if you actually wanted to use this, it would be like your competitor releases a base model and a fine-tune, and you're trying to recreate their data set. So you can take this weight difference and take a giant web data set. Like, if I was doing this at a company, I'd probably try to scale it up to trillions of tokens and then select the exact data points that try to produce the model. And it turns out

Starting point is 01:05:14 you can train a pretty good model with that. We don't get to quite the performance of the original model, but it does seem to be like trending in that direction. This is like very creative. I don't know what the use of it exactly is. Yeah, like when would you be in this exact situation? Decently often for the open model labs. Even DeepSeek R1 has released an update,

Starting point is 01:05:39 Mistral does it pretty frequently, Llama does it frequently. It's not impossible, but, like, I think that I really like the creativity in using quote-unquote synthetic checkpoints to do this, which is, I don't think I've heard this from any other place. So I don't know if you came up with the idea. It's like linear interpolation in weight space.
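The gradient-based selection described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's method: it uses a linear least-squares model in place of a language model, scores each candidate by how well its negative loss gradient (taken at the base weights) aligns with the weight difference between fine-tuned and base checkpoints, and keeps the top-k. All function names are hypothetical, and the "tricks" mentioned in the conversation are omitted.

```python
import numpy as np


def per_example_gradients(theta, X, y):
    """Gradient of 0.5 * (x @ theta - y)**2 w.r.t. theta, one row per example."""
    residual = X @ theta - y          # shape (n,)
    return residual[:, None] * X      # shape (n, d)


def select_by_weight_difference(theta_base, theta_finetuned, X, y, k):
    """Rank candidate examples by alignment with the fine-tuning direction.

    A gradient-descent step on an example moves the weights along the negative
    gradient, so score = -(gradient . delta) is large when training on that
    example would push the base weights toward the fine-tuned weights.
    """
    delta = theta_finetuned - theta_base
    grads = per_example_gradients(theta_base, X, y)
    scores = -(grads @ delta)
    top_indices = np.argsort(-scores)[:k]  # highest scores first
    return top_indices, scores


if __name__ == "__main__":
    theta_base = np.zeros(2)
    theta_finetuned = np.array([1.0, 0.0])
    # Three hypothetical candidate examples from a "web corpus".
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0])
    top, scores = select_by_weight_difference(theta_base, theta_finetuned, X, y, k=1)
    print(top, scores)
```

In this toy setup the first candidate is selected because a descent step on it moves the weights in the same direction as the observed weight difference, while the third candidate would move them the opposite way.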

Starting point is 01:06:00 Okay. That's a bunch of the recent work. I wanted to sort of cap things off with the datasets question. Is that good? Well, you get to ask me whatever you want. Well, it's not like an ask. It's just, like, I think this is a very good thesis. I think it's a hot take.

Starting point is 01:06:15 I almost invited you to speak based on just this alone, but it was a little bit late to talk. Oh, for the conference. Yes. When I look for conference keynotes, I look for something that has a broad overview that can put the last few years in perspective, or an insight that you can reasonably rely on to, like, last for a while

Starting point is 01:06:35 so you can get some mileage out of it. You know, I think a lot of ideas in AI come and go. But things that are trends, things like scaling laws, things that are trend lines, things like "there are no new ideas in AI," those I pay attention to. So maybe you want to recap, like, what's the backstory, if there was one? Yeah, yeah, sure.

Starting point is 01:06:53 So the meta backstory is I've sort of started writing on Substack, and this is a post that I wrote a few months ago. The highest art form of humanity. Yeah, yeah. Yeah, publishing papers wasn't doing it for me anymore, and I moved to Substack, and this is the name of the post: There are no new ideas in AI, only new data sets.

Starting point is 01:07:16 One guy pledged me. But then I found out he was like my former student from a class I was teaching. So I don't think it really counts. It counts. He's a friend. He's your first supporter. A pledge is a pledge, man. I'll take whatever I can get.

Starting point is 01:07:30 So the underlying thesis is that whenever, maybe I'll lay out this framework first. So there's this book called The Structure of Scientific Revolutions by Thomas Kuhn that I read near the beginning of my PhD, which suggests that science kind of moves in these cycles where, not very often, there's something he calls a paradigm shift, which you could think of as a zero-to-one innovation where everything changes. And then it's followed by a rapid period of small innovations, a lot of, like, reapplication of previous techniques, pre-paradigm shift techniques, to the new era. And then things sort of slow down as we wait for a new paradigm shift. And I was kind of asking myself what's unique to the paradigm shifts that

Starting point is 01:08:17 we've seen in AI. And by the way, to me, AI and language models are somewhat synonymous at this point, like at least for the foreseeable future. I'm certain that will change. But basically everything that's pushed the boundary to whatever we have now that resembles like intelligence has come from language models. And so those breakthroughs came in a few steps. So I think the idea is also like a meta commentary

Starting point is 01:08:47 on the research community because what everyone wants as a researcher is some kind of like cute new method that no one has thought of before that just works on the existing data better than the previous methods. That's like for whatever reason like the kind of most glamorous

Starting point is 01:09:04 thing people think you can do as a researcher, like Mamba. It's like a transformer, but it's, like, more efficient and works better. So that's what a good idea looks like. And I think everyone wants to, like, find something like that. But if you look at what's actually borne out in practice, it's never been like that. I think, like, all of the things that I would consider paradigm shifts in the Kuhnian sense came from a new technique, but trained on new data. And I think the new data is super, super important. So I wrote it as a series of four paradigm shifts. The first is the emergence of

Starting point is 01:09:37 deep neural networks with AlexNet, which I think it was, like, 2010 to 2012 era, where we just started training on ImageNet, which is, like, a scale no one had ever seen before of millions of images. And then the second thing was Transformers and BERT and this Attention Is All You Need paper, 2017, the first GPT is 2018, which is web-scale pre-training. Like, no one had ever done that before. No one had ever tried to scrape all the text off the internet and then tokenize it and feed it into models. Like, it's a crazy idea. And I think, like, we should be honest. I mean, transformers are incredible and, like, their staying power is never going to cease to amaze me. They're, like, much more optimal than I think anyone ever knew. And I don't know if we'll ever beat them. But the real innovation is

Starting point is 01:10:26 web-scale pre-training. And I think, like, we honestly probably could have gotten this with RNNs. You know, like, the scaling laws paper shows that RNNs have worse curves for scaling, but probably people would have been, like, I bet you could have built ChatGPT with a very sophisticated RNN. Like, you didn't even need transformers. What you need is web-scale pre-training and the third innovation, which is instruction tuning. And we thought it came with, like, reinforcement learning, but I think the big innovation of instruction tuning is actually the human preference data, which is, like, gathering positive and negative pairs of what looks good, like, in terms of a chatbot interface. And actually, it turns out you can do

Starting point is 01:11:06 supervised learning on that, too. You can do DPO, which is a form of supervised learning. You don't even need the InstructGPT techniques. You just need the data. So, like, I'm sort of playing devil's advocate here, but I actually think this is true that, like, if we had the right data sets, we almost could have scaled, like, 2015-era techniques and gotten something that looks like, at least, InstructGPT. Reasoning models are a little different. Like, I'm not sure if we could have that with RNNs or not. I don't know if I'm in a position to comment on that with certainty, but they do fall into this framework,

Starting point is 01:11:40 which is they really did emerge from a new data source. In this case, it's something like a little different. It's like verification with symbolic systems like math, calculators, coding environments, unit tests, like things where we can provide numerical feedback to language model outputs. But we built a way to learn that and leverage it to get more intelligent systems.
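On the earlier DPO remark: the reason it can be described as a form of supervised learning is that its objective is an ordinary differentiable loss over (chosen, rejected) preference pairs, with no sampling loop or separate reward model. The sketch below shows the standard per-pair DPO loss under the assumption that the four sequence log-probabilities have already been computed by a policy model and a frozen reference model; the function name and the beta value are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Hypothetical log-probabilities for one preference pair.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen response, which is what gradient descent on the preference data does.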

Starting point is 01:12:03 And so whatever the fifth thing is, whether it's video or embodied AI or some kind of crazy innovation on reasoning models, whatever comes next will probably be some type of new data source that we're not using yet. That's a really good thesis. I would say that the researchers I talked to would somewhat disagree. Obviously, this is like a hot take type of thing. And like you already acknowledge that RNNs don't scale to the same extent. Like they operate on the slope of the curve, whereas, you know, I guess like the amount of data or the type of data or the core insight just changes the order of magnitude of the x-axis, right, that we are mostly working on.

Starting point is 01:12:44 But, like, both are important. The way that I think someone put it to me was an improvement on compute or data efficiency is the equivalent of having a whole bunch more data that, you know, otherwise would be a lot more expensive to collect. It's likely that the frontier models right now are just a collection of hundreds of these small little experiments that just stack up. You mentioned Muon

Starting point is 01:13:08 in your post, which seems to be the Adam killer. Curiously enough, like, still none of the big models use Muon, but, like, vibes are good. Yeah, yeah.

Starting point is 01:13:20 And the value of building better optimizers is really incredible. Like, it's just a free lunch. You can just sort of, like, plug in a slightly better training mechanism and then you save, like, a ton of compute and a ton of training time. That's, like, hugely

Starting point is 01:13:33 valuable. I think this is cool because I think, like, it puts us in a mode of, like, if you were ever to ask what comes after reasoning, it has to be something on the order of this. And most ideas are not. Most ideas are not. And so this is cool in the sense of, like, it just jolts you out of incremental thinking into, like, what really is missing for the next paradigm. And I don't have an answer. Do you have one? Do you have candidates? Oh, I really haven't even considered that too much. I guess, like, scaling, reasoning. You got to do the auto-complete. For step five. I mean, you got us all the way there and you're like, you know, you got to show us the way now. We can say it's an exercise left to the reader. But I mean, the reality is, like, predicting the future is too damn hard, you know,

Starting point is 01:14:18 like, maybe it'll be obvious to me in hindsight in five years, but sitting here today, I really can't derive from first principles what the next wave of innovation will come from. Yeah, I think we have a few years left. Like, each of these phases lasted for a few years. Reasoning just started last year, kind of. We got some juice on this one. Cool. I think that is a broad overview.

Starting point is 01:14:43 We went way over time, but, like, I really enjoyed this. I guess my parting question for you is kind of a meta one. So I'm not an academic. I'm, like, kind of self-taught. I just read a bunch of papers, and, like, I talk to people all day as part of the podcast. How do I rate in terms of, like, my questions? Like, could I pass as a grad student? Like, or, like, what's my distribution?

Starting point is 01:15:06 Like maybe I was maybe more industry oriented than academics. I think you got to realize that like the only person that's an expert in your area as a grad student is you. And even like eventually your advisor defers to you for a small set of questions that fall within your very niche expertise. So, like, I think you're clearly, like, a very good generalist and have, like, a huge amount of background on these topics. And to the point where I would say you're passing the grad student Turing test.

Starting point is 01:15:37 And I think if you went to a talk, like, people would just assume you have some weird research area of your own that they don't understand, you know? My research area is AI engineering. Like, I'm, like, making it up as I go. But, no, this is super helpful. Okay, well, that's about all we prepared. All the best in your search, all the best in your PhD. I assume, like, apparently the current PhD meta is you do a bunch of small papers,

Starting point is 01:16:02 you staple them together and, like, find an overall theme. You do a defense, and that's it. Like, that's the journey. Which is kind of cool. Like, I would love to do that. I'm too old to do it, but, like, it's cool. Yeah, yeah. It's a great thing to do at any age.

Starting point is 01:16:16 Well, it's better to do a substack, right? And then you have, like, people, like, subscribing and pledging along the way and like getting validation and like, yeah, that's, that's better than a PhD. Substack. That's the title of the episode, like, Substack, better than PhD. But no. Thanks for your time.

Starting point is 01:16:37 This is really great. Where can people find you? What are you looking for, really? I'm online. You know, you can follow my Substack and Twitter. I tweet pretty consistently. And you're putting papers out. I guess, like, the most meaningful thing, to be honest, is to engage with the research and send me an email.

Starting point is 01:16:55 If you really care, that's amazing. And, like, I love having those kinds of discussions. And you mean, like, what I'm looking for in a job or out of life? Your research direction, like, what interests you over anything else, like, if there's someone out there who has a problem and is looking for someone to help them on it, like, you are the guy for underscore? Oh, yeah. Hopefully, if you listened this long, like, I think, like, my

Starting point is 01:17:21 research is a lot more well connected than some people's PhD research, in that it all falls into, like, a very small manifold of, like, all possible problems. And so if you want to work on anything within that space, or that's sort of, like, adjacent to the problems that we discussed in terms of, like, language model, maybe not even language model, but model weight and activation information, I think anything that can be described as that is very interesting to me and I would love to talk.

Starting point is 01:17:54 Awesome. Well, we'll put your contact info in the show notes and thanks for your time. Thank you.
