Latent Space: The AI Engineer Podcast - Llama 2: The New Open LLM SOTA (ft. Nathan Lambert, Matt Bornstein, Anton Troynikov, Russell Kaplan, Whole Mars Catalog et al.)

Starting point is 00:00:00 Hello, hello, this is swyx, and I hope you like that new intro music that we have for the emergency pod. For those of you who are new, we do emergency pods whenever there is big enough breaking news in the AI landscape, because we try to be the first place that all you AI engineers hear about news that might affect your day-to-day work. So a couple months ago, in our No Moats emergency pod that we did around the Google "no moat" memo, we actually talked a little bit about the rumors that Zuck would be considering commercially releasing a version of Llama. And four days ago, it was rumored and leaked in the press. And today at 9 a.m., they released it. You're probably listening to this on the Wednesday, so a day later.

Starting point is 00:00:48 So our usual MO about this is that we try to gather some guests, and then we try to talk through day one reactions from AI engineers. And today was a little bit special because we got some really special guests. We had Nathan Lambert from Hugging Face. He works at Hugging Face as a machine learning researcher, and Hugging Face were launch partners of Meta's Llama 2, which meant that he had early access, and so Nathan actually just dropped his in-depth paper review and summary,

Starting point is 00:01:14 and he spent the most time with it, so we figured we should ask him the most questions because the rest of us were just reacting live to it. We also, which is a first for us, worked with Matt Bornstein of a16z, who were also, surprisingly, launch partners of Llama 2. They put up the first templates and the first playground, llama2.ai, which is super helpful for people trying this out for the first time. I also want to give a special shout out to my friend Rajko Radovanovic. He couldn't join us today, but he spent a lot of time prepping the examples and talking with me through how to test

Starting point is 00:01:48 Llama 2 to compare its quality to GPT-3.5. We also had Anton, the CTO of Chroma, join to talk about the impact of Llama on open source retrieval augmented generation. And then finally, Russell Kaplan from Scale AI on how to fine-tune Llama 2. So a very guest-packed episode. As always, it's a little bit awkward trying to be a moderator and a participant in a Twitter space because you're always trying to see who goes first. But we've done our best to clean up the audio and make it an enjoyable listening experience, or reading experience if you want to read on Substack.

Starting point is 00:02:20 So let us know what you think. We tried to cover all the major issues and predictions with Llama 2, and enjoy it. There's not a single dull day in this space. I think when we started the podcast in January, a lot of people asked us, how long can you really do this, just focusing on AI research and models? And I think the answer is clear now: a long time. So excited for this and excited to have Simon again. You're basically an honorary guest host of all of our Twitter spaces.

Starting point is 00:02:49 Cool. Thank you. No, it's great to be here again. And Nathan, thanks for joining us. I actually shared your write-up on Llama 2 technical details with Sean this morning. So it's great to have you here to dive into some of the details. Yeah, it sounds good. It's probably clear.

Starting point is 00:03:04 Hugging Face was trying to collaborate on releasing the model on the platform. So we ended up getting some early details, which made it a lot easier for me to cram study before the chaos hit. Oh, that's great. It's kind of what happened with the Code Interpreter episode, when Sean and I had access for about five hours and Simon was like, I've been playing with this for weeks, and had all the inside scoops. So I think this will be a good episode. Maybe Nathan, you just want to give people a little bit of background on what you do at Hugging Face and, yeah, your experience with the Llama 2 kind of preview. Yeah, so I've been a researcher and helping lead reinforcement learning from human feedback efforts at Hugging Face, which really means I do some research and I try to figure out how to fine-tune models to do what people want. Generally, we're trying to operate at a scale a little bit smaller than what Meta is doing

Starting point is 00:04:00 because we obviously don't have that kind of resources at a startup. So I do a lot of technical research and also try to actually engage and communicate that with the community. And specifically with Llama, I think I was most interested in kind of the research side. I think the paper is a phenomenal artifact, and it's clear that the model is really strong in a lot of areas, and then kind of the big picture trends of where open source is going. Like, this is a clear step in a direction that a lot of people wanted but weren't sure if it was going to happen. Yeah. What are some of the things that stood out to you? I think for a lot of the engineering audience that we have, they're not as deep into the details of the papers. We'd love to get a read from somebody like you who's much deeper at a, you know, model research level.

Starting point is 00:04:47 Yeah, it's like, where do I start? So I think as a general summary, the paper includes a lot of details on methodology. So what are the things that they did in their stack to actually build and run this? And it misses a lot of details on what the specific dataset actually looks like. It's clear that they have a really fine-tuned dataset and they paid a lot of money for these datasets. It seems like now both Surge and Scale, which are probably two of the biggest data labeling firms, are claiming some part in it, which I find hilarious because it's really unclear. So they kind of took the approach, Meta took the approach, of starting with open source preference data

Starting point is 00:05:30 and then added a lot onto it. And the most interesting part to me on this preference data, which is a new technical approach, is they trained two preference models, two reward models, one for making the model helpful and one for making the model safe. And then in terms of open source models, it's clearly more performant on kind of ground truth benchmarks, and then it's safer. That's where I was going to wrap up.
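As a rough illustration of the two-reward-model idea, here is a minimal sketch of how a helpfulness score and a safety score might be combined into a single reward. The gating rule, the `safety_threshold` value, and all names are assumptions for illustration; the conversation only says that two separate reward models were trained.

```python
def combined_reward(
    helpfulness: float,
    safety: float,
    prompt_is_risky: bool,
    safety_threshold: float = 0.15,  # hypothetical cutoff, not taken from the episode
) -> float:
    """Pick which reward model's score drives the update for one response.

    Assumed rule: if the prompt is flagged as risky, or the safety model
    scores the response below the threshold, safety takes priority.
    Otherwise the helpfulness score is used.
    """
    if prompt_is_risky or safety < safety_threshold:
        return safety
    return helpfulness


# A helpful but unsafe response is scored by the safety model.
print(combined_reward(helpfulness=0.9, safety=0.05, prompt_is_risky=False))
# A safe response to a benign prompt is scored by the helpfulness model.
print(combined_reward(helpfulness=0.9, safety=0.8, prompt_is_risky=False))
```

Keeping the two scores separate until the last step means a response cannot buy its way out of a low safety score by being very helpful, which is one plausible reason to train two models instead of one.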

Starting point is 00:05:57 To clarify, right, this is a big difference from the first Llama paper, because the first Llama paper was so detailed in terms of how the training data worked that people were able to essentially replicate it. And so you're saying that with this new paper, there's much less transparency as to how the training worked? On the data side, yeah. I think they did a lot of new methodological things.

Starting point is 00:06:20 So taking the time to explain that means it's not as much of a data-focused paper. There's no table that is like, this is the distribution of where the pre-training data came from. I would guess that it's a similar dataset to the original Llama. One of the details that's really interesting is that they mention they upweight high-factuality content. So things that probably seem like Wikipedia, it seems like they're doing some sort of upranking during base model training, but they did some type of thing they didn't detail. Because it's also worth mentioning.

Starting point is 00:06:54 I mean, they're being sued right now by Sarah Silverman, of all people. I mean, it's one of the many lawsuits flying around, but there are lawsuits specifically over the training data involved in the first Llama, because one of the things that went into that was this dataset called Books3. And Books3 is like 190,000 pirated e-books, like the full text of all of the Harry Potter novels, things like that, which, yeah, it's very difficult to say that that's not extremely copyrighted data. So I wonder if that's part of the reason they've been less transparent this time round, is that, you know, it got them in trouble last time. Now, one of my colleagues on kind of the ethics and society side immediately pointed out

Starting point is 00:07:31 that publicly available data is the phrase often used in the paper, but that does not mean that it's free from copyright issues and/or terms of service issues. It means that I could go on a computer and download it. Right. If you scrape the entire internet, very little of that stuff is actually, like, public domain. Yeah. And I think, without going down kind of the social issues rabbit hole right now, I think the notion of public is being extremely strained by AI and changing communication practices. And it's just like kind of those things where it's like, oh, okay, here we go. And they also use words like democratize, and they have these sentences in the paper that are extremely value-laden, which is like, the carbon footprint of our model and releasing this is good because it'll mean a lot of people don't have to train models and burn more CO2 in the future. And it's like, okay, Meta, like, what,

Starting point is 00:08:22 are you going with this? Yeah. Perhaps before we go too deep into the issues, because we have lots to talk about, I would also want to get a high level overview from Simon and from Matt, who's also just joined us from a16z. So maybe Simon, you want to go first with, like, just recap for everybody what you think the relevant details are about Llama 2. I mean, we'll talk about Mets and stuff. Yeah. So yeah, I mean, the headline here is that Llama 2 has been released and Meta kept their promise of doing a version of Llama that is usable for commercial purposes, which is so big because so much of the, like, Llama itself came out at the end of February,

Starting point is 00:09:02 and so many models have been released on top of that. So models like Vicuna, which was a fine-tuned Llama, all of them with the same "not usable for commercial purposes" restriction. So now we've got a really high quality foundation model that we are allowed to use commercially. I think the amount of innovation we're going to see over the next few weeks is just going to explode. You know, I feel like this is monumental on that front. In terms of quality, I never know how to interpret these benchmarks.

Starting point is 00:09:29 The benchmarks all look good. You know, the claims are it's a bit better than Llama. It's compared with GPT-3.5, etc., etc. I have no reason to disbelieve that. But it always takes quite a while with these new models to get a feel for them. You have to spend time with them to really feel like, is it trustworthy as a summariser, all of those kinds of things. My hunch is that it's going to turn out to be extremely good.

Starting point is 00:09:51 I doubt that it'll turn out to be a sort of a damp squib on that front. But yes, so they've released it. It's available commercially, and you are allowed to redistribute it. But the only way to officially get the weights is to fill in a form on their website and wait for them to approve you still, which is kind of stupid because obviously it's already started leaking. I downloaded a version onto my laptop this afternoon, which worked. There's a GGML build from TheBloke that's floating around on Hugging Face already.

Starting point is 00:10:21 So within 24 to 48 hours, I think every possible version of this thing will be available to download without going through a waiting list. I'm almost not sure why they even bother with that, especially since, you know, Llama leaked within a few days last time, and somebody ended up submitting a pull request to the GitHub README with a link to the BitTorrent for the Llama models, which Facebook didn't delete. You know, they kind of nodded and winked and said, yeah, this is what you can do. And now it's even legitimately okay to do it because the license says you can.

Starting point is 00:10:52 But anyway, it's out there. You can run it on your computer right now, today. It's also hosted in a bunch of places. Yeah, Andreessen Horowitz sponsored the version of it that's available on Replicate. Although you actually do have to pay for that. I noticed that I've built up 26 cents in Replicate charges already playing around with that model. But it's available via API, or you can run it on your own machine. And, you know, it's open season.

Starting point is 00:11:17 Let's all start poking around with it and seeing what it can do. It's open season. Speaking of Andreessen, yes, Matt, hey. Hey, hey, everyone. Thank you for having me. And Simon, if you want to send me a Venmo request for 26 cents, I'll happily reimburse you. Absolutely, yeah. We may lose about $3 on the transaction

Starting point is 00:11:36 fee, but I think it'd be worth it. Just throw a term sheet in there for Datasette and you're good. No, I'm a huge Datasette fan, and we've followed Simon's work for quite a while, and Nathan, it's great to have a chance to share a stage with you. I think folks probably saw we released a bunch of sort of, you know, VC versions of evaluations. You know, we're way less smart than, you know, Nathan and Simon and a bunch of folks in the space here. But using just sort of the "does it feel good" approach and trying to get a fairly representative sample across different types of prompts, the model seems very good. We were playing a lot with 13B, and we're playing now with 70B, and it really does give you kind of very fast GPT-3.5 level responses to some questions.

Starting point is 00:12:23 I think Simon's point about benchmarks is very well taken. It's hard to know how to interpret those, so we sort of go for the direct version. And for creative tasks, you know, especially, it seems very good so far. So a lot of what we're doing is just trying to get it out there as much as possible and as fast as possible. You know, I think we should all be incredibly, you know, appreciative that Meta is doing this. And it's not, you know, maybe quite perfect, you know, for some of the reasons that folks are talking about. But, you know, I think it's going to be a huge unlock in open source LLMs. And we're trying to, you know, just sort of support the community as much as possible.

Starting point is 00:12:58 Yeah, I have to say you guys are doing a bang-up job recently. This is a big team effort, right? Like, I see that there's a number of names from your team just essentially building projects and then collaborating on this demo. Like, maybe could you describe, like, what is Andreessen's sort of involvement so far? And like, yeah, what is the scope of this? Yeah, you know, we all applied for, you know, L3 engineer jobs and got turned down by all the big tech firms.

Starting point is 00:13:27 So we thought, hey, you know, we'll just do it ourselves. Look, I think, and this might be a little controversial, your average venture capitalist doesn't do any real work. And I completely include myself in this category. You know, allocating resources to support teams is important. It's an important function in the economy, but it's what you might call indirect work, which is you're supporting someone else doing something. You know, we just sort of made the decision when we really saw AI starting to take off that we should start doing real work too. And it's really just about supporting the ecosystem, especially around open source. Like Simon, we're massive believers that the innovation you see in open source is really going to be a big unlock

Starting point is 00:14:11 for AI-based applications. Right. Not everybody can just use the OpenAI API, as good as it is. And not everybody can train a model from scratch, right? Not everybody, you know, is Noam Shazeer or someone like that. So we think it's a really huge unlock. And again, we're just trying to support as much as possible. So today we, you know, we released a playground to play around with Llama 2. We got it up on Replicate so people can just sort of try it with an API call and try integrating it into their apps. We released an AI starter kit over the last couple of weeks, which people are actually using. We were shocked. We're a little nervous because our code, you know, may or may not be production ready, but you'll see more and more of this from us

Starting point is 00:14:50 over time. Yeah, I've seen your companion chatbot, and I have to say it's actually pretty impressive. It's got all the latest features, especially in terms of streaming and LangChain and all the other stuff. So kudos to your team on that. Just to round out the overviews or the high level takes before we go into individual details, Alessio has been compiling the show notes, which we are going to publish when this podcast goes live on Latent Space. So maybe you want to go over some of the notes that you've been taking, then I'll go over to Alex. Yeah, we've got a lot of stuff to run through here. I think, like, the most interesting things that I read from the paper: one, there's an abandoned model size. So the 7 billion, 13 billion and 70 billion made it to release.

Starting point is 00:15:33 But there's a 34 billion size that didn't make it. And in the safety chart, you can actually see it's like twice as unsafe, quote unquote, and they decided not to publish it because of lack of time to red team it. So I don't know if anybody had a chance to try the 34B before the release, but I would love to learn more about that. Outside of that, yeah, as Simon and Nathan were talking about, the data piece is a lot more obscure. So Llama 1 was 67% Common Crawl, plus C4, a bunch of GitHub, Wikipedia, books, as we mentioned. We don't have any information about Llama 2, but they did mention they have a 40% larger pre-training corpus.

Starting point is 00:16:14 So they've obviously been investing a lot in that. Also, yeah, the supervised fine-tuning was very interesting. I saw a tweet where somebody asked Llama 2 how to kill a process. And Llama 2 was like, you can't kill things. And I was like, it's just a process. It's not a person. So I think in some places, it might have gone too far with the RLHF. But that's another interesting side, right?

Starting point is 00:16:38 Like if this is the starting point and, like, the de facto standard for open source models, are we okay with, you know, not being able to ask how to kill a Linux process? But I'm not sure about that yet. I ran into that myself. I asked it to give me all of the animal emoji and it said that that would be disrespectful if it attempted to do that, which was kind of interesting.

Starting point is 00:17:01 Exactly. So that's an open question on, you know, it's the age-old safety question. It's like, how much do we need to do before we release these models to the public versus what should the public decide? The other thing is, like, they should have let these GPUs burn for more.

Starting point is 00:17:16 Like if you look at the loss graphs, these models are not saturated. I guess, like, they spent a lot of money to try and train these, but it seems like there's a lot of work left to do there. We just did a Datasets 101 episode that we released yesterday, which is already old news because now Llama 2 is out and this is all the rage. But we talked about some of the scaling laws, and we thought the 200x was like the new Llama ratio.

Starting point is 00:17:41 But I think this one is 275x, Sean? I think you did that. Yeah, so that's 2 trillion tokens for a 7B model, and that's up from 1.2 last time. So they've definitely ramped up the amount of data, and they just refuse to tell us any of it. Because, you know, guess what happened last time? You know, they published the data info.
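The tokens-to-parameters arithmetic being quoted here is easy to check. Below is a small sketch that computes the ratio for the three released sizes, assuming every size saw the same 2 trillion tokens mentioned above; the nominal parameter counts (7e9, 13e9, 70e9) and the helper name are assumptions for illustration.

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


TRAINING_TOKENS = 2e12  # "2 trillion tokens", as quoted in the conversation

# Nominal parameter counts implied by the model names (assumed round numbers).
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

ratios = {
    name: tokens_per_param(TRAINING_TOKENS, params)
    for name, params in MODEL_SIZES.items()
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} tokens per parameter")
```

With these round numbers the 7B model comes out at roughly 286 tokens per parameter, in the same ballpark as the 275x quoted on air, while the 70B model lands near 29.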

Starting point is 00:18:02 RedPajama went and cloned, you know, line for line exactly what was in the Llama paper. So, you know, that created, you know, the RedPajama model and then OpenLLaMA as well. So it says that the context length is up from the first Llama. Do we know what the new context length is? I think it's 4K. 4K. Is that likely to be higher for the 70B model, or are they all the same context length?

Starting point is 00:18:27 I believe they're all the same, and we have tested it a little bit. And my intuition is that you can actually get more effective performance, more accuracy, out of 4K, rather than scaling up the way, say, OpenAI have to 32K or higher. I think it's just hard to find high-quality training data. So when users actually start to submit longer inputs, performance kind of breaks down. And I'm not talking about OpenAI specifically, but in general.

Starting point is 00:18:54 And that's my intuition on why Meta is keeping it relatively small for these models. I'm kind of hoping that, now that it's open source, somebody finds some clever trick to increase that. I've been playing with Claude 100K a lot recently. And it's pretty phenomenal what you can do once you've got that extra context length. There is actually a trick. It's called RoPE scaling. We've seen that with a two-line change you can make Llama forget about the context it was trained on, and there was back and forth about how effective this is and whether or not it suffers from the same dip, you know, in the middle of the context.
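To make the RoPE scaling idea concrete, here is a minimal pure-Python sketch of one variant, linear position interpolation: positions are multiplied by `trained_length / target_length` before the rotary angles are computed, so a longer sequence is squeezed into the position range the model saw in training. The function names, the 8192 target length, and the toy 4-dimensional vector are illustrative assumptions, not the code from the change discussed here.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at one (scaled) position."""
    scaled_position = position * scale  # scale < 1.0 interpolates positions
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of vec by the rotary angles for this position."""
    angles = rope_angles(position, len(vec), base, scale)
    rotated = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


TRAINED_CONTEXT = 4096  # context length quoted in the conversation
TARGET_CONTEXT = 8192   # hypothetical longer context we want to reach
scale = TRAINED_CONTEXT / TARGET_CONTEXT

# The last token of the longer sequence now maps inside the trained range.
last_position = TARGET_CONTEXT - 1
print(last_position * scale)

query = [1.0, 2.0, 3.0, 4.0]
print(apply_rope(query, last_position, scale=scale))
```

The "two-line change" framing fits this sketch: the only difference from ordinary rotary embeddings is the `scale` factor applied to the position. Whether quality holds up at the longer length, with or without further fine-tuning, is exactly the open question raised in the discussion.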

Starting point is 00:19:28 But this RoPE scaling trick was then verified by folks from, I think, Microsoft, independently from that guy, kaiokendev, and I see some folks in the audience here who were participating in this. So apparently this applies to the previous Llama and would likely apply to this next one as well. That's pretty exciting. I can't wait. This is the thing I'm looking forward to: now that it's open source, all of these experiments are just going to start happening at such a fast rate. This happened with Llama before, you know, once you let every researcher in the world download and start tinkering with your model,

Starting point is 00:20:02 people start finding optimizations and new tricks at a crazy rate. It's going to be really interesting. So I think the interesting piece here is to see whether or not the commercial license will unlock even more, or whether the researchers didn't care and kind of threw the kitchen sink of everything they wanted to hack together at the previous Llama. I'm thinking, because it's open source commercially now, companies will actually start, you know, doubling down because they'll be able to then use the fruits of their labor for commercial purposes. So we'll likely see more. I think you guys used the magic word, which is open source, and everybody has a different

Starting point is 00:20:39 definition. And I know we had Tom Warren in the audience who asked the question about this. So Tom, I'm going to invite you up to speak if you're around. Yeah, I'm going to say, I call it openly licensed, not open source, because I feel like open source has a definition that doesn't quite apply here. Yeah, yeah, exactly. If you go actually on my website, I wrote like a 10,000-word thing on, like, the history of open source licensing. And there's things that are open source, and things that are somewhat open source in traditional infra, like the Server Side Public License. Some of these are things that, like, Elastic and Mongo came up with to avoid the AWS "API compatible", in quotes, products that were literally just the same thing.

Starting point is 00:21:20 So yeah, it's really curious also that the breakpoint for the Llama license is 700 million monthly active users, which is a lot of users, obviously, but there's some notable people that go over. So Snapchat is one company that is obviously a close competitor to Meta. TikTok isn't there. YouTube by far exceeds that. Yeah, it's worth noting, but that's actually not a rule going forward. It's that as of the date of the release, if you have 700 million monthly active users, you have to get an extra license from Meta.

Starting point is 00:21:53 If you manage to achieve 700 million monthly active users next week, you could still use it. Like, it's that point in time that matters. At that point, you should just name people. But yeah, just to close the loop on this open source element, there's one other piece about the open source or the usage policy, which is you can't use it to train any other model. Thou shalt not have any other models before Llama.

Starting point is 00:22:16 Llama is the only model that you can fine-tune with Llama data. I think it's more than that. This is them protecting against distilling the model, right? The thing that everyone's been doing, like Vicuna was trained on ChatGPT data, despite OpenAI having a thing in their terms that says you can't train a competing model. I'm really frustrated by this because the language says

Starting point is 00:22:37 you cannot train a competing large language model, but what does that even mean? Who gets to decide what a large language model is? If in six months' time we invent a new architecture, is that still an LLM that's covered under those terms? It's frustratingly vague. Yeah, these clauses are kind of bogus. We talk about them a lot at Hugging Face, and it seems also, from a legal perspective, the things that they're grounded in, like terms of service, are being walked back in kind of this digital domain. And then also it's just, like, unclear what is actually using the language model. So all these things where people use

Starting point is 00:23:10 language models as a judge, or you can just generate a bunch of interesting prompts to then modify them. It's so ridiculous to even think of trying to enforce these clauses. It's surprising to see it show up. And you have to note, like, in the Llama 2 paper itself, they also use other companies' models to do their evaluations, right? So, you know, a strict reading of those clauses would not allow them to do that. Nathan, actually, a quick follow-up.

Starting point is 00:23:38 Hugging Face has its own license, the RAIL license. I think there was some iteration following the Stable Diffusion release. Would that be appropriate for something like Llama 2? Yeah, I think it's good. I don't have a hundred percent knowledge of RAIL. My understanding is that, generally, the goal is to be, like, commercially available with good intention, and then it starts to try to give leverage for people to come after bad actors using their models. I think the commercial use of this is going to be off the charts very soon. At Hugging Face, a lot of the monetization efforts

Starting point is 00:24:15 are around trying to enable commercial use of open source language models. And the license questions have been a constant discussion for the last six months from things they're trying to build and from customers. So this is definitely going to be used. Yeah. OK, so we have a lot of insightful people here.

Starting point is 00:24:36 I feel like the best way to organize this space is maybe to just kind of try to stick to as many sort of factual elements as we can. I feel like, Nathan, since you've done the most work, you've had the most time with the paper, to be honest. Maybe sort of pick one other sort of element of the paper that you find worth discussing, and we can kind of go into that. Maybe that's sort of the pre-training base model stuff. I don't think there's a lot on the pre-training.

Starting point is 00:25:01 There's definitely an important thing that makes it able to be used, which is they use, like, what is it, GQA? It's like grouped-query attention, which makes inference on the bigger models faster. I think there's kind of an asterisk that is interesting on that: code and math and reasoning seem pretty not emphasized in the paper, and that's kind of, like, what the market is for, that's what ChatGPT is used for by a lot of people on this call. I think at a technical level, the RLHF details are the most fleshed out that we have seen

Starting point is 00:25:34 and kind of confirm a lot of the capabilities we've seen insinuated by Anthropic and OpenAI. So that was like kind of a relief for me as someone that's trying to be like, I still think this really works, and they drop this paper. It's like, we really like this, which was not guaranteed. I have one pre-training question. This is for you, Nathan, or for the whole group. Like, we talked about it before.

Starting point is 00:25:56 The amount of pre-training data here goes far beyond Chinchilla-optimal. And the loss curves were still going down when they cut it off. Like, are we ready to say that Chinchilla-optimal is just not optimal anymore? Oh. I'm ready. I never really cared about it. I think data quality is changing that completely.

Starting point is 00:26:20 It's like, I think when Chinchilla came out, data quality standards were so different, and given what the practices are now, it's like, what does it mean? It was a really big deal at the time though, right? I mean, it was kind of this breathtaking result that if you just ramp up training data much higher than you thought or people had been doing, you just kept getting

Starting point is 00:26:42 better performance. Maybe, Nathan, since you're, you know, the most knowledgeable in this space, like, can you just, like, give us a little intuition? Like, when you say better data quality, like, what exactly is happening under the hood that makes this possible now? Oh, they're removing. Okay. Think about all the tweets and texts that everyone sends, and we have these weird insider jokes and phrasings that we do. They make no sense if you read them, and your language model, like, half reproduces them. So, like, I'll say, like, "you got got" or something that is just very confusing from, like, a token prediction standpoint.

Starting point is 00:27:20 And then also a ton of just errors. It's like, I write a blog post. I used to not take it as seriously. I've, like, published a blog with a half-finished sentence in it. It's like they would just scrape that and take it. But trying to actually get data that is complete, is consistent, is just extremely hard. I think technical terms are like deduplication, so you don't want to pass the model the same text, even if it came from different websites, and there's tons more that

Starting point is 00:27:50 goes into this. I don't think it's the area of my most expertise, but I think it's actually pretty simple. You just want to put good text into the model, and understanding what good text is on the internet is really hard. So you're sort of saying the reason people were not using enough data initially is because they just weren't good enough at cleaning it. And now that those methods have advanced so much, we're removing duplicates better, we can measure quality better, all of that. Like, do you think we're going to keep going up,

Starting point is 00:28:16 I guess, is the question. Like, this, you know, they trained a 7B model on two trillion tokens. Like, do you think that's like the max, or are we going to keep going? I kind of like, I think the intuition on, like, what you're saying is how getting more, higher quality data is

Starting point is 00:28:31 making it so using more works better. Like, that's what everyone in my circles is saying is the trend, and given machine learning in the last few years, I think trends tend to be stickier than most people expect them to be. So I would expect it to keep going. I just kind of trust the process to continue for a lot of stuff like this. Yeah, so on our podcast, we've been asking everyone that we can possibly ask about, you know, we went from a 2x tokens-to-params ratio with Kaplan and then 20x with Chinchilla, now 200x with Llama, like, someone's going to try 2,000, right? We did have a response today from one of our previous guests, Varun of Codeium, who said that they did try a 1,000-to-1 tokens-to-params ratio, and it had definitely gone into the range of overfitting. So your loss can continue to go down, but you're not sort of measuring overfitting in some of that respect.

Starting point is 00:29:21 So it's very unclear. I would say, though, I do have visual source. Like, it's not that Chinchilla was wrong. Chinchilla was optimizing for a particular set of assumptions, particularly the pre-training compute budget, right? Compute-optimal sort of scaling laws. And if you look at the Llama paper, right on the first page, I have it open right in front of me.

Starting point is 00:29:41 They actually criticize that and say, like, you know, this disregards the inference budget, which is critical when you're actually serving the model instead of just optimizing for a pre-training compute objective. And as things move from research into production, inference starts to become more of a concern. Resource constraints start becoming more of a concern. And so I think it's

Starting point is 00:30:01 actually quite reasonable to move on from Chinchilla, which is a very important result, and say that we are exploring very different objectives as compared to, you know, more than a year ago when Chinchilla was published. Yeah, I agree. I was just going to say that I feel like the loss going down, like all of these, reading the paper, it feels like this is a checkpoint of a much longer term project. They, like, readily list off things that they didn't get to, but they want to continue, and like capabilities or something. Some of the methods seem like kind of hacks to make things work that they didn't know,

Starting point is 00:30:37 didn't get to work. Like, Anthropic came up with context distillation, which is a way of getting the behavior of a really long system prompt into a shorter prompt, essentially. And they did something like this in this paper to get the model to behave like characters for longer conversation turns.

Starting point is 00:30:56 And, like, there's all sorts of little things that I just think Meta is going to continue with. So that's kind of fascinating, because that implies that the actual story here, it's the AI arms race, right? It's Zuckerberg saying, no, we need to get something out right now,

Starting point is 00:31:12 get it to a point where it's good enough and safe enough, and then let's ship it. And it's not so much that they didn't necessarily have time to get to the sort of perfect point that they wanted to get to. Yeah, that is the, I have asked people about this offline. And so I was like,

Starting point is 00:31:26 okay, so why don't people throw a lot more compute at this? Like, you know, as long as you have a state-of-the-art model, you should just ship it and get credit, and then wait a few months and then get the next version out. That way you have a lot more shots on goal. That totally makes sense, yeah. And I was like, oh, okay, like we are in such early stages that honestly, I mean, they spent three million GPU hours on this thing. They could spend 30 million and, like, obviously it would be way better. Like, we're in such early stages that even these relatively simple, like, don't forget, Llama 1 was published in February of this year. We're in such an easy cycle where it's still within, you know, the order of months

Starting point is 00:32:05 to make and improve one of these things, that it's not too terrible. I do, I guess I should also mention a shout out that not every person who worked on Llama 2 is on the paper. Guillaume Lample, who's one of the co-founders of Mistral AI, the French startup that raised like a $100 million seed round, apparently worked on Llama 2, and they left him out, and they left his team out, because they left Meta before this paper was published. So, interesting, there's some intrigue there, if anyone wants to go through that. Come for the Llama, stay for the drama. Oh.

Starting point is 00:32:38 It's hard to read, you know, into, like, as you know, especially when it comes to work that then goes open source. Who did the work, who didn't, I don't know. Since nobody here worked at Meta, I would rather not go down that path. Yeah, I'll just leave a bookmark there. Okay, yeah, but exactly. We're not in the room there.

Starting point is 00:32:56 I, for one, am shocked to hear that there may be drama among researchers. I've never heard of that happening before. I'm struggling after three organizational restructures of researchers playing hopscotch from one org to another and being in between jobs. All right. Alex, you have your hand up, and then I wanted to dig more on the preference data that Nathan mentioned. Hey, guys, just to introduce myself real quick, I'm Alex. I participate in the spaces,

Starting point is 00:33:25 and my angle, and the way I, quote unquote, vibe check models, is via languages. And to me, it was really surprising that they released kind of the second iteration, while also knowing how much Meta actually does for translation. They have the very famous NLLB models, No Language Left Behind. They released the models that you can speak to in, like, a thousand languages, that understand them. And for some reason, their open source models are not very strong multilingually. So we've seen this with GPT-4, which was way better at multilingual speech.

Starting point is 00:33:59 Claude highlighted this point with Claude 2, that it is way better at the BLEU score, I think, for languages. And my go-to vibe check with these models, especially the open-source ones, is the ability to translate, the ability to understand the languages. I've tried with Hebrew a little bit, and I was not very impressed. Now, obviously, fine-tuning will come, and obviously people will fine-tune these models towards different outcomes, but it's very interesting, considering how much Meta does elsewhere for languages and to bring the world together, how much this model did not focus on this specific kind of issue.

Starting point is 00:34:36 And the second thing is also code. I know you guys talked about HumanEval. That's fairly low in terms of the score out of the box. And obviously, fine-tuning will make it better, but a fairly disappointing score on HumanEval, right? Fairly low coding abilities. And we've seen previously that there's some assumption that training on more code in your data set actually gives you better kind of logic and reasoning ability. So kind of surprised that that was fairly low. That's me chiming in with these two examples about Llama. Yeah, I would say on the HumanEval piece, don't count it out just yet. So I've had some DMs with Quinn Slack of Sourcegraph, and he is, you know, very actively building Cody, their coding assistant bot.

Starting point is 00:35:21 And it's well known that HumanEval is not a very good or reflective measure of how we use coding chatbots. And so HumanEval is probably overrepresented in terms of being effectively the sole benchmark by which we value code models. We just need new benchmarks for code. I do think it's possible better instruction tuning will improve code performance of the Llama 2 models as well, because their reasoning capabilities are actually relatively good, not perfect, but relatively good, which makes me think there may be more code in the pre-training

Starting point is 00:35:54 than it seems. Well, it's difficult to know. We'll see. We'll see. I mean, this is the thing that's so infuriating about these opaque models that don't talk about their training data: as users of the models, we need to know. We need to know how much, like if it's had code in it, all of those kinds of things, in order to make decisions about what we're going to use it for. So I kind of feel like, you know, the secrecy around these models really hurts me as a consumer of these models, just from a practical point of view of being able to make good judgments

Starting point is 00:36:22 about what the model is going to be able to do. I do think that's true, Simon. You know, I want to make just one defense of Meta, which is, like, this is pretty amazing what they've released and they've, you know, given to the world. Obviously, it may benefit them commercially as well, but, you know, it actually carries pretty substantial risks for them. And I think it's kind of a courageous act to release. And, you know, so it's the things like the training data safety that, like, really, you know, when you're Meta and you have billions of active users, like, you actually are taking a pretty big risk with these things and, you know,

Starting point is 00:36:58 regulatory bodies have their sights on you. So I do think you're right. I just, you know, for what it's worth, I agree with, you know, I agree that it's actually a positive thing. I agree with everything you say. But at the same time, right now, I've got a whole bunch of models that I'm trying to choose between. And I don't have the information I need to make the decision.

Starting point is 00:37:17 I feel like at some point it's going to be a competitive advantage to put out a model with transparency over what went into the data, because people will be able to use that model more effectively. But yeah, I completely understand these strategic challenges. I'm astonished that Meta went ahead with this release. I never thought they'd take the risk of releasing something like this, and someone uses it for something bad, and now they're on the front page of all the papers for it. So yeah, I'm super excited about it on that front. I want to add on from the perspective of releasing something as open source as they did. Previously, we didn't have a commercial license. Obviously now, the big thing is we have commercial licensing, but the amount of people,

Starting point is 00:37:55 I don't know if you guys noticed, but like the amount of people who signed, quote-unquote, in support of releasing these models, Paul Graham and Marc Andreessen and like a bunch of other folks. Like, in addition to the model, they also released kind of a counterweight to the moratorium papers and all the AI safety stuff, because there was an FTC probe, right? There was like some regulatory stuff talking about the previous releases of Llama from a long time ago. And now not only did they release the, quote-unquote, open source, unless it doesn't kick me off here. Not fully open source,

Starting point is 00:38:26 but definitely we're able to use this commercially. But they also released kind of a statement, signed by industry leaders, that open source is needed. And I think that gives a very strong counterweight to the doomerism and the keep-it-closed, don't-release kind of thing we saw. And it's very interesting that it comes from Meta specifically.

Starting point is 00:38:44 So in addition to the courageousness that they showed, it looks like they're also kind of leading the industry of, like, this is how to do fully commercial, again, quote-unquote open source, not an open source license, but this is how to release models in a safe way. So definitely joining the courage and the applause for Meta and the team. Yeah, I just don't think that, like, we're not the customers of Meta with respect to this model. I think they're trying to build these for their own purposes, and then they have very strong, like, I think it's kind of the principles of, like, transparency and

Starting point is 00:39:17 research that these organizations at Meta have stood by. And I think that's like the newest representation of it. More than, like, I don't think they're trying to make money off releasing this in any way. Like, there is an ecosystem perspective of, like, where AI content proliferates. There's more creativity for their users and that enables social media and things. But I think we're still pretty far from that. And it's more of, like, a values and internal research and development tool for themselves. Like, is there a way for them to make money directly off of this?

Starting point is 00:39:46 NPCs in the Metaverse? But I mean, I don't know. Well, so we last hosted one of these emergency pods, I think maybe two pods ago, which is, I think, in May,

Starting point is 00:39:58 where we did ours when the No Moats memo came out from Google. And we actually talked a little bit about what an ecosystem around the language model looks like, when you have stackable LoRAs, customizations and fine-tunes that are based on top of

Starting point is 00:40:14 an existing base model that is well known. I think that might be part of the strategy there. Facebook is also well known for releasing, I guess, PyTorch and React, and those are very well, they don't make money from that directly, but they definitely do benefit from the ecosystem that has sprung up around it, that essentially represents a lot of free development from the open source community. I think there's a lot to be said for the fact that Meta AI are at the very heart of openly licensed language model research,

Starting point is 00:40:43 and that's because of Llama. You know, Llama came out and it kicked off this immense tidal wave of interest and of activity, with Meta AI right at the very center of that. And in the world that we live in right now, being at the very center of all of the research and innovation happening around language models feels like a really valuable place to be. Yeah, it really is. And maybe we can go a little bit to Matt again. One thing I wanted to get your thoughts on, you know, I don't know how long you have

Starting point is 00:41:07 with us, but is the impact on the startup ecosystem, right? Like, how big of an enabler is this, or does this, I guess, just commoditize everything to a point where, you know, everyone's just wrappers? I think it's a really, really massive deal. You know, we've met with conservatively hundreds of AI startups now. Maybe thousands. We'd have to go back and look. And I sort of alluded to this before, but the really big dilemma is, do I train my own model, or do I just use something off the shelf? And we're increasingly seeing that the answer for almost

Starting point is 00:41:47 everybody is kind of a hybrid approach. We're seeing an increasing number of startups basically triage their AI workloads, where if things require really high levels of accuracy and, you know, human-like text generation, GPT-4 is the only answer. But many queries or workloads actually don't require that, right? You can kind of scale down and say, you know, for a really simple query, I can use, you know, an open source model off the shelf, or something in the middle. I can fine-tune for various tasks. And then you can get pretty sophisticated about what you route where. All of that is only possible if we have commercially usable, really high quality language

Starting point is 00:42:31 models, and especially ones that have been efficiently trained, such that latency is low and cost is relatively low. So I think what we're going to see happen is there's going to be a big push for startups to use Llama 2 models and other open source models that have similar levels of performance, and fine-tune them in ways that actually work for specific tasks, right? Not for specific data. Like, I think that was sort of a head fake, but for specific tasks, and really be able to build more defensible businesses that way. You know, there's nothing wrong with using OpenAI. That's fantastic, but it's probably not good to make that 100% of your business, and a lot of founders are doing that now.
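
The triage idea described here, sending each workload to the cheapest model that is good enough for it, can be sketched as a small router. This is only an illustration under stated assumptions: the tier names, the `needs_high_accuracy` flag, and the set of fine-tuned tasks are all hypothetical, and a real system would decide them from evals rather than hard-coded rules.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of tasks for which a fine-tuned open model exists.
FINE_TUNED_TASKS = {"summarize", "classify"}


@dataclass
class Query:
    text: str
    needs_high_accuracy: bool = False   # assumed to be set by the caller
    task: Optional[str] = None          # e.g. "summarize", or None if unknown


def route(query: Query) -> str:
    """Pick a model tier for a query; returns a made-up tier label."""
    if query.needs_high_accuracy:
        # Hardest workloads go to the strongest (and priciest) hosted model.
        return "frontier-api"
    if query.task in FINE_TUNED_TASKS:
        # Known, narrow tasks go to an open model fine-tuned for that task.
        return f"fine-tuned-open-model:{query.task}"
    # Everything else goes to an off-the-shelf open model.
    return "open-model-off-the-shelf"


if __name__ == "__main__":
    print(route(Query("Draft a contract clause", needs_high_accuracy=True)))
    print(route(Query("Summarize this ticket", task="summarize")))
    print(route(Query("Say hello")))
```

The ordering of the checks encodes the tradeoff from the discussion: accuracy requirements win first, then task-specific fine-tunes, then the cheapest general option.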

Starting point is 00:43:11 So that's why I think this is such a huge deal. And, you know, the progress just today has been amazing. Like, there's going to be, by the end of today, a number of hosts where you can just easily use the Llama 2 models, like right out of the box. You know, Replicate is one that we work with, but there are others as well. You know, you can already run it on your local computer

Starting point is 00:43:34 with two-bit precision, which is kind of crazy if you stop and think about that for a second, that with two bits, you can actually run a super advanced language model on your own computer. So I just think this is a huge, huge deal for startups. And I think if you're a startup founder working in AI, you know, you really should be taking a look at open source models now and seeing how they can be used to kind of deepen your moat and, you know, build a really great AI product. Right.
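
To see why two-bit precision makes local inference plausible, here is a rough memory estimate. It is a sketch under simplifying assumptions: it counts only the weights at a flat number of bits each, and ignores quantization scales, the KV cache, activations, and any layers kept at higher precision, so real files and runtime usage will be larger. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        fp16 = weight_memory_gb(n_params, 16)
        q2 = weight_memory_gb(n_params, 2)
        print(f"{name}: about {fp16:.1f} GB at 16-bit, {q2:.2f} GB at 2-bit")
```

Under these assumptions a nominal 7B model drops from roughly 14 GB of weights at 16-bit to under 2 GB at 2-bit, which is what brings it within reach of an ordinary laptop.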

Starting point is 00:44:03 So I would like to help fill in the blank. So apart from Replicate, it looks like Hugging Face has also launched an inference endpoint for that. And as far as I know, it's one of the only few ways to try the 70B model off the shelf. I think Baseten has also maybe put something up. And then for the two-bit quantized model, you can look at the GGML ecosystem. Yeah. And then I also wanted to recognize one of the other respondents in our chat. We have a little comment window here. Gardochi was responding, I think, to Simon. And I did actually have a pushback, right? We don't have to know the full data sets of Llama as long as we are able to eval for everything that we want to know about.

Starting point is 00:44:42 I think we actually have to live with AI becoming more and more of a black box, even though the weights are open. I mean, for me, it comes down to model competition. If I have two equally capable models, and for one of them I know what's in it and for the other I don't, then I'm going to use the more transparent one. And I'm hoping, because there are so many models competing now, I'm hoping this becomes one of the factors that models compete with each other on.

Starting point is 00:45:05 Yeah, you know, data set non-transparency, I guess, is like an emerging theme, because it's not like we had that for Falcon either. So yeah, we can hope for it. And that's a huge problem, right? Falcon, if you ask Falcon about human rights abuses in the Middle East, it has some very different opinions,

Starting point is 00:45:24 and I want to understand why. I want to know how they got it to do those things. Yeah, yeah, exactly. Yeah, we won't know, and all we can do is ask for more transparency there. But I do support the concept of building a business on open source models, because OpenAI will not randomly deprecate your models on you every three months. And I do think that for people who want a certain level of stability and are okay with trading off not being state of the art in three months, I think that is a perfectly reasonable tradeoff. Okay, I wanted to go back to Nathan a little bit and talk a little bit more about the preference data and the RLHF data.

Starting point is 00:46:01 So you estimated a $25 million cost for Llama 2. And as far as you can tell, that's actually primarily data collection, not GPUs. Yeah, this is based on kind of our pilot contract to do preference data collection at Hugging Face, because we're collecting a small amount of data in a similar way, and if you do a back-of-the-envelope cost calculation and scale it up by whatever, like, 10 or 100x to what they did, then you get towards this 20 million number. And it could be higher depending on how many flags they end up using in their data. So I think what they did with safety is pretty interesting.

Starting point is 00:46:44 So they, like, separated it and collected metadata. And that means they could also collect other metadata during the process. And as you kind of add more knobs to the preference data collection, it takes longer for people to do the task and the cost goes up. So I think it's pretty safe to say order of 10 million, especially given that's what was rumored with OpenAI around ChatGPT and everything like that. So it's not a shock at all to me. And is the focus on multi-turn significantly higher or comment worthy, I guess?

Starting point is 00:47:17 Not really. So generally, when signing this up, it comes down to, per prompt, how many tasks the workforce is going to do. And you could do an instruction prompt, which is one turn, or you could do a four-turn chat, and you would generally be able to trade off the number of labels that you get in that respect. So I think the multi-turn is more because open-source data sets don't contain a lot of that, which is something that we found in our work as well.

Starting point is 00:47:45 And they did that because they needed the model capabilities and they needed to train a preference model that can do that. And I agree. I think they must have figured that out months ago, because this also takes a lot of time. How it works generally, you can see this in the paper, how they say they have these RLHF versions, and generally what happens is you sign a contract and then these people sit you down and

Starting point is 00:48:05 they're like, we are going to try to do this over batches, and we scale up the amount of data we're sending over time so that we can do calibration. And each batch, you get some data from the vendor, and then you look through the samples and you see what you like and you see what you don't like, and then you change it going forwards. And what they did is they took those batches and they trained the model iteratively, and then they saw what their model needed, and they went back to the vendor to say, okay, we need more data in this regard to improve things. It's a really hands-on, really involved process. And I would guess it takes weeks to months for them to get all this data from a vendor.

Starting point is 00:48:39 It's definitely not something you can just get fast. And honestly, a potential reason why code is not as good is because it's way harder to get code data in this regard. So all the task companies are extremely limited in people that know a lot about code. So you get way lower throughput for getting preference labels in code and getting that kind of preference data. That makes a ton of sense. Anyone else have any other commentary, I guess, about the additional data collection? What I sense now is that there's a shift away from, I guess, the pre-training data sets, which are more opaque but also equally well understood, towards more of this preference and RLHF data.

Starting point is 00:49:21 Yeah, they spent a lot of time on the supervised fine-tuning data too. They actually compare human vendors to some of their models, and they were like, we should just use the human annotators for, like, reinforcement learning. I tell you what, the annotators are using the models anyway, right? Yeah, exactly. It's models all the way down. I think also the other, I mean, to me, some of these things are like alchemy, right? They're like, we stopped annotating supervised fine-tuning data at 27,540 annotations.

Starting point is 00:49:52 Why? It seems like such an arbitrary number, you know, that I feel like that's going to be one of the next research areas, you know, figuring out where the right limit is. Do we have maybe, do you know if there are any really good, again, like open source data sets for post-, not pre-training, like for fine-tuning and RLHF? Because I think one of the big moments with RedPajama was like, okay, we can take the Llama 1 data mixture, use all the open source datasets and just run GPUs at them. How do we get to do the same with the post-training flow?

Starting point is 00:50:26 Okay, so you were breaking up a little bit for the question, so I'm going to say what I think it was, and if it wasn't, you can jump in and clarify. So I think it's like, how do we recreate this supervised fine-tuning dataset, and like, can we do anything else with it after the fact? Yeah, so this is another thing that we started doing, and I think that the open source equivalent is something like Open Assistant, which created a really high quality dataset artifact. And then the recent trend is for this thing that's called, like, an uncensored dataset, which I think is a totally silly name, because really what they're doing is they're removing

Starting point is 00:51:05 instructions like "as a language model, I don't want to say this," and therefore when you remove these things, the model gets more helpful. So that's just going to be the new type of data, which is just clean responses to the instructions with really strong distribution control. And the thing about recreating this is that it's hard to create a diverse set of tasks. So what they are essentially paying money for is someone to make sure that you're not getting a whole bunch of the same poems or something. It's like, getting 27,000 weird creative tasks that don't all overlap with each other is why you have to pay a lot of money for it, rather than saying, oh, we have 250 people on this

Starting point is 00:51:42 call. Let's all do 10 of them. And then that's a solid start. Like, we would just have a totally misshapen distribution and it wouldn't be that useful. So I think, say, you can go look at, like, InstructGPT, and other papers like this have breakdowns of what that instruction data, the supervised fine-tuning data, actually looks like, but actually creating it is pretty hard. And I do think that the vendors provide a really high quality amount of data, but their point about the models being able to

Starting point is 00:52:11 create it is also really true. So it's pretty borderline right now. And Anthropic stopped using that in their future work. So, like, Anthropic's new base models are just good enough at responding to instructions that they don't need to do supervised fine-tuning. And that's, like, in the Constitutional AI paper. So it's like, I don't think that's the place to invest time.

Starting point is 00:52:31 It's much more on the preference side, to get the RLHF model and to get these preference models going. So then maybe you can even do creative things like Constitutional AI and stuff after that. Yeah. So if you want to do work in open source today, you think you're better off contributing to this side versus, like, trying to train yet another model.

Starting point is 00:52:52 Yeah, there's no preference models out there. It's astonishing to me, especially given that Meta's paper is like, oh, we use an ensemble of two preference models. The thing that I want to see them do, or someone do, is, like, take a base Llama model and then also train another preference model that's for code, and then try to do RLHF where you, like, have a prompt flag so all the code questions get rated by their own preference model as well, and see what that can do. Because they already broke it down into, like, instruction helpfulness and safety. It's like, why can't we add another one? It's so obvious that I'm surprised it didn't, it just makes a lot of sense. Seeing it in the paper, I was, like, stoked. You know, this conversation gave me a bit of an

Starting point is 00:53:33 idea for, essentially, Llama Stack Overflow. Like, you imagine, like, Stack Overflow with, like, sort of Llama at its base, but then, like, it's not very good at coding, but we can actually do ratings, like preference ratings, on entire conversation chains. And at some point, we'll accumulate the code data set that we need to fine-tune Llama. That would probably do it. Yeah. There's challenges in base models and how to execute code to get feedback and stuff, but

Starting point is 00:54:02 we've seen early experiments, and we worked on one, funny enough, that was called StackLLaMA. We did like a... Nice. ...experimentation of that at Hugging Face. And it's out there. It's ready for someone to invest more time in it and do it. I think, especially now that Llama 2, like, Llama 2 is going to be easier to work with.
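
The idea floated a moment ago, flagging code prompts so they are rated by their own preference model, can be sketched as a simple dispatch over reward functions. This is a toy under loud assumptions: the two scoring functions below are stand-in heuristics rather than trained preference models, the flag names are hypothetical, and swapping in a specialist score is only one of several ways such scores could be combined. It is not the method from the paper.

```python
from typing import Callable, Dict, Optional

# A reward model is treated here as any function (prompt, response) -> score.
RewardModel = Callable[[str, str], float]


def toy_helpfulness(prompt: str, response: str) -> float:
    """Stand-in heuristic: longer responses score higher, capped at 1.0."""
    return min(len(response) / 100, 1.0)


def toy_code_quality(prompt: str, response: str) -> float:
    """Stand-in heuristic: rewards responses that define a Python function."""
    return 1.0 if "def " in response else 0.0


def score_response(prompt: str, response: str, flag: Optional[str],
                   reward_models: Dict[str, RewardModel],
                   default: str = "helpfulness") -> float:
    """Rate a response with the specialist model for its flag, if one exists."""
    model = reward_models.get(flag, reward_models[default])
    return model(prompt, response)


if __name__ == "__main__":
    models = {"helpfulness": toy_helpfulness, "code": toy_code_quality}
    answer = "def fizzbuzz(): pass"
    print(score_response("write fizzbuzz", answer, "code", models))
    print(score_response("write fizzbuzz", answer, None, models))
```

In an RLHF loop the returned score would serve as the reward signal for the policy update; adding a new specialist is then a matter of registering one more entry in the dictionary and flagging the matching prompts.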

Starting point is 00:54:22 It's just, better language models are a little bit easier to steer. Absolutely. Alex, and Whole Mars Catalog, you just joined, and I'm sure you have a question. Yeah, go ahead, Alex. I just want to complete kind of what Nathan said. It's going to be easier to work with because a ton of the ecosystem and the different kinds of things that the first Llama opened up are now there.

Starting point is 00:54:44 GGML is there, GPT4All, and the Nokia browsers, like all the different ways to run Llama on your laptop, already kind of exist, and now we're just going to see

Starting point is 00:54:56 the commercial folk come in, the folks for whom working on this actually needs, like, a dollar sign afterwards, and now they'll be able to also participate in this. And we've seen this already. I don't know if you guys

Starting point is 00:55:06 talked about this or not. Scale AI apparently had early access to this and now released, I think, an open source, like full open source, toolkit to fine-tune. Mosaic, which is now

Starting point is 00:55:18 Databricks, also chimed in, and it's now super simple to fine-tune Llama on their infrastructure. Even though they have the MPT models, they still want to support Llama. And so, yeah, like the ecosystem exists, and I think Nathan's completely right, it's going to be easy to use, easy to fine-tune. Yeah, like at Hugging Face, I think

Starting point is 00:55:35 every library, like all these people at Hugging Face were working super hard this weekend to make day-zero support for Llama 2. Like Transformers, PEFT, TRL for RL, like all these people put in the hours to make it so it's there, like, this week. It's like, people are doing this now instead of talking on a podcast. They're fine-tuning this thing, I'm sure. For what it's worth, I did actually look into the Scale thing, because I thought that was kind of interesting. In their announcement, they never said that they were directly used on Llama 2.

Starting point is 00:56:05 Perhaps they're not allowed to say so. All they say is, Scale AI is proud to be a Meta launch partner, we're launching a platform for customizing LLMs, blah, blah. And obviously, you know that Scale does annotation. So I think it's just heavily implied, but I don't think they're allowed to say. Yeah, Surge and Arthur did that supervised data at least. I think they did more of it too. Go ahead.

Starting point is 00:56:31 Quick Hugging Face Transformers question. I really want to run Llama 2 on my M2 Mac using Metal, so it takes advantage of the GPU integration in the M2. Could somebody please figure out how to do that with Hugging Face Transformers, then publish the world's most straightforward how-to-do-this document? Because I have not managed it yet, and I think that would be a huge capacity increase for all sorts of people.

Starting point is 00:56:53 Yeah. Pedro at Hugging Face is working on that, at least integrating these models with Apple directly. Fantastic. I agree. I agree. We agree. There's also a project called llama.cpp that hardware accelerates on the M2 for Llama 1. So I'm sure they're going to be updating that for the new models as well. I mean, I love llama.cpp, but I've not seen it run Metal yet. I need to, evidently, I haven't checked the README in the past few weeks.

Starting point is 00:57:27 Isn't it? As long as it's in GGML, it works, right? Yeah, and those are the converted models in GGML format. We were able to run one. You can actually split it between CPU and GPU and I don't know, misnames in the audience. We ran Llama 2 7B in GGML and it ran really fast. Fantastic. Yeah, again, if somebody wants to be really useful, publish nice, detailed step-by-step instructions for getting that working.

Starting point is 00:57:51 And I will benefit from it, and so will loads of people. I don't want to do it myself. I want somebody else to figure it out for me. Yes, and Simon's very good at this. You can just kind of copy and paste the kind of tutorial quality that he does. That'll be great for all of us. Thank you. I want to recognize Anton, who just joined. Hey, stranger. Hey, Swyx. How's it going, man? It's going well. We're very excited about open source models. What you got? Yeah, I mean, it's an exciting time, right? I got asked almost immediately, what does this mean for Chroma and retrieval and all the other things?

Starting point is 00:58:22 We're in the process of benchmarking and evaluating to see if it's actually suitable in the sort of retrieval-augmented generation use case. Intuitively, we have this idea that lighter-weight models want to perform well because you don't need so many weights for all the facts. You just need them to be reasoning machines. So, yeah, we're excited to be trying that out. We'll ship results as soon as we have them available. What evals do you look at for models as reasoning machines? I mean, there's plenty of retrieval-augmented generation benchmarks out there.

Starting point is 00:58:53 The one that I usually run as a quick test is the SciQ dataset, the multiple-choice question answering with distractors and supporting paragraphs. But there's entire batteries of these tests. One of the things that we're actually looking at doing at Chroma very soon, and we've been speaking to the AI research labs about this, is nobody's really got benchmarks that are relevant to production data. The benchmarks that exist are very academically oriented and fairly synthetic. So they consist of, you know, crowdsourced exam question answers.
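
[Editor's sketch] A quick test of the kind described above can be a very small harness. The snippet below is a minimal illustration, not Chroma's evaluation code. It assumes each item is a dict with the keys `question`, `correct_answer`, `distractor1` to `distractor3`, and `support` (the supporting paragraph), and that `answer_fn` is whatever function wraps your model and returns its text reply. The names `build_prompt` and `accuracy` are hypothetical.

```python
import random
from typing import Callable

LETTERS = "ABCD"


def build_prompt(item: dict, rng: random.Random) -> tuple[str, str]:
    """Format one multiple-choice item; return (prompt, gold_letter)."""
    options = [
        item["correct_answer"],
        item["distractor1"],
        item["distractor2"],
        item["distractor3"],
    ]
    rng.shuffle(options)  # so the correct answer is not always option A
    gold = LETTERS[options.index(item["correct_answer"])]
    lines = [f"Context: {item['support']}", f"Question: {item['question']}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines), gold


def accuracy(items: list[dict], answer_fn: Callable[[str], str], seed: int = 0) -> float:
    """Fraction of items where the model's reply starts with the gold letter."""
    if not items:
        raise ValueError("need at least one item")
    rng = random.Random(seed)  # fixed seed keeps option order reproducible
    correct = 0
    for item in items:
        prompt, gold = build_prompt(item, rng)
        if answer_fn(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(items)
```

Dropping the `Context:` line from the prompt and rerunning gives a crude measure of how much the retrieved paragraph is helping the model.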

Starting point is 00:59:25 They consist of sort of this really document-retrieval-oriented thing where it's like, find a document that's relevant to this query. But production use cases don't always look like that. So we're actually looking at, you know, community-sourced benchmarks that focus much more on what the real data actually looks like. Yeah, totally. The only one I can think of that is, I guess the most prominent one is the OpenAssistant dataset

Starting point is 00:59:50 that is kind of free and clear of any usage restrictions stuff. Yeah, I mean, would you get up to? Usage restrictions, I think for evaluating models, there are very few restrictions for use of these data sets. For benchmarking, it's very few restrictions. For training there is, for sort of commercial purposes there is, but for the case of like, does this model work well in a retrieval context? There are very few usage restrictions. Got it. Amazing.

Starting point is 01:00:19 Who else has questions or topics that you want to bring up about Llama 2? One thing that I was thinking about is in the benchmarks they compared to GPT-4, but if what George Hotz said on the podcast was right and GPT-4 is like eight attention heads, I wonder when people are going to, you know, get a Llama 2 mixture of experts going and benchmarking that. Maybe it will be better. I don't know. Yes, there is a little bit of a playbook that has been published out there. So, I mean, it takes more skill than I have. But I'm sure someone else out there is currently working on it. I think that the Chinese universities have made some interesting progress there.

Starting point is 01:00:57 Yeah, Simon, and then we'll go to Mars. So we talked about retrieval-augmented generation. The other thing I'm excited about is Toolformer, right? The thing where it can call functions, essentially. And that's mentioned in the paper. They mentioned they benchmarked along that, but I didn't get a feel for something it was really good at. The thing I want is I want basically exactly the same API as OpenAI functions, but I

Starting point is 01:01:20 want it to run off of Llama 2. I think that would open up all sorts of opportunities. They said that capability was emergent and they didn't train on it. There's a line in the discussion where it's like, oh, yeah, we've got some tool performance but we didn't train on it. So now we can all go fine-tune on it, and it should be easier. We got Russell Kaplan in here from the space from Scale AI. I think we want to bring him up.
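
[Editor's sketch] The function-calling idea discussed above comes down to a simple loop: describe the available functions to the model, ask it to reply with JSON naming one of them, then parse and dispatch that reply. The snippet below is a simplified illustration, not the OpenAI API and not anything shipped with Llama 2. The reply format (`{"name": ..., "arguments": {...}}`), the `get_weather` stub, and the `dispatch` helper are all hypothetical.

```python
import json
from typing import Any, Callable, Optional


def get_weather(city: str) -> str:
    """Stub tool used only for illustration."""
    return f"(stub) weather for {city}"


# Hypothetical registry mapping tool names to Python callables.
REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(model_output: str) -> Optional[Any]:
    """Run the tool the model asked for, or return None for a plain-text reply.

    Assumes the model was prompted to emit JSON of the form
    {"name": "<tool>", "arguments": {...}} when it wants to call a tool.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # not JSON, so treat it as an ordinary answer
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in REGISTRY or not isinstance(args, dict):
        return None  # unknown tool or malformed arguments
    return REGISTRY[name](**args)


if __name__ == "__main__":
    print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

A model that was never trained to emit this format will produce malformed JSON fairly often, which is why the parser treats every failure as "no tool call" and why fine-tuning on the format is the appealing next step.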

Starting point is 01:01:43 I think he's got a few interesting things to say about how Scale is thinking about these things. I know that they were mentioned here before. Hey, Russell. Here you go. Great. Yeah, no, thanks, thanks, Anton. Yeah, we were super stoked about the Llama 2 release. Yeah, we put out an open source library, LLM Engine, for folks to fine-tune and serve Llama 2 and other language models,

Starting point is 01:02:03 whether hosted by Scale or on their own infrastructure. And I think generally at Scale, we're looking to start doing a lot more open source stuff. So, you know, one of the next things we're going to be doing is starting to fine-tune Llama 2 on interesting domain-specific data sets that we create or problem domain. So, Anton, you mentioned not sure how well it's working for retrieval. You know, we'd love to just, like, put together a data set that we could use to fine-tune these models to be good at retrieval. I think we have one planned out for SQL right now, potentially other tool use. So, yeah, I'd be really curious, you know, to hear from the audience if there are sort of requests for good fine-tunes of

Starting point is 01:02:39 Llama 2, or if anyone, you know, already has that data, you can just clone our repo, LLM Engine, and try it out. So I've got one for you. I want a clone of ChatGPT Code Interpreter built on top of Llama 2, which I imagine would require quite extensive fine-tuning. But my God, I mean, we've talked about this recently, how ChatGPT Code Interpreter really is a next-level AI tool. Being able to run our own version of that against Llama 2 would be incredible.

Starting point is 01:03:04 Yeah, that would be great. Yeah, we do a lot of code sort of data acquisition right now. So I think that's definitely in the wheelhouse. But yeah, that's a good idea to try out. Code data acquisition sounds so sinister. You know, you've got to write a lot of code. Write a lot of code. Yeah, I think we have something like 350,000 people all around the world who are sort of

Starting point is 01:03:26 helping with this stuff. And within that, there's a lot of domain-specific expertise. Is there a way that, like, so we were talking before you joined about Scale acquiring, I guess, preference data from developers rather than, I guess, the standard annotators that you have. Is this a need or focus that you have?

Starting point is 01:03:44 Is there a way that we can help? Yeah. How do we crowdsource? Yeah, no, definitely. So one of the interesting things has just been for our business where, you know, we do a lot of the RLHF labeling for all the companies training these foundation models has just been that the level of expertise required

Starting point is 01:04:00 has gone up tremendously, right? So we have a lot of our crowd now, it's really domain experts in specific areas, whether it's programming in a particular language or people who have, you know, passed the CPA or people who have passed the bar or licensed in some profession. That's really been where a lot of our sort of growth has been. And so, yeah, I mean, if anyone is a programmer and wants to kind of infuse their knowledge into the AIs that will power the rest of our society increasingly over time, you can just go to scale.com and sign up to start help programming.

Starting point is 01:04:34 Another benefit of this is by the time we have AI strong enough to simulate entire human beings, your data will already be in them. So you'll be resurrected and get to live forever in the afterlife. Indeed, we are the first immortals. It is the way to achieve immortality. Yeah. Immortality, take it, it's yours, but it's not in the battlefield. It's editing Wikipedia.

Starting point is 01:04:56 That is immortality. Mars, you had your hand up. Yeah, hey, really been enjoying listening to this conversation. I think it's such an exciting day with Llama 2 and the commercial license. One of the things that I've really been excited about, and I think Qualcomm made an announcement with Meta, and they said they're going to be looking at optimizing it for Snapdragon, hardware-accelerating it.

Starting point is 01:05:21 I think one of the most interesting things about these open source models, especially now that you have a commercial license, is actually running it on your laptop or even your smartphone, you know, maybe the 7 billion parameter model. And the kind of use cases that opens up that, you know, just weren't there a few months ago. I was wondering if people had any thoughts on that and what we might see in that area.

Starting point is 01:05:46 Meta just gave Tim Cook a huge softball for Apple to fix Siri, and they still hate each other. So I've been running Vicuna 7B on my iPhone for a couple of months, mainly as a demo, so I could just shove it at people's face and go, look, my phone's offline and it's still writing me terrible poetry. And I have to admit, it's fun. I've not yet found use cases for that quality of model for when I'm offline. And maybe I'm just not being imaginative enough. My hunch is that models that are small like that, that can run on your phone, are much more interesting

Starting point is 01:06:20 if you combine them with retrieval-augmented generation or tool usage and so on. And just as a plain sort of ChatGPT-style language model, I've not yet found many practical uses for it. I'd love to hear from, oh, that's not true. I use it for brainstorming occasionally. If I want to come up with a name for something, that's like, I used to dread naming things. Now I'm fine with naming things because I get a language model to brainstorm for me. The one on my phone is good enough to do that.

Starting point is 01:06:44 I've had it come up with some names for things for me so far. We talked about evaluation a lot. I've used it for naming and I've also used these models to kind of generate evaluation prompts, which is kind of a different way to do it. It's like come up with some hard Python coding questions where you put a bug in this type of function. Like, I'm not going to come up with that on my own. Yeah, it can be a really useful spot check, I guess.

Starting point is 01:07:08 Or, I don't know, mental orientation tool, whatever we call that. So can we take a minute to do some Kremlinology here? What's the deal with, like, friendship ended with Sam Altman. Now Mark Zuckerberg is my best friend with Satya. I want to get into that. Satya was smiling a lot more in this picture with Mark than with Sam. That's what I noted. Wait, there's a picture?

Starting point is 01:07:30 What? Yeah, Satya posted a photo with Mark, and he was like just laughing away. And then I looked back at the one that, remember the one you posted, Satya and Sam together? At, I think, the Build conference or something. With Satya, Sam and Sam's nipples, yes. And Satya was not smiling as much. I don't know. But I really wonder what the dust did, you know, OpenAI does have to pay back a lot of money to Microsoft.

Starting point is 01:07:57 It's kind of crazy that Azure is the launch partner because OpenAI is exclusively running on Azure hardware. So it's a very, very curious move, right? And I can't really disentangle it. Given sort of the scope of Microsoft

Starting point is 01:08:13 investment in OpenAI is entirely in Azure credits, like one interpretation of this move is that they've already got OpenAI locked in, right? They're not going anywhere. So might as well get the other

Starting point is 01:08:28 contending models, right? If you're Satya, how are you thinking? The only thing that we know for sure accrues value in this environment is owning compute, and that's what Microsoft has. Yes, but also AWS is also a launch partner, right? What does it mean to be

Starting point is 01:08:44 a launch partner of an open source model? Like, if you can run compute, you can run it. I think that's the main question. Yeah. But I think Microsoft is clearly happy to be involved to them. It's like a semifference. The first one, Exclusivity just one way.

Starting point is 01:08:59 You know, it's not a two-way exclusivity. So they don't. That's whatever. The other thing is this will probably increase the demand, the compute demand on Azure from all of their enterprise customers, right? So, you know, whether they're selling compute to OpenAI or all the other enterprises they work with, you know, having more models available that everyone's using should just kind of keep growing that business.

Starting point is 01:09:20 Not to mention, I think a lot of their Azure customers probably have significant concerns about privacy, about putting sensitive business data through this, and being able to just run inference on your own hardware that you control probably is more appealing to them in some cases than running a REST API and calling out to OpenAI's infrastructure Azure. Well, they've got Azure endpoints for the OpenAI models. I'm actually not quite up to speed with the privacy model there, but my understanding is there's not really much difference. My hunch is that it doesn't matter if it is.

Starting point is 01:09:56 What matters is what people feel. It's the vibes. And you see so many of these, so many people, so many companies saying, no, absolutely no way we would pump any of our private data through somebody else's model, even if they say they won't use it for training, which they will do. But whereas I guess maybe they're okay with pumping it through Microsoft Azure, but at least it's on our own GPU reserved instances. Maybe that's what's going on here.

Starting point is 01:10:21 There's so much paranoia around this space at the moment. Yeah, a lot of the details come down to can you run it within your own virtual private cloud? I wish we could close enterprise customer security requirements on the vibes. But at least in my experience at Scale, people do, you know, there's some compliance function somewhere in the organization that has to sort of check the boxes that you're not going to get screwed on later. And so that's definitely been one of the big drivers of people looking to self-host their own open source LLMs more and more. Yeah.

Starting point is 01:10:53 And the other thing is that they did not use any Azure compute to actually train the model. So if you go in the paper, it mentions they only use their super cluster and their internal production cluster. So no Azure was used to train it. I guess it's just the inference partner. Yeah. So I mean, going back to the point of they just want GPUs to run. It's not about this is the best GPUs that we use.

Starting point is 01:11:16 They didn't even use it. I think what's really interesting about this release is that, you know, for a while, people have been talking about how, oh, is Meta behind in AI, generative AI language models. And, you know, I think Roon had a tweet that was like, the best open source model sounds a lot better than the fifth best language model. And it's actually totally true. And I actually think that companies, you know, if you are behind, if you're not in first place, if you open source stuff and you just sort of get the community using it, you can get a lot of goodwill, get a lot of adoption and actually really move the industry forward. So yeah, really cool to see Meta sort of put this out. And I think it will also spur a lot more open source from a lot of other companies.

Starting point is 01:11:57 I fully agree. I think this is something that we've been very excited about. We heard some leaks about it a couple months ago and then, you know, earlier this week, or I guess last week. And now it's fully out. Wait, maybe I'll do just a round for predictions. What happens next in open source models or with Llama? I'll go first. I'll go first.

Starting point is 01:12:16 I think the first thing that needs to happen here is the community will actually get a model into its hands and find out its true capabilities. Benchmarks only take us so far. Once that has happened, we're going to see an extensive sort of period of fine-tuning, where people are going to apply it to their particular applications and keep pushing the envelope here. And then if it is sufficiently capable, I actually think that we might find new uses for these models that we don't find in REST API-served ones because you can get at the internal state, right? The thing that I'm always thinking about, obviously, is embeddings and internal states and like modifications here.

Starting point is 01:12:50 And I think that there's actually a great deal of interesting research and engineering to be done by looking into what's happening in these models live, especially a sufficiently capable one, which we can do reasoning. And so I'm particularly excited about that. I'm particularly excited about having something at least sufficiently capable that we can start to reason about because the entire research community has access to it rather than, you know, behind a closed wall inside some of the bigger AI labs. Anyone else?

Starting point is 01:13:16 Simon, Nathan? Yeah, I would mostly just double down on that, and I could comment on how remarkable the collapse of kind of NLP research as it was has been onto OpenAI APIs. And this is an opportunity to reset some of that dynamic where so much academic work was just fine-tuning OpenAI models. And it was like, oh, sorry, we nuked all your fine-tuned models and things like that. From a values perspective, this is huge for research to kind of proceed as it was meant to be in a way. And that is wonderful. I'm looking forward to the first fine-tunes. I think like Alpaca is what unlocked LLaMA. I can't wait to see what people do, especially since everyone's already amped up and ready to go. So I think it'll be fascinating to see how those start shaping

Starting point is 01:14:05 up the next few days, a few weeks. And yeah, I want to see people, I want to see the applications. I want to see people figure out retrieval-augmented generation. I want to see people figure out if it can do Toolformer. All of those things, especially the tricks which make the sort of smaller, the 7B models able to solve interesting problems. And I think this is going to happen really quickly. You know, we've got so many more people who know how to work with these models today than we did when LLaMA came out back at the end of February.

Starting point is 01:14:30 So I'm expecting that to just be a whirlwind of activity starting about four hours ago. And yeah, I can't wait to see what happens. I totally agree. I think there's going to be an explosion of domain-specific and use-case-specific fine-tunes. And I think that the sort of first-order effects are going to be pretty clear on this different industry, this different domain. Everyone is going to start putting out these domain-specific fine-tunes, not just the companies themselves doing it for their own use case. But as someone said, like Alpaca sort of made LLaMA accessible, we'll have something really similar, but for each category of application.

Starting point is 01:15:10 And then I think the second order effect that's really interesting to me is, I think tool use and agents are going to get really good. Right now, people are using, you know, sort of off-the-shelf, tuned language models to try to build agents, have them use tools. But if you're building a, you know, an application,

Starting point is 01:15:28 and you just need to use that one tool really, really well, and now you have suddenly a GPT-3.5-class model that you can fine-tune exclusively to that tool, it's going to work really well. And I think that the barrier to utility is so high for these tool use real-world applications because of this sort of problem of exponential compounding of errors over long chains.
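
[Editor's sketch] The compounding-of-errors point above can be made concrete with one line of arithmetic. If each step of an agent's chain succeeds independently with probability p, the whole chain of n steps succeeds with probability p^n. The independence assumption is a simplification, and the per-step rates below are illustrative numbers, not measurements of any model.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    if not 0.0 <= p_step <= 1.0 or n_steps < 0:
        raise ValueError("p_step must be in [0, 1] and n_steps non-negative")
    return p_step ** n_steps


if __name__ == "__main__":
    # Illustrative per-step success rates over a 20-step chain.
    for p in (0.90, 0.95, 0.99):
        print(f"p={p:.2f}: 20-step success = {chain_success(p, 20):.3f}")
```

Under this model, moving a single tool call from 95% to 99% reliable takes a 20-step chain from roughly a one-in-three success rate to roughly four-in-five, which is the case for fine-tuning hard on one tool.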

Starting point is 01:15:53 But if fine-tuning works well for that, I think it's going to be a really big game changer. I am so bullish on agents. I'm well aware that they're nothing but toys today, although I can think of a couple of practical use cases, including in the fine-tuning context, Russell. We ought to talk about this actually later. But that's a really good point to my mind,

Starting point is 01:16:12 that sort of having an easy-to-fine-tune model for your particular agent use case is maybe going to make these things more useful than they are today. I'm very bullish on that. I'm hopeful, of course, because Chroma builds memory for agents, so it would be great for us, too. All right, I think, Alessio, I don't know if you have any predictions. I think I'm kind of out.

Starting point is 01:16:30 You guys are definitely taking all the ones that I was going to say. Wait, wait, wait, wait. Before we sign off here, let's go around the room. Probability of AI doom, improved or made worse by the release of Llama 2. Let's go. I couldn't care less. I don't care about the doom scenarios. I care about building stuff with what we've got.

Starting point is 01:16:51 So none. It has not moved your needle. No, my needle is stuck on the sort of metal, maybe 5%, but not worth thinking about too hard. All right, 5% doom. I'm willing to accept 5%. doom. We've we've we've accepted way more percent doom and other technologies. I'm a doomerism so we're gonna use it for more good than bad we'll be done with it. I would like to believe that having a model that we can actually understand and like go deep and develop on top of

Starting point is 01:17:21 it will not only avert the doom scenarios but will allow us to prepare better in case any crazy person wants to make doom on their own. A sufficient enough community of builders of LLMs and AGIs can stop that. Yeah, I think that's a really great point, actually. The safety story gets better when we have more opportunities to work with the core internals of the models as they actually exist instead of hypothetical abstract objects that we reason about. Yeah, I was going to say, like, I'm a pretty high P-Doom person, but it's moved down

Starting point is 01:18:00 because we can have, you know, GPT-5 or Llama 3, you know, explain the weights of Llama 2. And I do think that that improves interpretability quite a bit. How are you going to know if it's telling the truth? I know about these just-ask-the-model approaches, but I'm pretty skeptical. I've got to tell you.

Starting point is 01:18:19 Give it a Go board, you know, swap out one of the positions, see what happens, you know, that kind of stuff. You know, we've done small versions of this. We've done very, very small-scale versions of this already, right? So, I don't know. This is hand-wavy.

Starting point is 01:18:31 I mean, you know. No, I'm just genuinely curious about the ideas here, but that's a different discussion. Exactly. Yeah, yeah. Yeah, I just think it's amazing how these language model capabilities that just a few months ago felt cutting edge when people used them

Starting point is 01:18:46 for the first time in ChatGPT have now progressed to a state where it's almost becoming commodified and everybody's having these models. There's more and more of them popping up. People starting things and open source models exploding. I don't think necessarily we can fully understand the significance of what's happening here today.

Starting point is 01:19:08 But going into the future, it's probably going to be really common for pretty much every computer to be running large language models natively on the device. All right. Well, that's a very positive view of the future. I think we're all very encouraged by that. Yeah, I just want to thank everyone for joining and sharing their thoughts on Llama 2. Alessio, did you have parting thoughts? No, that was it.

Starting point is 01:19:32 Thank you, everyone. Thank you so much. We'll clean up the audio of this thing and post it tomorrow on Latent Space. But otherwise, I think we should follow what Russell and Nathan and the others have been saying, which is go play with Llama 2. So I guess we'll all go do that.

Starting point is 01:19:45 Have a wonderful day, everyone. Thanks, everyone. Thanks, everyone. Bye. Bye-bye. Have a great time.
