The AI Daily Brief: Artificial Intelligence News and Analysis - Code Interpreter is GPT-4.5: A Summer AI Technical Roundup [feat. Swyx and Alessio of Latent Space]

Starting point is 00:00:00 Today on the AI Breakdown, I'm joined by the hosts of the Latent Space podcast to discuss everything that happened in AI last month from a technical and developer perspective. From Llama to Code Interpreter to open source debates and beyond, this is your summer technical AI roundup. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our YouTube, our newsletter, and our Discord. Welcome back to the AI Breakdown. Today I am very excited to be collaborating with the hosts of Latent Space. Latent Space is a podcast focused on AI development and AI engineering. It's hosted by Alessio, who is also a VC investing in AI and other frontier spaces,

Starting point is 00:00:42 and Sean, better known as Swyx, who is an AI developer and entrepreneur. Now, one of the things that makes the AI space interesting is that among the people paying attention right now, say the types of folks that are interested enough and focused enough to be listening to a daily AI news analysis show, like the AI Breakdown, even among those who are not developers themselves, they are still actively paying attention to technical developments in the field. There's an understanding, I believe, that when it comes to getting an edge and what's coming next, GitHub is much more important than Twitter for understanding what's coming down the pipeline. With that in mind, Latent Space and the AI Breakdown are hoping to do these technical news

Starting point is 00:01:18 roundups at a pretty regular clip, and we kick off looking back at a month that had some serious action. We're going to be talking about open source developments, the latest in the agent space, why AI companions are getting more attention, and of course, breaking down the significance of Code Interpreter and Llama 2. It's a great conversation. I think you'll learn a lot, so let's dive in. All right, what is going on? How's it going, boys? Great to have you here. Hey, good. How are you all? Great to be on. Good. I'm excited for this. Yeah, no, I am super excited. I think, you know, we were just talking a little bit before this that the AI audience right now is really interesting. It's sort of on the one hand, you have, of course, the folks who are actually in it,

Starting point is 00:02:01 who are building in it, who are, you know, or dabbling because they're in some other field, but they're fascinated by it and, you know, are spending their nights and weekends building. And then on the other hand, you have the folks who are, you know, what we used to call non-technical, perhaps, but who are actively paying attention in a way that I think is very different to the technical evolutions of this field because they have a sense or an understanding that it's so fast moving that the place that they have to be paying attention to is what's changing from the standpoint of developers and builders. So what we want to do today is kind of reflect on the month of July, which had a couple of, I think, really keystone events in the context of

Starting point is 00:02:44 what it means for the technical development of the AI field and where it leads, how people's frameworks are changing, how people's sort of sense of things have changed over the last month. And I think that the place to start, although we could choose a lot of different examples, is with an idea that you guys have spent a lot of time sharing on Twitter and at other places that the launch of Code Interpreter from OpenAI, which is nominally a ChatGPT plugin, actually represents functionally something closer to the release of GPT-4.5. So maybe we can start by just having you guys sort of explain that idea, and then we can kind of take it from there. I'll maybe start with this one.

Starting point is 00:03:26 Code Interpreter was first announced as a plugin, at least in the plugins announcement from March. But from the start, it was already presented as a separate model. Because at least when you look in the UI, you don't go into the ChatGPT plugins UI and pick it from a menu of plugins. It is actually a separate model in the drop-down menu, and it is so today. And I think, yes, it adds on an additional sandbox for running and testing code and then iterating on that.

Starting point is 00:03:55 And actually, you can upload files to it and do operations on files. And people are having a lot of fun uploading different binaries and hacking to see what the container is and trying to break out of the container. But what really convinced me that it might be a separate model was when people tried it on tasks that were not code and found it better. So Code Interpreter is poorly named, not just because, you know, it just sounds like a weird developer tool. But basically it's kind of maybe hiding some progress that OpenAI has made,

Starting point is 00:04:25 but it's completely not been public about it. There's no blog post about it. Code Interpreter itself was launched in a support forum post, you know, low-key. It wasn't even announced by any of the major public channels that OpenAI has. And so the leading theory is that, you know, I've dubbed it GPT-4.5. I think, like, if they were ever to release an API for that, they might retroactively rename it 4.5 in the same way that 3.5 was retroactively renamed

Starting point is 00:04:49 when ChatGPT was released. And since I've published that post or tweeted that stuff, the leading reasoning for why they did not do it is because they would piss off all the AI safety people. Yeah. No, I mean, it was sort of correspondent, obviously, like, a thing that's happened less just this month, but more over the last three months is a total Overton window shift in that AI safety conversation. Starting from, I think, about April or May when Geoffrey Hinton left Google, there has been a big shift in that conversation. Obviously, regulators are way more active now than they were even a couple months ago.

Starting point is 00:05:26 And so I do think that there are probably constraints in how OpenAI and any other company in the space feel like they can label or name things. And even just as we're recording this today, we just saw a trademark for GPT-5, which is sort of most likely, I think, just dotting the i's and crossing the t's as a company because they're eventually going to have a GPT-5. I would be very shocked at this point if there are any models that are clearly ahead of GPT-4 that come out before there is some pretty clear guidance from the U.S. government around what it looks like to release more advanced models than GPT-4. So it's an interesting moment.

Starting point is 00:06:04 Let's talk about what functionally it means for it to be that much better, better enough that we would call it GPT-4.5. And maybe what might be useful is breaking that apart into how it is improving the experience for non-coding queries or inputs, and then separately how it has made ChatGPT as a coding support tool different as well. I think there's a lot of things to think about. So one, models are usually benchmarked against certain tasks and, you know, that works for development. But then there's the reality of the model that, you know, if you ask, for example, a mathematical question to, like, GPT-3 or 3.5, you don't really get good responses

Starting point is 00:06:49 because of how digits are tokenized in the model. So it's hard for the models to actually reason about numbers. But now that you put a code interpreter in it, all of a sudden, it's not a math in the tokenizer, in the latent space question. It's like, can you write code that answers the math question? So that kind of enables a lot more use cases that are just not possible with the transformer architecture

Starting point is 00:07:12 of the underlying model. And then the other thing is that when it first came out, people are like, oh, this is great for developers. It's like, I know what to do. I just ask it. But there's this whole other side of the world, which is, hey, I have this, like, very basic thing. You know how I'm a software engineer by background?

Starting point is 00:07:29 You know how sometimes people that have no coding experience come to you. And it's like, hey, I know this is like really hard, but could you help me do this? And it's like, it's really easy. And sometimes they think it's easy and it's hard. But code interpreter enables the whole space of problems to be solved independently by people. So it's kind of having, you know, Sean talked about this before, about some of these models being like a junior developer that you have on staff for you to be more productive. This is similar for non-business people. It's like having junior, you know, dev or like an intern analyst that helps you do these tasks that are not even like software engineering tasks.
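A minimal sketch of the math point made above, under toy assumptions: instead of producing digits token by token, the model returns a short expression and a sandbox executes it. `hypothetical_model_writes_code` is a stand-in with one canned answer, not a real model call, and `run_arithmetic` is an illustrative evaluator, not a description of how Code Interpreter's container actually works.

```python
import ast
import operator

# Arithmetic operators this toy sandbox is willing to execute.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def run_arithmetic(source: str):
    """Evaluate a pure-arithmetic expression without calling eval()."""

    def evaluate(node):
        # Plain numbers (bool is excluded because it subclasses int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](
                evaluate(node.left), evaluate(node.right)
            )
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, calls, attribute access and everything else are rejected.
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    return evaluate(ast.parse(source, mode="eval").body)


def hypothetical_model_writes_code(question: str) -> str:
    """Stand-in for the model (assumption): returns code, not digits."""
    canned = {"What is 123456789 times 987654321?": "123456789 * 987654321"}
    return canned[question]


if __name__ == "__main__":
    code = hypothetical_model_writes_code("What is 123456789 times 987654321?")
    print(code, "=", run_arithmetic(code))
```

The arithmetic is done by the interpreter, so the answer does not depend on how the digits were tokenized; the model only has to get the expression right.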

Starting point is 00:08:07 It's more like code is just a language used to express them. It's like a pretty basic stuff sometimes, but you just cannot, cannot do it without. So for me, the GPT-4.5 thing is less about, you know, is this a new model that is like built after GPT-4. It's more about capability. So if you have GPT-4 versus 4.5, you're probably going to get more stuff done with 4.5 just because of like the Code Interpreter piece. So for me, that's enough to use the code name. But as you said, Sam Altman said they're not training the next model. So if they said this is 4.5, he would have like, it would go back to Washington, D.C.

Starting point is 00:08:42 and be in front of Congress and have to talk about it again. Yeah. Well, one thing that I always want to impress upon people is we're not just talking about, like, yes, it is writing code for you, but actually, you know, if you step back away from the code

Starting point is 00:08:57 and just think about what it's doing, it's having the ability to spend more inference time on harder problems. And it matches what we do when we are faced with difficult problems as well. Because right now, any LLM, and this is before Code Interpreter, any LLM, if you give it a question, like, what is one plus two?

Starting point is 00:09:17 It'll take the same amount of time to respond as something like prove the Black-Scholes theorem. And that should not be the case. Actually, we should take more time to think when we are considering harder problems. What I think is the next frontier, and why I call it 4.5, is not just because it has had extra training. It's not just because it has the coding environment. It's also because there's this general philosophy and move that I see a lot of at OpenAI and the people that it hires. So in my blog post I called out Noam, who, like, I've personally met. So it's kind of awkward to talk about it, like I guess a friend or a friend

Starting point is 00:09:51 of a friend. But it's true that I have met multiple people at OpenAI who have specifically been hired to work on more inference time optimizations as compared to training time. And I think that is the future for GPT-5. The reason I think about this as 4.5 is that this is the direction of AGI, that we're going to spend more time on inference. And it just makes a whole lot of sense when you look at Noam's background working on Libratus and then Cicero, all of which is just consistently the same result, which is every second or millisecond extra spent on inference is worth like 10,000 of that in training, especially when you can vary it based on the problem difficulty. And this basically ties back to the origin of OpenAI, which originally

Starting point is 00:10:36 started playing games. They used to play Dota. They used to play all sorts of other games in sort of those reinforcement learning environments. And the typical way that you program these AIs during these games is when they have lots of branches, you take more time to search and figure out what the optimal strategy is. And when there's not that many branches to go down, then you just take the shortcut and give the right answer. But varying the inference time is the innovation here.
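One way to picture "varying the inference time" is a sampling loop that stops early when repeated answers agree and keeps going when they do not. This is only a sketch under stated assumptions: `sample_answer` is a hypothetical callable standing in for one model call, the thresholds are arbitrary, and nothing here is claimed to be what OpenAI does internally. It swaps in simple majority voting for the game-tree search described above.

```python
from collections import Counter
from typing import Callable, Tuple


def answer_with_adaptive_budget(
    sample_answer: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 15,
    agreement: float = 0.8,
) -> Tuple[str, int]:
    """Draw answers until one dominates, and report how many draws it took.

    Easy questions (consistent answers) stop at `min_samples`; harder ones
    (disagreeing answers) consume more of the budget, up to `max_samples`.
    Assumes 1 <= min_samples <= max_samples.
    """
    votes: Counter = Counter()
    leader = ""
    for drawn in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        leader, count = votes.most_common(1)[0]
        # Stop once the leading answer holds a large enough share of votes.
        if drawn >= min_samples and count / drawn >= agreement:
            return leader, drawn
    # Budget exhausted: fall back to the current leader.
    return leader, max_samples


if __name__ == "__main__":
    # Hypothetical "easy" question: every sample agrees.
    print(answer_with_adaptive_budget(lambda: "3"))

    # Hypothetical "hard" question: early samples disagree.
    stream = iter(["a", "b", "a", "a", "a"] + ["a"] * 10)
    print(answer_with_adaptive_budget(lambda: next(stream)))
```

The second return value is the number of samples spent, which is the part that changes with problem difficulty.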

Starting point is 00:11:05 One of the things that it seems, and what you just described, I think aligns with this, is I think there's a perception that more advanced models are just going to be bigger data sets with more of the same type of training versus sort of fundamentally different techniques or different areas of emphasis that go beyond just how big the data set is. And so one of the things that strikes me listening to or kind of observing how Code Interpreter works is it almost feels like a break in the evolutionary timeline of GPT because it's like GPT with tools, right? Alessio, you just kind of described it.

Starting point is 00:11:47 It's like it doesn't know about math. It doesn't have to know about math if it can write code to figure out the math, right? So what it needs is the tool of being able to write code, and that allows it to figure something out. And that is akin to, you know, humans are evolving for millennia not using tools. Then all of a sudden someone picks up a rock and this whole entire set of things that we couldn't do before just based on our own evolutionary pathway are now open to us because of the use of the tool. I don't think it's a perfect analogy, but it does feel somewhat closer to that than just,

Starting point is 00:12:18 again, like, it's a little bit better than 3.5. So we called it 4. It's a little bit better than four. So we called it 4.5 kind of mental framework. No argument there. Another big topic that relates to this that was subject of a lot of conversation, not just this month, but has been for a couple months, is this question of whether GPT-4 has gotten worse or whether it's been nerfed? And there was some research that came out around that with maybe variable sort of feelings around it. But what did you guys make of that whole conversation? I think evals are one of the hardest things in the space. So I've had this discussion with founders before. It's really easy. We always bring up Copilot as one example of like cutting edge eval where they not only look at how much

Starting point is 00:13:03 of their suggestions you accept, but also how much of the code is still in, a minute after, three minutes after, five minutes after. It's really easy to do for code, but like for more open-ended generative tasks, it's kind of hard to say what's good and what isn't. Like if I'm asking it to write the show notes for our podcasts, which it has never been able to do, how do you eval that? It's really hard. So even if you read through the paper that Lingjiao and Matei and James wrote, a lot of things are like, yeah, they're worse, but like, how do you really say that? You know, like, sometimes it's not cut and dry.

Starting point is 00:13:37 Like, sometimes it's like, oh, the formatting changed and, like, I don't like this formatting as much. But if the formatting was always the same to begin with, would you have ever complained? You know, there's a lot of that. And I think with Llama 2, we've seen that sometimes, like, RLHF can, like, go wrong in terms of, like, being too tight, you know? For example, somebody asked Llama 2, like, how do you kill a process in, like, Linux? And Llama 2 was like, oh, it's wrong to, like, kill.

Starting point is 00:14:02 and like I cannot help you like doing that, you know. And I think there's been more chat online about, you know, sometimes when you do reinforcement learning, you don't know what reward and like what part of like the, the suggestion, the model is anchoring on. Like sometimes it's like, oh, this is better. Sometimes the model might be learning that you like more verbose answers, even though they're right the same way. So there's a lot of stuff there to figure out.
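The Copilot-style measurement mentioned earlier (how many suggestions are accepted, and how much of the accepted code is still there a minute, three minutes, or five minutes later) can be sketched roughly as follows. The data shape and the exact-line-match rule are assumptions made for illustration; the conversation does not describe how Copilot actually computes this.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SuggestionEvent:
    """One suggestion shown to a user (hypothetical log record)."""

    suggestion: str   # text the assistant proposed
    accepted: bool    # whether the user accepted it
    # File contents captured N seconds after acceptance, keyed by N.
    snapshots: Dict[int, str] = field(default_factory=dict)


def acceptance_rate(events: List[SuggestionEvent]) -> float:
    """Share of shown suggestions that were accepted."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def retention(events: List[SuggestionEvent], seconds: int) -> float:
    """Share of accepted suggestion lines still present after `seconds`.

    Assumption: a line "survives" if an identical (stripped) line exists
    in the later snapshot. A real metric would likely be fuzzier.
    """
    kept = total = 0
    for event in events:
        if not event.accepted or seconds not in event.snapshots:
            continue
        remaining = {ln.strip() for ln in event.snapshots[seconds].splitlines()}
        for line in event.suggestion.splitlines():
            line = line.strip()
            if not line:
                continue
            total += 1
            kept += line in remaining
    return kept / total if total else 0.0


if __name__ == "__main__":
    log = [
        SuggestionEvent(
            suggestion="x = 1\ny = 2",
            accepted=True,
            snapshots={60: "x = 1\ny = 2\n", 300: "x = 1\ny = 3\n"},
        ),
        SuggestionEvent(suggestion="z = 0", accepted=False),
    ]
    print(acceptance_rate(log), retention(log, 60), retention(log, 300))
```

A metric like this works for code because survival in the file is observable; there is no equally cheap signal for open-ended outputs such as show notes, which is the difficulty being described.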

Starting point is 00:14:27 But I think some examples in the paper are clearly worse. Some of them are like not as crazy. But I mean, OpenAI's under a lot of pressure on like the safety and like all the instruction side. And we cannot, like, the best thing to do would be, hey, let's version lock the model and like keep doing evals against each other. Like doing an eval today and an eval that was like a year ago, there might be like 20 versions in between that you don't even know how the model has changed. So yeah, evals are hard is the TL;DR. And I think basically what we're seeing is OpenAI having to come to terms with the origin of itself as a research lab where updating models is just a relatively routine operation versus a product

Starting point is 00:15:15 or infrastructure company where it has to have some kind of reliability guarantee to its users. And so OpenAI, I think internally, as researchers are used to one thing and then the people who come and depend on OpenAI as a product are used to a different thing. And I think there's a little bit of cultural mismatch here. Even within OpenAI's public statements, we have simultaneously, Logan from OpenAI saying that the models are frozen and then his VP of product saying that we update models all the time, so they're not frozen. So which is it?

Starting point is 00:15:47 Like, they cannot simultaneously be true. I think they're trying to figure it out. I think people are rightly afraid of basing themselves on top of a black box. And that's why maybe we'll talk about Llama 2 in a bit. That's why maybe they want to own the black box such that it doesn't change out from under them. And I think this is fine. This is normal. But it's not that hard for OpenAI to figure out a policy that it is comfortable with and that everybody accepts.

Starting point is 00:16:15 And it won't take them too long. And this is not a technical challenge. It's more of a organizational and business challenge. Yeah. I mean, I think that the communications challenge that you're referencing is also extreme. And I think that you're right to identify that they've gone from like quirky little, you know, lab with these big aspirations to like epicenter of a national conversation or a global conversation about existential challenges, you know? And the way that you talk in those two different circumstances is very different. And you're sort of serving a lot of different masters, hopefully always guided by your own set of priorities. And that's going to be inherently difficult. But with so many eyes on it and people who are, The thing that makes it different is it's not just like Facebook where it's like, oh, we've got a new feature in the early days that made us all annoyed.

Starting point is 00:17:02 Like, you know, people were so angry when they added the feed, you know, and we all got used to it. This is something where people have redesigned workflows around it. And so small disruptions that change those workflows can be hugely impactful. Yeah, it's an interesting comparison with the Facebook feed because in the era of ad tech, the feedback was immediate. Like you change an algorithm and if the click-through rates or the, you know, the whatever metric you're optimizing for in your social network. If they start to decline, your change will be reverted tomorrow, you know. Whereas here, it's like we just talked about,

Starting point is 00:17:37 it's hard to measure. And you don't get that much feedback. Like, you know, there's sort of the thumbs up and down action that you can take in OpenAI, but I'm sure most people don't give feedback at all. So like OpenAI has very little feedback to go with on like what is actually improving or not improving. And I think this is just normal. It's kind of what we want in a non-ad-tracked universe, right? We've just moved to the subscription economy that everyone is, like, pining for. And this is the result, that we're trading off some amount of product feedback, actually. Super interesting. So the one other thing, before we leave the OpenAI ecosystem, the one other big sort of feature announcement from

Starting point is 00:18:19 this month was custom instructions. How significant do you think that was as an update? So, minor. So it is significant in a sense that you get to personalize ChatGPT much more than you previously would have. It actually will remember facts about you. It will try to obey system prompts about you. You had this in the playground since forever because you could enter in the system prompt in there and just ChatGPT didn't have it. And this is a rare instance of the ChatGPT team lagging behind the general capabilities of the OpenAI platform and they just shipped something that could have been there a long time ago. It was present in Perplexity AI. And if you think about it, basically every other open source chat company or

Starting point is 00:19:06 every other third party chat company already had it before ChatGPT. So what I'm talking about is Character.AI. What I'm talking about is the various AI waifu, AI girlfriend type companies, each of which have characters that you can sort of sub in as custom instructions. So I think ChatGPT is basically playing catch-up here. It's good for obviously the largest user base in the world of chat AI, but it's not something fundamentally new we haven't seen before. Yeah, I think that it was so clearly a user-centric,

Starting point is 00:19:39 sort of end-user, non-technical user in some ways, focused feature that is incredibly obvious and important in retrospect, but not sort of a massive change, just something that feels like it should have been there the whole time. But I guess building off of that, the sort of assertion or the notion that it was ChatGPT playing catch-up, I wonder to what extent one of the sort of sub-themes from this month was also a little bit

Starting point is 00:20:02 about trying to better understand or having a better understanding of where the value is going to accrue in this space. So specifically we had the first round of layoffs. This isn't something we actually had talked about talking about in advance, but we had the first round of layoffs from a couple different companies. And that, you know, we also saw in June the first decline in monthly users measured in terms of site visits and mobile app visits. It's over.

Starting point is 00:20:27 There's other ways to measure it. Yeah. The bubble has burst. It's over. You know, I think that, to some extent, yes, this is a very convenient narrative shift for publications that have been breathlessly talking about its, you know, inexorable rise. So I think it maybe is a little over amplified by that.

Starting point is 00:20:44 But do you think that sort of take this custom instructions, you know, or something like Perplexity? Perplexity has been ahead of ChatGPT on a lot of different features, right? The sourcing interface is amazing, right? There's a lot of things that I think Perplexity does so much better. But ultimately, is a project like that going to just be eaten up by sort of features being copied by sort of the board that actually has the, you know, the technical innovation underneath? I mean, I don't know. What do you guys think about that?

Starting point is 00:21:14 As a, you know, my full-time job has been a venture capitalist. I think that's one of the hardest questions that everybody is grappling with. I think there's a few things to think about. So one is how quickly do the open models catch up? So I think everybody agrees that long term, like access to intelligence through these models would be available to everybody. The question is, how much of a head start do the incumbents have in terms of, and by incumbents, I mean, like the AI incumbents, you know, like, yeah, OpenAI, like,

Starting point is 00:21:46 Perplexity, all these companies that were there like two, three years ago. Because then the big companies are going to be like, oh, well, I got a lot more data and I got a lot more distribution. And you can see that with Microsoft, right? It's like, who's trying to build the new Microsoft Word in the world of LLMs? Like, nobody, you know, like the existing players, like Office is building this in. Notion is like building this in their product. Like they already have so much of your data. Like Superhuman just rolled in their AI thing in their email product.

Starting point is 00:22:17 They already have so much more to put into the model to tailor it to your use case, then it's going to be hard for the new startups to get there. But if the startups are like two, three years of, you know, head start, that's a different question. But I don't know, Llama 2 doesn't bode well for a lot of them, right? Like the 70 billion parameter model is pretty good. So now all of a sudden, you got Llama 7B. That actually, I think, perfectly brings up a segue to the other major obvious thing that happened this month from both the technical perspective, but also just,

Starting point is 00:22:52 I think, long term from user perspective, which was Facebook releasing Llama 2. So this was something that was anticipated for a while, but I guess where to even start with the significance of Llama 2? I mean, how do you sum it up? If you're talking to someone who sort of isn't paying attention to the space, what does the introduction of Llama 2 mean relative to other things that had been available previous to it? It is the first fully commercially usable, not fully open source, we'll talk about that. First fully commercially usable GPT-3.5-equivalent model. That's a big deal because one, you can run it on your own infrastructure, you can run it on your own cloud. So all the governments and healthcare and financial

Starting point is 00:23:33 use cases are opened up to that. And then you can fine tune it because you have full control over all the weights and all the internals as much as you want. So it's a big deal from that point of view, not as big in terms of the, you know, pushing forward the state of the art, but it's still an extremely big deal. I think the open source part, so the day that it came out, I wrote this post about, you know, why Llama 2 is not open source and why it doesn't matter. And I was telling Sean, I'm writing this thing. And he was like, whatever, man, like, this license stuff is like so, so tired.

Starting point is 00:24:08 I was like, yeah, I'll just post it on Hacker News in the morning. And I think it was on the front page for, like, the whole day and it got like 228 comments. And I was recording the FlashAttention podcast episode in the morning. So I got out of the studio and there's like 230 comments of people being very like, you know, upset one way or the other about license. And my point, and, you know, I started an open source company myself in the past and I contributed to a bunch of projects, is that, yeah, Llama 2 is not open source by like the Open Source Initiative definition, but we just don't have a better definition for like models, you know, like, because it's mostly open source. You can use it for a lot of

Starting point is 00:24:48 stuff. So what's like the, and it's not source available because for a lot of stuff, you can use it commercially. So how do we find better labels? And my point was like, look, let's figure out what the better label is. But even though it's not fully open source, it's still like three million dollars of, like, FLOPs donated to the community, basically, you know? Who else in the open source community is stepping up and putting 3 million of H100s to make us train this model? So I think like overall net net is like a very positive thing for the community. And then you've seen how much stuff was built on top of it. There's like the quantized versions with GGML.

Starting point is 00:25:24 There's like the context window expansion. There's so much being done by the community that I think it was great for everyone. And by the way, three million is the low end. That's just compute. There's a reasonable estimate from Scale AI that the extra fine-tuning that they put on top of it was worth about $15 to $20 million. So that's a lot of money just kind of donated to the community. Although they didn't release the data. They didn't tell us any of the data sets.

Starting point is 00:25:50 They just say, trust us, we didn't train on any of your Facebook information. Which is the first instance where the models are more open than the data. And I think that's a reflection of where the relative shift in value might happen as a result of Llama 2. And so I don't know. You can take that in multiple different directions, but I just want to point that out. Yeah, I was going to say, so we first had the examples I made. So we first had the open models, open source models, which is like RedPajama.

Starting point is 00:26:22 So the data is open. The training code is open. The model weights are open. Then Stability kind of did the same thing with StableLM, which is like, hey, the weights are open, but we're not giving you the data. So you can download the model, but you cannot retrain it yourself. And then Llama 2, it's like, we don't give you the data. We'll give you the models, but you can only use it for some stuff.

Starting point is 00:26:45 So there's more and more restriction. But like Sean is saying, and we talked about this before, everybody wants to train their own model. Nobody wants to open source the best data set for X, which maybe is what more open source people should focus on. It's like how to build better specific data sets instead of spending, giving Jensen Huang another $5 million of GPUs. But the model gets more headlines for now, you know.

Starting point is 00:27:11 So that's what everybody does. Yeah. And I want to point out it's a reversal of the open source culture. There used to be this sequence of openness that you could kind of pick and choose from, whether it's open code all the way down to open data versus all the way down to open weights. And, you know, there's some variant combination. I wrote this post a long time ago. I don't remember the five levels.

Starting point is 00:27:32 But yeah, like it's very strange. I think it's just a relative discussion of where the money is going. And I think it basically shows that compute is becoming commoditized, which, yes, there's a GPU crunch right now. A100s are sold out everywhere across the board. People are commenting all about it this month. And there's people hoarding compute, like nobody's business. But as far as the value in AI is concerned,

Starting point is 00:27:58 it looks like compute is relatively commoditized. It's actually data that people are kind of safeguarding jealously. Going all the way back to the history of open source models, EleutherAI, when they trained GPT-J and GPT-Neo as the first reproductions of GPT-3, they released the data first. Stable Diffusion, when they trained Stable Diffusion, they released LAION-5B first. And that's, I think, reflective of like the normal sequence of events. You release the data, then you release the model weights. But now we're just skipping the data part. And I think it's just, it's fair. It's a way to guard yourself. You know, I think one of our conversations, I think it was Mike Conover

Starting point is 00:28:40 when he was talking about comparing our current AI era versus the 2000s era in search engines. He basically said like all of the public publishable information retrieval research dried up because all those PhDs went to work at Google and Google just sat on it. And this is now, you know, a fight for IP. And I think that it's just a very rational way of behaving in, I guess, like a capitalist AI economy. So one of the things that we were talking about before, starting with the Code Interpreter 4.5, or GPT-4.5 and why they might not call it that, is the emergence of this sort of regulatory, if not pressure, certainly intrigue. Do you think that there's potentially an aspect of that when it comes to why people are so jealously safeguarding the data? Is there more risk for being open about

Starting point is 00:29:30 where the data is actually coming from. The Books3 example is probably good. So MPT trained their model on a dataset called Books3, which is 190,000 books, something like that. And then people on Twitter were like, well, the stuff is not, you know, in the free, you know, it's under copyright still. You just cannot. Public domain.

Starting point is 00:29:52 Yeah, it's not in the public domain. You can't just take it and train on it. But the license for some of these books is like kind of blurry, you know, on like what's fair use and what isn't. And so there was like this whole thing on Twitter about it. And then MPT, you know, Mosaic first changed the license and they changed it back. And I think Shawn Presser from Eleuther was just tweeting about this yesterday. And he was basically saying, look, as ML engineers, maybe it's better to not try and be the main ethics knight and just say, hey, look, the data's open and let's try it.

Starting point is 00:30:26 And then maybe people later would say, hey, please don't use the data, and then we can figure it out. But like proactively not using all of this stuff can kind of hold the progress back. And, you know, he's more coming from the side of like Eleuther, which is like doing this work in public. So for them it's like, hey, you know, if you don't want us to train on this, that's fine,

Starting point is 00:30:45 but we shouldn't by default not do it. Versus if you're Meta, you know, they said that they trained Llama on like stuff available on the internet. They didn't say they trained Llama on stuff that is licensed to train on. It's a small difference. The other piece of this that I wanted to sort of circle back to because we kind of breezed over it, but I think is really significant.

Starting point is 00:31:06 We did get a little lost in this conversation around open source definitions. And I don't think that's unimportant. I think that people are rightly protective when a set of terminology has a particular meaning and a massive global corporation sort of tries to like nudge it towards something that is potentially serving their ends versus actually being by that definition. But I also think that your point, which is that functionally relative to the rest of the space, it probably doesn't super matter because what people mean is almost more about functionally what they can do with it and what it means for the space relative to more closed models.

Starting point is 00:31:44 And I think one of the big observations has been that the availability from when Llama 1 was fully, fully leaked, the availability of all of that has pretty dramatically changed one, the evolution of the space over the past few months, and two, I think from a business standpoint, how the big companies and incumbents have thought about this. So another big conversation this month, going back to sort of the venture capital side of your life, has been the extent to which companies or startups are, or big companies are not wanting to sort of sign on with some startup that's going to offer them, you know, AI, whatever, because their technical teams can just go spin up sort of their own version of it because of the sort of, you know,

Starting point is 00:32:31 availability of these open source tools. But I'm interested, I guess, in bringing the sort of open source, you know, in air quotes side of the conversation into the realm of how it has impacted how companies are thinking about their development in the context of the AI space. I think it's just raising the bar on like what you're supposed to offer. So I think six, nine months ago, it was enough to offer a nice UI wrapper around an OpenAI model. Today, it isn't anymore. So that's really the main difference. It's like, what are you doing outside of wrapping the model? And people need more and more before they buy versus building. Yeah, I think it actually moves the area of competition towards other parts of productionizing AI applications. You know, I think

Starting point is 00:33:23 that's probably just a positive. I feel like the, actually, the competitive pressure that Meta is putting on OpenAI is a good thing. One of the fun predictions that I made was in the next six months, OpenAI will open source GPT-3, which is not open source. It's so far behind the state of the art now that it doesn't matter as far as safety is concerned, and it basically keeps OpenAI in the open source AI game, which would be a nice to have. Of the things that people have been building, you called out a couple, context window expansion, but have there been any that really stand out to you as super interesting or unexpected or particularly high potential? One of our short-term podcast guests, the MLC team, they worked on wrapping Llama 2 to run on

Starting point is 00:34:09 MacBook GPUs. So I think that's like the most interesting gap, right? It's like how do we go from pay per token to like unlimited local use? That's one of the main things that keep even people like me from like automating a lot of stuff, right? It's like, I don't want to constantly pay OpenAI to do menial stuff, but if I can run this locally and do it, even if it's five times slower, I would do it. So that's a super exciting space. Yeah, I would say beyond that there hasn't been that much. I mean, it's only a few weeks old, so there hasn't been that much emergence coming from it.

Starting point is 00:34:43 I would definitely say you want to keep a lookout for basically what happened post-Llama 1, which, you know, keep in mind, it was only in February. The same thing that happened with Vicuna, Alpaca, and all the other sort of instruction-tuned research-type models, but just more of them because now they are also commercially available. We haven't seen them come out yet, but it's almost a guarantee that they will. You can also apply all the new techniques that have emerged since then, like Jsonformer, because now you have access to all the model weights to Llama.
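To make the structured-generation point concrete, here is a toy sketch of the general idea: the program owns the JSON keys and punctuation, and the model is only asked for individual values. This is not Jsonformer's actual API. The `fill_schema` helper, the `ValueGenerator` callback, and the `fake_model` stub are all hypothetical names made up for illustration, and a real implementation would constrain decoding at the token level using the model's weights.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-in for a local model call: given a prompt and a target
# type name, it returns the raw text for one single value.
ValueGenerator = Callable[[str, str], str]


def fill_schema(prompt: str, schema: Dict[str, str], generate_value: ValueGenerator) -> str:
    """Build a JSON object whose structure comes from code and whose values come from the model."""
    result = {}
    for field_name, type_name in schema.items():
        raw = generate_value(f"{prompt}\nField: {field_name}", type_name)
        if type_name == "number":
            result[field_name] = float(raw)
        elif type_name == "boolean":
            result[field_name] = raw.strip().lower() == "true"
        else:
            result[field_name] = raw.strip()
    # json.dumps guarantees the output is always syntactically valid JSON.
    return json.dumps(result)


def fake_model(prompt: str, type_name: str) -> str:
    """Canned answers so the sketch runs without any model installed."""
    canned = {"number": "7", "boolean": "true", "string": " Llama 2 "}
    return canned[type_name]


schema = {"model": "string", "params_b": "number", "commercial": "boolean"}
output = fill_schema("Describe the model.", schema, fake_model)
parsed = json.loads(output)
```

Because the braces, quotes, and keys never pass through the model, the result parses every time, which is the property that makes this family of techniques attractive once the weights are available locally.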

Starting point is 00:35:18 And I think that will also create another subset of models that basically was only theoretically applicable to sort of research quality models before. And so now these will be offered commercially as well. So like, yeah, nothing like really eye-popping, I would say. But it's been five minutes. Yeah, it's been a very short amount of time. And the thing of open source is that the creativity unlock is very hard to predict. And actually, I think happens a lot in the, let's just say, the less official part of the economy where I've been focusing a lot recently on the sort of AI girlfriend economy, which is huge.

Starting point is 00:35:59 I feel like it's not polite conversation, the amount of AI girlfriends and AI husbands, AI wives and AI husbands that people have been talking about. But it's real. There are millions of users. They're making a lot of money. And it's just virtually not talked about in polite SF circles. It feels like one of those areas that's going to be an absolute lightning rod when it comes to the societal debates around this technology. Like, you can feel it.

Starting point is 00:36:26 The people are going to hone in on that as example A of a change that they don't like. That's my guess, at least. So I have a really crazy longer term prediction, like maybe on the order of like 30 to 50 years. But AI girlfriend for Nobel Peace Prize. Because what if it solves the loneliness crisis? What if it cuts the rate of terror and school shootings by like half or something? Like, that's huge. My wife and I have joked about how every generation, there's always something.

Starting point is 00:36:59 Like, they always think that they're, like, so far ahead. And they think that there's nothing that their kids could throw at them that they just, like, fundamentally won't get. And without fail, every generation has something that seems just totally normal to them, that their parents' generation writ large just, like, has such a hard time with. And we're like, it's probably going to be like AI girlfriends and boyfriends. We're going to be like, yeah, but they're not real. They're like, yeah, but it's real to me, you know, or having these debates with our future 13-year-old. Our kids are only four and two now.

Starting point is 00:37:27 So it feels like maybe the right timeline. Yeah. I've heard actually, of all people, Matthew McConaughey on the Lex Fridman podcast. What? Yeah, yeah, no, he was, he was great. Shout out, shout out, shout out, shout out, Matt. They were kind of talking about this and they were noodling this idea of like computers helping us be better. So kind of like we had computers learn how to play chess and then we all got better at chess by using the computers to like learn and like experiment. They were talking about similarly in interpersonal relationships. Maybe, you know, it doesn't have to be you shut off from humans, but it's like using some of these models and some of these things to actually like learn, you know, how to better interact

Starting point is 00:38:10 with people. And if you're like shy and an introvert, it's like, okay, I can like try these jokes or like these conversation points with a model and like, you know, it teaches me, hey, that's not okay to say, or like, you know, you should maybe be more open, or I don't know. But I think that's a more wholesome view of it than like everybody just kind of runs away from society and has like 10 AI friends and doesn't talk to humans anymore. It's much less sexy to just say like AI friends, right? Even though like, if you look at the possibility set, the idea that people might have this sort of, to your point, like conversational partner that helps them effectively work through

Starting point is 00:38:51 their own things in this safe space that doesn't necessarily lead to romantic attachment just because the movie Her came out, right? Right. It can just be a panel of experts. I do have plans to build, you know, small CEO, which is my own boss and just for me to check in. And I actually will flag up, just living here in San Francisco,

Starting point is 00:39:11 you come across a lot of AI engineers who are interested in building mental wellness products. And a lot of these will take the form of some kind of journal. And this will be your most private thoughts that you don't really want to send anywhere else. And so actually all of these will take advantage of open source models because they don't want to send it to OpenAI. And that makes a ton of sense. For people who want to try it out, I'll also give a shout out to circle chat.com,

Starting point is 00:39:36 which is something I just came across from one of my friends here in the co-working space that I have, where it's one of those situations where you can actually try out, like, having a conversation and having a group of AI friends chime in and see what that feels like to you. It's the first example I've come across where someone's actually done this. Super interesting. So Llama and Code Interpreter, I think, stood out pretty clearly as really big things to touch. I wanted to check in just as we sort of start to maybe round the corner towards wrapping up. Claude 2 and Anthropic, how significant was this? In what ways was it significant? Again, was it something that was sort of meaningful from expanding the capacity set for developers,

Starting point is 00:40:16 or was it sort of more just a good example of what you can do if you increase the context window? But that's something that might ultimately become table stakes later on. Yeah, I can maybe speak to this a little bit. It is significant, but not earth-shattering clearly. I think it is the first time that Claude as a whole has just been generally publicly available. It used to be on a wait list. Yes, it has a longer context window, but to me, more significantly, it is Anthropic finding its foothold in the very competitive AI landscape. Anthropic's message used to be that, yes, we're number two to OpenAI, but we're safer.

Starting point is 00:40:54 And that's not a super appealing thing to many engineers. It is very appealing to some corporations, by the way. I think having the 100K context window makes them state of the art in one dimension, which is very useful. The ability to upload multiple files, I think, is super useful as well. And actually, I have met a number of businesses, I'm close to answer Sourcegraph, who are actually choosing to build with the Claude 2 API over and above OpenAI just because they are better at latency, better reliability, and better in some form of code synthesis. I think it's Anthropic finding its foothold, finally, after a long while of being in OpenAI's shadow.

Starting point is 00:41:35 Yeah. And we use Claude for the transcript and timestamps in the podcast. So shout out the 100K context window. You know, we couldn't do that. When we first started the podcast, we were like, okay, how do we chunk this stuff for like GPT-4 and all of that? And then Claude was like, just put the whole thing in here, man. And works great. So that's a good start. But I feel like they're always, yeah, second fiddle. You know, it's like every time they release something, people are like, cool. Okay. Some people like it. Most people are like, okay, I feel bad for them because it's like, it's really good stuff, you know, but they just need some help on the marketing side and the community buy-in. So I just spent this past weekend at the Claude hackathon, which is, as far as I know, Anthropic's first hackathon. I tweeted a pretty well-received video where I was just in the hackathon venue at 2 a.m. in the morning, and there was just a ton of people happy there. There were like 300 people participating for Claude. And I think it's just the first real developer excitement I've ever seen for Anthropic and Claude.

Starting point is 00:42:40 So I think they're on their way up. I think this paves the way for a multi-model future. That is something that a lot of people are betting on. It's just the odds are stacked against Anthropic, but they're making some headway. I do think that you should always be running all your chats side by side against ChatGPT and Claude and maybe Llama 2. So I have a little menu bar app that does that, that syncs all the chats across. And yeah, I can legitimately say that Claude wins about 30% of the time.
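The side-by-side habit described here can be sketched in a few lines. This is only an illustration of the workflow, not the menu bar app mentioned in the conversation: the backends are stub functions rather than real API calls, and the `preferred` table is a hypothetical stand-in for a human picking the best answer per prompt.

```python
from collections import Counter
from typing import Callable, Dict, List

# A backend takes a prompt and returns an answer. These are stubs; in real
# use each one would wrap a call to a hosted or local model.
Backend = Callable[[str], str]
Judge = Callable[[str, Dict[str, str]], str]


def ask_all(prompt: str, backends: Dict[str, Backend]) -> Dict[str, str]:
    """Send the same prompt to every backend and collect the answers by name."""
    return {name: backend(prompt) for name, backend in backends.items()}


def tally_wins(prompts: List[str], backends: Dict[str, Backend], pick_winner: Judge) -> Counter:
    """Count how often each backend's answer is judged best."""
    wins: Counter = Counter()
    for prompt in prompts:
        answers = ask_all(prompt, backends)
        winner = pick_winner(prompt, answers)
        assert winner in answers, "judge must pick one of the backends"
        wins[winner] += 1
    return wins


backends = {
    "chatgpt": lambda p: f"[chatgpt stub] {p}",
    "claude": lambda p: f"[claude stub] {p}",
    "llama2": lambda p: f"[llama2 stub] {p}",
}

# Hypothetical human preferences, one per prompt (assumed data for the demo).
preferred = {
    "Summarize this transcript.": "claude",
    "Write a unit test.": "chatgpt",
    "Explain this stack trace.": "chatgpt",
}

wins = tally_wins(list(preferred), backends, lambda prompt, answers: preferred[prompt])
claude_win_rate = wins["claude"] / len(preferred)
```

Keeping the judge as a plain callback means the same tally works whether the winner is picked by hand, by a rubric, or by another model.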

Starting point is 00:43:16 As far as any time I give it a task to do, I ask it a question, which is not, you know, doesn't make it number one, but it actually is very additive to your overall toolkit of AIs that you should use. Yeah, it's certainly the first time that you're, if you go on Twitter on any given day, you will see people saying things like, if you haven't used Claude, you know, for writing, you have to try it now. Or so, you know, like people who are really, who have made a switch, who have no affiliation, who are very convinced that it is now part of the suite of tools that people should really

Starting point is 00:43:50 be paying attention to, which I think is great. Yeah. We shouldn't be at a stage yet where we're, you know, totally on one, just one tool set. I'll also mention, I think this month, or at least July, was the first discussion of whether too much context is actually not a good thing. So there's a pretty famous paper, I forget the actual title of it, that shows a very pronounced U curve

Starting point is 00:44:12 in the retrieval abilities of large context models. And so basically, if the item that's being retrieved is at the start or at the end of the context window, then it has the best chance of being retrieved. But if it's in the middle, it has a high chance of being lost. And so is 100K context a good thing? Are you systematically testing its ability to retrieve the correct factual information or are you just looking at a summary and going,

Starting point is 00:44:36 yeah, it looks good to me. I think we will be testing whether or not it's worth extending it to 100K or a million tokens or infinite tokens. Or do you want to blend a short window like 8,000 tokens or 4,000 tokens, and couple that together with a proper semantic search system like the retrieval augmented generation and vector database companies are doing? So I think that that discussion has come up in open source a lot. And basically, I think it matches human memory, right? Like, you want to have a short working memory. You know, I was thinking about it. The one other, obviously, big sort of company update that we haven't spoken about yet was around the middle of the month, Google Bard had a big set of updates. A lot of it was sort of business focused, right? So it was available in more

Starting point is 00:45:24 languages. It was, you know, whatever. The sort of from a feature perspective, the biggest thing that they were sort of hanging their hat on was around image recognition and sort of this push towards multi-modality. But do you guys have any thoughts about that? Or was that sort of like not sort of on the high priority list as an announcement or development this month? I think going back to the point before, we're getting to the maturity level of the industry where like doing like model updates and all this stuff, like it's fine. But like people need more. You know, people need more. And like that's why Code Interpreter, it's like so good, right? It's not just like, oh, we made the model a little better, like we added this thing.

Starting point is 00:46:02 It's like, this is like a whole new thing. If you're playing the model game, if not, you've got to go to the product level. And I think Google should start thinking about how to make that work. Because when I search on Google Maps for certain stuff, it's like completely does not work. So maybe they should use models to like make that better and then say we're using Bard in Google Maps search. But yeah, I don't know. I'm kind of tuning off a lot of the single just model announcement. So Bard's updates.

Starting point is 00:46:27 I think the multimodality, they actually beat GPT-4 to releasing a generally available multimodal model. You can upload an image and have Bard describe it. And that's pretty interesting, pretty cool. I think one of our earliest guests, Roboflow, Brad, their CTO, was actually doing some comparisons because they have access to a lot of vision models. And Bard came up a little bit short, but it was pretty good. It was close to the state of the art. I would say the problem with Bard is that you can't rely

Starting point is 00:46:57 on them having reliable updates because they had a June update, I don't know if you remember, of implicit code execution, where they started to ship the code interpreter type functionality, but in a more limited format. If you run the same code, same questions that Bard advertised in the June blog post, that Sundar Pichai advertised in a video that he tweeted out, they no longer work in Bard. So they had a regression. That was very embarrassing, obviously unintended. And it shows that it's hard to keep model progress up to date. But I think Google has this checkered history with its products being reliable. You know, they also killed off Google Domains, RIP.

Starting point is 00:47:39 And I think that's something that they have to combat, which is like, yes, they're trying to ship model progress. I've met the Bard people. They're, you know, good, earnest people. But they have struggled to ship products even more than OpenAI, which is frankly embarrassing for a company the size of Google. Outside of the biggies, are there any other sort of key trends or, you know, maybe not even key trends, but sort of bubbling interest that you guys are noticing in the developer community that aren't necessarily super widely seen outside? You know, one of the things that I keep an eye on is all the AutoGPT-like things.

Starting point is 00:48:14 You know, in this month we had GPT-Engineer and we had MultiOn who held a hackathon and, you know, there's a few things like that, but, you know, not necessarily in the agent space, but are there any other themes that you guys are keeping an eye on, let's say? I'm sure Alessio can chime in, but I do keep a relatively close eye on that agent stuff. It has not died down in terms of the heat. Even the AutoGPT team, who, by the way, they're on the first floor of the building that I work in. They're hard at work shipping the next version. And so I think a lot of people are engaging in the dream of agents. And I think, like, scoping them down to something usable is still a task that has so far eluded every single team. And it, in the

Starting point is 00:48:56 is what it is. I think, I think all these very ambitious goals. We are at the very start of this journey, the same journey that maybe self-driving cars took in 2012 when they started doing the DARPA challenge. And I think the other thing I'll point out in terms of just overall interest, I am definitely seeing a lot of eval-type companies being formed and winning hackathons too. So what are eval companies? They're basically companies that let you monitor the success of your prompts or your agents and version them and just share them potentially. I feel like I can't be more descriptive just because it's hard to really describe what they do, just because they're not very clear about what they do yet.
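As a rough illustration of "monitor the success of your prompts and version them", here is a minimal sketch. It does not describe any particular company's product: `PromptRegistry`, the substring-match success check, and the `fake_model` stub are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PromptRegistry:
    """Stores named prompt versions and the pass/fail outcomes of each eval run."""
    versions: Dict[str, str] = field(default_factory=dict)
    results: Dict[str, List[bool]] = field(default_factory=dict)

    def register(self, version: str, template: str) -> None:
        # Templates use "{input}" as the placeholder for the test question.
        self.versions[version] = template

    def evaluate(self, version: str, model: Callable[[str], str],
                 cases: List[Tuple[str, str]]) -> float:
        """Run every (question, expected) case and return the success rate."""
        template = self.versions[version]
        outcomes = [expected in model(template.format(input=question))
                    for question, expected in cases]
        self.results[version] = outcomes
        return sum(outcomes) / len(outcomes)


def fake_model(prompt: str) -> str:
    """Stub model: only gives a digit when the prompt explicitly asks for a number."""
    if prompt.startswith("Answer with just a number"):
        return "4"
    return "I believe the answer is four."


registry = PromptRegistry()
registry.register("v1", "{input}")
registry.register("v2", "Answer with just a number. {input}")

cases = [("What is 2 + 2?", "4"), ("What is 3 + 1?", "4")]
score_v1 = registry.evaluate("v1", fake_model, cases)
score_v2 = registry.evaluate("v2", fake_model, cases)
```

Comparing `score_v1` against `score_v2` on the same cases is the basic loop: change the prompt, rerun, and keep the per-version results so regressions are visible.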

Starting point is 00:49:40 LangChain launched LangSmith, and I think that is the first commercial product from LangChain, probably, you know, one of the top one or two developer-oriented AI projects out there. And that's more observability, but it will also extend to eval as well because they acquired an AI eval project as well. So I would just call out that the general domain of how to eval models is a very big focus of the developers here in this setting. Yeah, we've done two seeds in companies doing agents, but they're both verticalized agents.

Starting point is 00:50:12 So I think the open source motion has been AutoGPT, do anything. And now we're seeing a lot of founders. It's like, hey, you know, if you take that and then you combine it with like deep industry expertise, you can get so many improvements to it. And then the other piece of it is how do you do information retrieval. So in general knowledge, like documents, everything is kind of flat. But when you're in a specific vertical, say finance, for example, if you're looking at the earnings from this quarter versus like 10 quarters ago, like the latest ones are like much more important. So how do you start to create this like information hierarchy between documents and then how do you use that? Instead of

Starting point is 00:50:52 doing simple like retrieval from like an embedding store, it's like how do you also start to score these things? That's another area of research from founders. Oh, I'll call out two more things. One more thing that happened this month was SDXL. You know, text to image doesn't seem as sexy anymore even though like last year it was all the rage. But I do think like it's coming along. I definitely wish that Google was putting up more of a fight because they actually at the start

Starting point is 00:51:21 the year released some very interesting papers that they never followed up on that showed some really interesting transformers-based text-to-image models that I thought was super interesting. And then the other element which, you know, I'm just like very fascinated by a lot of the, I don't know, like the, I hesitate to say this, but it's essentially like the character and like the character AI, let's just call it, Character, Replika, and all the sort of not-safe-for-work versions of that. I do think that a lot of people are hacking on this kind of stuff. The retention metrics on Character.AI blow away, you know, a lot of the metrics that you might see on traditional social media sites.

Starting point is 00:52:01 And basically, AI-native social media is something that is, there's something there that I think people haven't really explored yet. And people are exploring it, you know, like Noam Shazeer, basically the leading light behind the Transformers paper, like, Character.AI is his company. And like, you know, he's always a few years ahead of it. So not to keep returning to this theme, but I just think like it's definitely coming for a lot of the ways that we view things. Like right now we think Copilot and right now we think ChatGPT, but like what we really want to speak to is a way of serializing personality and intelligence. And potentially that is a leading form of mind upload.

Starting point is 00:52:43 So that gets into science fiction territory. I don't want to go there, but I do see a lot of people working on that. Yeah, I mean, we just got a Financial Times report that says that AI personas from Meta, from Facebook, could be coming next month. Oh, that's what the report was. Mm-hmm. There's one that's Abraham Lincoln, one that's like a surfer dude who gives you travel advice. So it's, you know, the sourcing is three people with knowledge of the project or whatever. And it, you know, no, obviously no confirmation from Meta, but it's no secret that Zuckerberg has been interested in this stuff.

Starting point is 00:53:15 Yeah. And, you know, the FT piece is actually, it's a good overview of why a company like Meta would care about it in very dollars and cents terms. And I want to state, like, the first version of this is very, very lame. Like, when I first looked at Character.AI, it was like, okay, I want to talk to Genghis Khan if I'm doing a history class, but it's like what a 10-year-old would enjoy. But I think the various iterations of this professionally would be very interesting. So on the developer point side of this, I have been calling for the development of agent clouds, which are clouds that are specifically optimized, not for human use, but for AI agents' use. And that is a former character. It's a character with the different environments,

Starting point is 00:53:56 with the different dependencies pre-installed that can be programmatically controlled, can get programmatic feedback to agents. And there's a protocol forming that some of the leading figures like AutoGPT and E2B are creating that lets agents run clouds. This has definitely terrified the AI safety people because we have gone from like running them on a single machine towards running, I don't know, clusters of machines, but it's happening. Let's talk about what comes next. Do you guys have any predictions for August or, if not predictions, just things that you're watching most closely? I think like for me, probably starting to see more public talk about open source models in production with

Starting point is 00:54:40 people using that as a differentiator. I think right now a lot of it is kind of like, oh, these models are there, but nobody's really saying, oh, I moved away from OpenAI, I'm using this. But in our, we run an early adopters community with about 1,500, kind of like, Fortune 500 large company leaders. And some of them were like, oh, we deployed Dolly in production and we're using it. We're not writing a blog post about it. So I think right now the perception is still everybody's using OpenAI and the open source models are like really toys. But I think we're going to get into September and, you know, you're not going to see a lot of announcements in August proper, but I think a lot of people are going to spend August getting these models ready and then going to end of the year and say, hey, we're here too. We're using the open models. Like, we don't need OpenAI. I think right now there's still not a lot of public talk about that. So excited to see more. Yeah, I'm a little bit, as for myself, this is very self-interested, obviously, but we had to head on the agenda. I wrote about the rise of the AI engineer.

Starting point is 00:55:43 And I think it's definitely happening as we speak. I have seen multiple tag, like people tag me multiple times a day on like how they're reorienting their careers. I think people professionalizing around this and going from essentially like informal groups and Slack channels and meetups and stuff towards certifications and courses and job titles and actual AI teams in every single company, I think is happening. I just got notification like two days ago that the, you know, in Meta, apparently you can sort of name your job title whatever you want internally. And so the emergence of the first AI engineer within Meta has been announced. And so I think as far as, you know, the near term, I do see this career, this profession come into place that I've been forecasting for a little bit.

Starting point is 00:56:30 And I'm excited to help it along. Awesome. Well, guys, great conversation. Tons of interesting stuff happening, obviously. Ironically, I think it's a relatively more quiet time in some ways than it even was. And my prediction for August is that we're going to see the extension of that. We're going to see sort of the biggest breath that we've had, at least from a feeling perspective, maybe since ChatGPT, but then we are going to rage back in September.

Starting point is 00:56:58 You've got Facebook Connect in September. You've got sort of just the return to business that everyone does after August. But of course, I think the hackathons aren't going to stop in the Bay Area. So people are going to keep building. And it's entirely possible that something hits in the next four weeks that totally changes that. Be exciting to see. Looking forward.
