The AI Daily Brief: Artificial Intelligence News and Analysis - Is Mysterious GPT2-Chatbot Actually GPT5?

Starting point is 00:00:01 Today on the AI breakdown, is GPT2 chatbot actually GPT5 in the wild? Before that on the brief, Marques Brownlee calls the Rabbit R1 barely reviewable. The AI breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our YouTube, Discord, and our newsletter. Welcome back to the AI breakdown brief, all the AI headline news you need in around five minutes. Recently, prominent YouTuber and product reviewer Marques Brownlee got a lot of attention in the AI space when he called the Humane Pin

Starting point is 00:00:38 the worst product I've ever reviewed... for now. Basically, the gist of that review was that almost everything that was useful about the Humane Pin was better done with just your phone. However, at the same time, he saw a glimpse of the future, where it could be really nice to not have to be tethered to a screen. This kicked up a whole discussion about the role of reviewers, and the nature of what product launches should be.

Starting point is 00:01:01 And of course, it was a little bit colored by the fact that Humane had been building this product for years and had spent hundreds of millions of dollars on it, but it also brought up this larger question of just whether AI hardware is ready for primetime. Well, that question has come up again with a vengeance now that Marques has dropped his review of the Rabbit R1. He sums it up in two words: barely reviewable. Interestingly, Marques has not just identified this as a problem in the AI hardware space, but sees it as the apex of a trend that has been growing for a long time.

Starting point is 00:01:31 As he puts it, this is the pinnacle of a trend that's been annoying for years, delivering barely finished products to win a race, and then continuing to build them after charging full price. Games, phones, cars, now AI in a box. Let's listen to a clip from his review to get the gist of what he's saying. Like it feels like it used to just be, make the thing, and then put it on sale. Now it's like, put it on sale, and then deliver the half-baked thing and then iterate and make it better and hopefully with enough updates then it's it's

Starting point is 00:02:00 ready and it's what we promised way back when we first started selling it and then this this whole period in the middle is a mess and it's across all kinds of product categories too it's also happening with cars and vehicles getting announced and then delivering with like a half-finished state where you just don't get a lot of the features that you paid for and they're eventually coming soon with a software update you know smartphones obviously we've been seeing this for years but it does seem like now more than ever, there's at least one feature, one major feature of every smartphone launch that gets announced, but that's not coming until later in the year. And now these AI-based products are at like the apex of this horrible trend where the thing that you get at the beginning is like borderline non-functional

Starting point is 00:02:45 compared to all the promises and all the features and all the things that it's supposed to maybe someday be. But you still pay full price at the beginning, which is what makes it so crazy. Now, let's talk about community responses to this. A lot of people agreed. Adam Vjestica from The Shortcut writes, this is so true of gaming and it's a horrible trend that started during the last generation. I've got nothing against launching a game in early access, but so many titles are released with the promise of getting better down the line. Darrow Bassenjo writes, The Rabbit R1 and Humane AI Pin are the culmination of a trend of launching unfinished products that are sold at full price, and then users get the promised features only after software updates.

Starting point is 00:03:21 I saw many people compare this to treating your early customers like investors, which of course they're not. Dave Snyder writes, this is the byproduct of the product design and management strategy that pushes half-baked products for a few data points, an overwrought practice where too many test underdeveloped solutions, groping blindly for signal while ignoring the glaring flaws. In the realm of physical products, this approach is an absolute nightmare verging on fraud. Maybe, just maybe, we'll get back to shipping what's right. Any younger joked, You, an AI hardware CEO, say, I want to build a fun product that will improve the pace of AI progress.

Starting point is 00:03:56 Your investors say, ship an MVP and iterate, do things that don't scale. Marques Brownlee says, I'm about to end this man's whole career. And just as with the Humane Pin, where Marques was representative of the larger sort of consensus review take, the Rabbit R1 is also getting some pretty negative reviews even beyond Marques. Snazzy Labs writes, The Humane AI Pin is one of the worst products I've ever reviewed. And the Rabbit R1 is even worse.

Starting point is 00:04:21 Video later this week on why good ideas don't translate to good products and why Apple and Google will clobber all of these AI startups by the end of the year. Now, just to add a little bit of controversial sauce on top of this, Snazzy Labs also added, what I would like to reiterate is that the AI pin is leagues better than the R1. Yes, it's overpriced and not very good, but it knows what it wants to be and does a very limited number of things decently well. The same cannot be said of the Rabbit. They're not comparable.

Starting point is 00:04:45 But what about the Rabbit team? Ryan Fenwick, who does communications over there, says honest feedback, which is cool. We're in the very early stages of the new AI hardware industry. The important thing is to move fast, continuously update the product, and keep improving the experience for those of you who are on this ride with us. Jesse Lyu, the CEO, said, we shall see how fast R1 improves and evolves. We are a tiny team trying to catch the fast pace of AI. The current levels of AI need strong human supervised fine tuning.

Starting point is 00:05:10 You can't take your time polishing features without real user testing. We will push the OTA fix as early as tomorrow to address most of the bugs we found. Thanks Marques Brownlee for your very detailed explanation of LAM. Looking forward for you to do an R1 revisit very soon. So I think that this tweet is important because it's not just a founder trying to justify that they're early, but is, at least at a small level, giving an argument for why this is actually the right process to use. Jesse writes, you can't take your time polishing features without real user testing.

Starting point is 00:05:40 Joel Kerenen responded to Snazzy Labs with another argument for why the Rabbit R1 approach is better than the Humane AI Pin, saying, I would argue that the Rabbit R1 is better because it's substantially cheaper with no subscription. It's meant to be an impulse gadget purchase rather than a lifestyle commitment. Snazzy Labs responded with the snazzy response, I will say, you get what you pay for, but I think it's an interesting point. Is it more justified to charge someone $199 once to get that user feedback rather than $699 with an ongoing subscription? Now, there is a lot to unpack here. You can make all sorts of arguments from all sorts of angles.

Starting point is 00:06:12 I think one of the key questions is consumer expectation versus reality. One of the things that clearly frustrates Marques is the gap between what these devices promise and what they actually deliver. In other words, I think that he would say that these companies are not presenting themselves as an early beta, but presenting themselves as the thing, ready to go, ready to be in the world, and then trying to have it both ways, saying we're just a tiny team and we're trying to grow fast when things don't go that well. Of course, another line of common commentary, largely from entrepreneurs, was pointing out that

Starting point is 00:06:41 every new technology seems barely reviewable at the beginning. Graham Fleming even shared a video of one of Marques's very early reviews and said, if Marques Brownlee decided his initial videos were barely reviewable instead of posting and iterating, would he be where he is today? Now, that's a great score for Twitter points, but of course, the difference is that Marques wasn't asking people to pay money for something. Figma engineer Vivek writes, absolutely love what Marques Brownlee is doing. I don't understand how people can defend dumping trash on consumers and overselling it as the next great thing. Nobody owes anyone a benefit of the doubt. Don't want to get reamed by consumers and reviewers? Make something good. I think that these questions are really interesting. It seems pretty clear so far that the agentic capacities of AI just aren't really sufficient for the agentic promises of the AI devices that have been announced and released so far.

Starting point is 00:07:26 At the same time, that doesn't mean that, A, those devices won't actually catch up, especially as the agentic capacity of the AI software underneath catches up. And B, it doesn't mean that AI hardware couldn't take a different approach focused on a specific type of use case that might actually fit better. For example, neither the Humane AI Pin nor the R1 is focused on the same use case that some of the next generation of AI devices seem to be, of keeping a record of all your conversations and interactions in a way that allows you to go back and better recall that information. Will that be a killer application that doesn't require AI agent capacity? Maybe. Then again, we won't know until they get released. All in all, if nothing else is clear, it's that AI hardware is going to be a very difficult space. And so for now,

Starting point is 00:08:08 anyone who buys into these products has to understand that that's what they are getting into. For what it's worth, I think by and large, most people do. Now, that was obviously a little long for an AI breakdown brief, but it was a big point of conversation. So I'll just hit a couple more stories before we head on to the main part of the episode. First, Apple is apparently poaching from Google for a secretive AI research lab based in Zurich. Apparently, they've hired something like 36 employees from Google over the last few years, much more than any other company they've poached from.

Starting point is 00:08:35 And Microsoft has announced its latest big global AI investment, with CEO Satya Nadella announcing yesterday that the company would invest $1.7 billion in Indonesia over the next four years. As part of that, they're also committing to help train 2.5 million people in Southeast Asia with AI skills, including about 840,000 in Indonesia itself. Anyways, friends, that is going to do it for today's AI breakdown brief. Next up, the main AI breakdown. All right, breakers, Consensus 2024 marks the 10th gathering of the biggest event

Starting point is 00:09:05 that's devoted to all sides of the crypto, blockchain, and Web3 ecosystems. Join pioneering thinkers and builders as they delve into the future of DeFi and explore game-changing tech, from AI to ZK proofs and everything in between. The event is three days of jam-packed content, networking, and so much more. Some of the speakers at the event include Chris Dixon, the founder and managing partner at a16z crypto, Sergey Nazarov, the co-founder of Chainlink, Cathie Wood, the CEO of ARK, Hester Peirce, commissioner, of course, from the U.S. SEC, and Tom Emmer, Republican majority whip for the U.S. House of Representatives.

Starting point is 00:09:36 Visit Consensus 24.com to learn more and save 15% on registration with the code breakdown. That is 15% on registration with the code breakdown. Hello, breakers. Quickly before we get back into the rest of the episode, you guys might have been following along my journey with the AI breakdown, which is a very similar show to this, but for the artificial intelligence industry. One of the things that I found as I dug into that show is that there was a huge need for better educational resources that were actually practical and useful right away. I've just announced

Starting point is 00:10:07 a new company and platform called Superintelligent that's trying to build exactly that. It's a video platform for learning AI that features fun, fast, and immediately useful video tutorials. Each video tutorial is around five minutes and comes with a step-by-step how-to that gets people actually using the tools that we're talking about. We've just gone live with more than 300 tutorials and are adding 30 to 50 more per week. To check it out, go to besuper.ai. That's besuper.ai. Can't wait to see you guys over there. Welcome back to the AI breakdown. It has been a minute since an OpenAI model was where we were focused.

Starting point is 00:10:41 For the last couple weeks, everything has been about Llama 3. In fact, as we've discussed on this show, even the small models of Llama 3 that have been released so far come sufficiently close to GPT4 level performance that it's made people think very differently about the competitive landscape of LLMs. We're getting serious questions around whether, if open source keeps being this close to the state of the art,

Starting point is 00:11:04 it fundamentally changes the economics of advanced models. Will people just all opt to build on top of Llama 6 instead of paying a premium for GPT6? Anyways, the point is that Zuckerberg has been once again dominating the conversation, but apparently OpenAI was sick of that. Because for the last day or so, everyone has been talking about a model called GPT2 chatbot, which is rumored to secretly be GPT4.5 or even GPT5 out in the wild in advance of an official launch. So where was this model found?

Starting point is 00:11:36 Well, it was on the LMSYS Chatbot Arena. This is an LLM benchmarking site, and it appeared as one of the model options which people on the site could test. As Dan Shipper from Every put it, lmsys.org enables users to chat with various LLMs and rate their output according to different benchmarks without needing to log in. One of the models recently available is GPT2 chatbot.

Starting point is 00:11:56 There is no information to be found on that particular model name anywhere on the site or elsewhere. The ratings results generated by LMSYS benchmarks are available via their API for all models except for this one. The model name simply appears to be a cover for something else entirely. Yesterday afternoon, Professor Ethan Mollick wrote, There is a mysterious new model called GPT2 chatbot accessible from a major LLM benchmarking site. No one knows who made it or what it is, but I have been playing with it a little and it appears to be in the same rough ability level as GPT4. A mysterious GPT4 class model? Neat. One additional thing that Ethan noted was that anonymous testing is a service that lmsys.org offers.

Starting point is 00:12:34 This came out after Simon Willison responded to Ethan and said, Feels to me like a bit of a reputation risk to LMSYS if this is indeed a stealth model launch. They're supposed to be a neutral benchmarking tool, which is not a great look if they're working behind the scenes with model vendors in an opaque manner like this. lmsys.org responded to Simon saying, Just to clarify, following our policy, we've partnered with several model developers to bring their new models to our platform for community preview testing. These models are strictly for testing and won't be listed on the leaderboard until they go public. In other words, this is a thing that they do. Now, of course, it wasn't just the fact that this had a GPT name, but the fact that it appeared really good that got people talking.

Starting point is 00:13:10 EverArt founder Pietro Schirano wrote, A mysterious new model called GPT2 Chatbot has appeared on LMSYS and it's really good. Not only does it show incredible reasoning, but it also gets notoriously challenging AI questions right with a much more impressive tone. The example that he shared was the question, what weighs more, a kilo of feathers or a kilo of lead? This is a historically difficult problem for LLMs, but GPT2 Chatbot says, a kilogram of feathers and a kilogram of lead weigh the same. They both weigh one kilogram.

Starting point is 00:13:37 The key difference between the two is in their volume and density. A kilogram of feathers will take up much more space and appear much larger, while the kilogram of lead will be compact and small. This often leads to the classic trick question, but in terms of weight, a kilo is a kilo. It also apparently is good at math. Andrew Gao writes, GPT2 chatbot just solved an International Math Olympiad problem in one shot.

Starting point is 00:13:58 This, in my opinion, is insanely hard. Only the four best math students in the U.S. get to compete. CodeGen founding engineer Chase writes, can confirm GPT2 chatbot is definitely better at complex code manipulation tasks than Claude Opus or the latest GPT4. Did better on all the coding prompts we use to test new models. The vibes are def there. The model is also apparently good at code art.

Starting point is 00:14:21 Phil on Twitter writes, GPT2 Chatbot is insane at ASCII art, miles ahead of any other model. With Llama 3 70B, when asked to create an ASCII art unicorn, the output is very janky. Whereas with GPT2 Chatbot, it completely nails it. Cool Zippity on Twitter also asked about art, saying, I asked GPT2 Chatbot to generate a simple program for turning doodles into art. Every other model I've tested fails at this. They get a doodle function, but the button does nothing of artistic value. The version that he shares does a much better job.

Starting point is 00:14:51 So what are the theories for what this model could actually be? Brian Roemmele writes, I've been testing GPT2 chatbot for a few days. Today it seems to have gotten much more attention. It surpassed all of our ChatGPT4 benchmarks. Hypothesis: a few of us concluded it is a form of pre-lobotomized ChatGPT4 or heavily trained on it. Runway CEO Siqi Chen writes, My best guess: GPT4 knowledge plus Q-Star search reasoning equals GPT2.

Starting point is 00:15:18 General knowledge seems near identical to GPT4 with much better reasoning and planning capabilities. More expensive inference from Tree of Thought search would explain relatively slow inference and low rate limit. If true, this is a much bigger deal than it seems. GPT2 is likely to feel pretty similar for most general knowledge queries, but will outperform on reasoning. So I don't think it's an accident that this isn't named GPT4.5 or GPT5.

Starting point is 00:15:39 It is neither. It's a test bed for Q-Star or whatever you want to call Tree of Thought plus PRM these days. Q-Star, you'll remember, was a rumored OpenAI reasoning model that got a lot of attention last year. Continuing, Siqi writes, The next GPT5 will likely continue to ride on the scaling hypothesis plus a reasoning boost from this. Siqi also writes, and no, it isn't GPT2 fine-tuned, you are all out of your mind. The knowledge cutoff alone would make that make zero sense. What he's referring to is another theory, summed up here by Albs, who writes,

Starting point is 00:16:07 my guess is this mysterious GPT2 chatbot is literally OpenAI's GPT2 from 2019 fine-tuned with modern assistant datasets, in which case that means their original pre-training is still amazing and better than everyone else's four years later. Admittedly, most people did not agree. Julien Chaumond, the CTO of Hugging Face, writes, My personal guess on GPT2 Chatbot: given Omar Sanseviero, the chief llama officer at Hugging Face, has been off for the last 10 days, at this stage, strongly suspect it's a side project of his gone viral. Now, this was a little tongue-in-cheek, of course, but just speaks to how little information we actually have. What about what GPT2 said about its origins? Andrew Gao again writes,

Starting point is 00:16:45 it told me and others that it was made by OpenAI. This is a weak signal, though, because of data contamination. A lot of models are trained on OpenAI chats and thus think they were made by OpenAI. When he polled to ask, what do you think, GPT4.5, Grok 2, or another AI company, 58.9% of 2,100 voters said GPT4.5. And while that reflects the quality that people are seeing, some have pointed out that if this was the jump between 4 and 4.5, they wouldn't be real happy about that either.

Starting point is 00:17:13 Matt Shumer from HyperWrite AI says, GPT2 chatbot is good, really good. But if this is GPT4.5, I'm disappointed. Flowers from the future, the frequent OpenAI leaker, writes, GPT2 isn't better than current GPT4 turbo, so it's definitely not 4.5 or 5, and not even 4. Either this is a new GPT4 light model, or it really is a new GPT2 model

Starting point is 00:17:34 with a totally new kind of training or processing. The implications of the latter would be absolutely crazy. Being no help and adding more mystery to the whole thing was Sam Altman, who tweeted, I do have a soft spot for GPT2. Funny enough, he had initially written it as GPT-2, with a dash, but then about eight seconds later edited it to, I do have a soft spot for GPT2, no dash, which of course sent a whole group of people speculating on what that might mean.

Starting point is 00:18:00 Ethan Mollick again pointed out the frustration of the crypticness of the industry, saying, OpenAI may be one of the most important technology companies in the world today, but they really like to communicate through hints and oracle whispers. What is GPT2? At this rate, we will only know that GPT5 is being launched from an I Love Bees-esque alternate reality game. GPT6 will be known by the shapes made by the wheeling of starlings over Palo Alto, as well as certain signs in the heavens, and the first letters of every third tweet by Roon. Smokeaway writes, GPT2 is not the AGI you're looking for. And I think ultimately that's where we're going to land on this.

Starting point is 00:18:33 This mystery will likely at some point be solved, or it won't and it'll just stay mysterious. But the reason that there's so much attention on it is that, for as much as we were talking last week about how a close to GPT4 class open source model could change the game, people are still obsessed with the frontier. They are still obsessed with the true state of the art. Right now, it doesn't seem like that's going to change any time soon. So for the moment, we're just going to have to be content with this mystery. Sure, it's speculative, but it's not the least fun I've ever had in the AI space. Anyways, friends, that is going to do it for today's AI breakdown. Until next time, peace.
