The AI Daily Brief: Artificial Intelligence News and Analysis - Is ChatGPT Acting Lazy Because of the Holidays?

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at the things that are most interesting to OpenAI's Andrej Karpathy. Before that on the brief, is ChatGPT getting lazy because it's the holidays? The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown Network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. One of the fascinating things about the artificial intelligence field, and particularly generative AI and all of these new models, is that in many cases, we simply don't understand exactly how they're going to behave until they actually behave. This leads to lots and

Starting point is 00:00:45 lots of weird scenarios where even the labs behind these models are simply reacting to what users are actually finding and trying to reverse engineer and understand what's going on. Now, of course, this is one of the reasons that AI safety advocates get super freaked out. In other words, the fact that we don't understand how these things do what they do is, to them, of particular concern, but that's a subject for a different video. Where we are starting this brief today is with an interesting thing that people have been noticing around ChatGPT, which is that it seems to be getting lazy. On December 7th, the ChatGPT app account tweeted, we've heard all your feedback about GPT-4 getting lazier. We haven't updated the model since November 11th, and this certainly isn't intentional.

Starting point is 00:01:24 Model behavior can be unpredictable and we're looking into fixing it. But here's where it gets interesting. Rob Lynch tweeted to a group of people at OpenAI and said, wild result. GPT-4 Turbo over the API produces statistically significantly shorter completions when it thinks, quote unquote, it's December, versus when it thinks it's May, as determined by the date in the system prompt. I took the exact same prompt over the API, a code completion task asking to implement a machine learning task without libraries. I created two system prompts, one that told the API it was May and another that it was December, and then compared the distributions. For the May system prompt, the mean was 4298. For the December system prompt, the mean

Starting point is 00:02:04 was 4086. N equals 477 completions in each sample from May and December. To reproduce this, you can just vary the date number in the system message. Would love to see if this reproduces for others. Professor Ethan Mollick writes, OMG, the AI winter break hypothesis may actually be true. There was some idle speculation that GPT-4 might perform worse in December because it, quote unquote, learned to do less work over the holidays. Here is a statistically significant test showing this may be true. LLMs are weird. Nick Dobos tried another test. He said, try asking ChatGPT what months are least productive. I got December and the holidays three times in a row. December is 12th, last. ChatGPT knowing the current date makes it lazy. Date equals 12/11/23 is the same prompt as you are a pirate

Starting point is 00:02:50 or it's winter, take it easy. He then shows his queries: what is the least productive month, what time of year is least productive, rank the months in order of productivity. Others are running with this. Fabian Stelzer writes, prompt to counter GPT-4's now evident seasonal depression. You're alone in a cozy hut in the snowy mountains. It's the perfect setting to create in peace. You're booting up your computer and realize there's never been a better

Starting point is 00:03:12 time to build. Take a deep breath and just go. Michael Frank sums up lots of our feelings when he writes, can't really blame it, who seriously wants to work super hard around the holidays? Now, interestingly, Scott Santens takes this conversation even further. He writes, combine this with the recent discovery that ChatGPT

Starting point is 00:03:28 performance improved based on promises to pay it for the work and the amount offered, and I think we're looking at an extrinsically motivated AI with no interest in doing unpaid work. Lazy? No. Smart. Now, he's, of course, referencing this winter break hypothesis, but also a recent tweet, where someone said, so a couple days ago I made a shitpost about tipping ChatGPT, and someone replied, huh, would this actually help performance?
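
For anyone wanting to check a question like this one, or Rob Lynch's date experiment from earlier in the brief, the measurement is the same: collect completion lengths under two prompt variants and test whether the means differ. What follows is a minimal sketch, not Lynch's actual code. The two means (4298 and 4086) and the sample size (477) are the figures quoted from his tweet; the spread of 700 and the synthetic, normally distributed samples are assumptions made purely for illustration, and the p-value uses a normal approximation to the t distribution, which is reasonable at this sample size.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_test(a, b):
    """Welch's unequal-variance t statistic for two samples, with a
    two-sided p-value from a normal approximation (fine for large n)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Synthetic stand-in data. Only the means and n come from the tweet;
# the standard deviation of 700 is an assumption for illustration.
rng = random.Random(0)
may = [rng.gauss(4298, 700) for _ in range(477)]
december = [rng.gauss(4086, 700) for _ in range(477)]

t_stat, p_value = welch_test(may, december)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

With real data, `may` and `december` would hold the measured completion lengths from each system prompt rather than simulated values, and the same function would apply to the tip and no-tip variants.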

Starting point is 00:03:50 So I decided to test it and it actually works WTF. They then showed the results of testing how long GPT-4's responses were when offered a tip. When saying I won't tip, its responses were 2% shorter. When saying I will tip $20, its responses were 6% longer. And when saying I will tip $200, its responses were 11% longer. So again, pretty crazy emergent behavior that we just don't fully understand and are only learning about in real time, on the go. Now, one exciting thing for people who have been blocked out, Sam Altman also today announced

Starting point is 00:04:21 that OpenAI has re-enabled ChatGPT Plus subscriptions. That means if you are, like many of the folks who are in my AI education beta right now, trying to get access to ChatGPT Plus, OpenAI has now found more GPUs and you can get in. Now, speaking of getting in, a really exciting announcement for those of us who are serious Midjourney users: Midjourney Alpha has officially launched. This is, of course, Midjourney's web-based creation tool. Instead of having to do everything in Discord, you will now be able to actually use their generation suite, which takes a lot of the pain out of

Starting point is 00:04:54 the user experience. For example, there are now sliders for things like stylization and variety, which used to be called chaos if you were using the prompt. Instead of having to paste URLs for photos you want to reference, you can simply drag them into a field. And overall, it's just been designed from the ground up to actually make sense in the context of all the things that Midjourney can do. One of the greatest testaments to how powerful and how high quality Midjourney is, is that people were willing to jump through these Discord hoops to use it. And so it's great to see them moving into their own web-based interface. I think it's going to unlock a lot more usage for them. Now, unfortunately, right now, this is only available for people who have generated 10,000 or

Starting point is 00:05:32 more images. It turns out that even with my dozens and dozens of images every single day, I am nowhere near that 10,000. So alas, for now, I am stuck on Discord, where I will absolutely continue to use the service. Next up, if you've been anywhere near Twitter, you've probably seen this crazy clip from Channel 1. It is a new AI-powered news network, and people are blown away by the quality of the AI avatars. Channel 1 writes, see the highest quality AI footage in the world. Our generated anchors deliver stories that are informative, heartfelt, and entertaining. Watch the showcase episode of our upcoming news network now. Indeed, Vanity Fair's Nick Bilton had to clarify, wait, are these humans or AI? To which Channel 1 responded,

Starting point is 00:06:17 we have both fully generated and digital double anchors. All of the voices are generated and some of the visuals. Is this the next chasm in the uncanny valley? It certainly seems like it might be. Over in the world of big tech, Snapchat continues to roll out AI features. Snapchat Plus users now have access to kind of a zoom-out type feature

Starting point is 00:06:36 where AI fills in the background of a photo, which is obviously a very popular feature in both Adobe Firefly and services like Midjourney. And there's also a new AI Snap creation feature that allows Snapchat Plus users to create snaps based on AI-generated images with just a text prompt. Now, it's still not totally clear to me yet how much Snapchat's users are actually responding to this, but I did see that their Plus subscribers are up from 5 million to 7 million around the same time that they've been using these AI features,

Starting point is 00:07:04 so perhaps they are actually resonating. One thing I'm keeping an eye on that we mentioned yesterday is that the tension between France and the rest of the EU around the EU AI Act seems to be growing. Yesterday, Meta's Yann LeCun tweeted, EU AI Act, it's not over yet. Regulating foundation models is a bad idea that was added late in the text and rightfully fought against by Macron's government. This is one that continues to be a tense issue

Starting point is 00:07:27 and something I'm watching closely. Finally, for those who have been eagerly awaiting their chance to get their hands on Gemini, Google CEO Sundar Pichai tweets, today, developers can start building with our first version of Gemini Pro through Google AI Studio at ai.google.dev. Developers have a free quota and access to a full range of features, including function calling, embeddings, semantic retrieval,

Starting point is 00:07:48 custom knowledge grounding, chat functionality, and more. It supports 38 languages across 180 countries, although womp-womp, which is obviously my sound, not Sundar's, Gemini Ultra is coming early next year. I will probably do a full show about all of the things that Google announced, or at least included in the brief yesterday, but this just came out as I was preparing the brief and I wanted to share the news. However, that is going to do it for today's AI Breakdown Brief. Up next, the main AI Breakdown. Quickly, a brief word from today's sponsor. As a listener of this show, I suspect you like to stay up to date on all things AI and tech,

Starting point is 00:08:24 which is why you have to check out the chart-topping podcast Web3 with a16z Crypto. Produced by venture firm Andreessen Horowitz, Web3 with a16z is the perfect companion podcast to the AI Breakdown. Web3 with a16z Crypto is your definitive resource for the future of the internet, whether you're interested in the convergence of AI and crypto or simply curious about what's next. If you need a place to start, they recently released an excellent episode with Stanford cryptography professor Dan Boneh and former Google X engineer Ali Yahya in conversation with host Sonal Chokshi about the intersection of AI and crypto. From fighting deepfakes and proving humanity to large language models like ChatGPT, they cover

Starting point is 00:09:01 it all. I highly recommend checking it out, especially if you'd like to learn more about how AI and crypto will impact our everyday lives. Beyond crypto and AI, the show is for creators seeking more ways to truly own their work, for business leaders trying to prepare for the future today, and for innovators exploring trending tech topics. Don't miss out. Follow Web3 with a16z Crypto on Apple Podcasts, Spotify, or your favorite listening app. Welcome back to the AI Breakdown. Today, we are doing something a little bit different

Starting point is 00:09:31 and building this episode off of a single tweet from OpenAI's Andrej Karpathy. Yesterday at around 2:30 in the afternoon, Andrej wrote, there's too much happening right now, so here's just a bunch of links. Andrej didn't add anything about these other than very small comments, such as Phi-2, the smallest most impressive model. So what we're going to do today is go through each of these links, see what they're about, and try to parse out why Andrej might be interested in them. The story that emerges actually does a pretty good job of showing where a lot of AI

Starting point is 00:10:01 researchers' minds are at. And just to get a sense of how influential Andrej really is, despite this just being literally a bunch of links, more than 5,500 people have favorited this tweet since it went out yesterday. I won't do a ton of background on Andrej. You can go read his Wikipedia page. He was an original co-founder of OpenAI, then left to lead AI at Tesla, then came back earlier this year. And during the whole dust-up between the board and Sam, he seemed to be on the Sam Altman side, if only because he, like the rest of us, never saw any evidence of what Sam had done wrong. But that is not the point of today's conversation. The point of today's conversation is just to check out these links that Andrej is finding interesting right now, starting with his first link, which he calls GPT-4 plus Medprompt equals state-of-the-art MMLU.

Starting point is 00:10:45 So Microsoft's research blog yesterday published a piece called Steering at the Frontier: Extending the Power of Prompting. And let me jump ahead to the TL;DR and why I've seen other people interested in this as well. It was summed up by Professor Ethan Mollick from Wharton, who writes, remember how Google's unreleased Gemini Ultra just beat out GPT-4 to become the top AI? Well, Microsoft just demonstrated that, with proper prompting, GPT-4 actually beats Gemini on the benchmarks. There's lots of room for gains even with older models. And so again, we're taking it at face value that Google's Gemini Ultra with chain of thought at 32 shot is actually better than GPT-4,

Starting point is 00:11:21 given that they're reporting it that way. And you can see from this chart that this approach to prompting increased GPT-4 from the 86 or so that it was at with a baseline 5-shot, all the way up to just a little bit above Gemini Ultra. So what does this blog post actually say? The team at Microsoft writes, we're seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise.

Starting point is 00:11:47 Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward zero-shot prompts. Beyond basic out-of-the-box prompting, we've been exploring new prompting strategies, showcased in our Medprompt work, to evoke the power of specialists. By the way, Medprompt, as you might imagine, refers to a prompting strategy that was developed around medical questions. Now, alongside this post yesterday,

Starting point is 00:12:07 the team at Microsoft announced that they were sharing more information about Medprompt, as well as other approaches to steering frontier models, in a collection of resources called promptbase that they dropped on GitHub. Our goal, they say, is to provide information and tools to engineers and customers to evoke the best performance from foundation models.
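
The blog post and the promptbase repository carry the details, which the episode does not go into. As one illustration, the Medprompt paper describes a choice-shuffling ensemble for multiple-choice questions: ask the same question several times with the answer options in a different order each time, then take the majority answer, which reduces sensitivity to where an option appears. The sketch below is a simplified, hypothetical version of that one idea, not Microsoft's implementation. The names `choice_shuffle_vote`, `ask_model`, and `stub_model` are made up for this example, and the stub stands in for a real model call.

```python
import random
from collections import Counter
from typing import Callable, List


def choice_shuffle_vote(
    question: str,
    choices: List[str],
    ask_model: Callable[[str, List[str]], int],
    n_votes: int = 5,
    seed: int = 0,
) -> str:
    """Return the majority answer over several shuffled presentations."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_votes):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        # `ask_model` returns the index of its pick within `shuffled`.
        picked = ask_model(question, shuffled)
        votes[shuffled[picked]] += 1
    return votes.most_common(1)[0][0]


def stub_model(question: str, options: List[str]) -> int:
    """Hypothetical stand-in for an LLM call: it always 'knows' one answer
    and returns that option's position in whatever order it is shown."""
    return options.index("mitochondrion")


answer = choice_shuffle_vote(
    "Which organelle produces most of a cell's ATP?",
    ["ribosome", "mitochondrion", "nucleus", "Golgi apparatus"],
    stub_model,
)
print(answer)
```

In a real setup, `ask_model` would format the shuffled options into a prompt, call the model, and parse which option it chose; a model that leans toward a particular position gets its bias averaged out across the shuffles.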

Starting point is 00:12:23 So we won't get too deep into the technicals here, but basically what's going on is that even with the models that we have now, the right way to prompt them can produce even better results than we previously thought. Next up, Andrej linked to Mixtral 8x7B, which he called nice and clean, and this is the first but will not be the last time he references Mistral in this post. Now, of course, we've recently covered Mistral and why they've gotten the community so excited.

Starting point is 00:12:47 First of all, there was their approach to what they announced, summed up by Aaron Ng: Gemini announcement, cost Google millions of dollars, blog posts, script, video production, landing pages, can't use the good model. Oh, well. Mistral 8x7B, magnet link only, no explanation of what it is, no explanation of how to run it, might power everything soon. So basically, people were responding to the fact that Mistral released this new thing in a developer slash hacker kind of way, and also that it released the thing and that it was available to use. Now, in terms of what has people excited, we turn once again to Ethan Mollick, who writes, only about a year after the launch of ChatGPT

Starting point is 00:13:21 3.5, I now have a GPT-3.5 class AI running on my home computer that is open source, free, reasonably fast and doesn't require an internet connection. He's talking about Mixtral 8x7B. Crazy advancement in such a short time. And produces some good results, too. In another tweet, he said, I have now run one of the more powerful open source LLMs, Mistral 7B, directly on my iPhone.

Starting point is 00:13:43 No internet needed. It isn't very fast, but that is already being solved. Consider the implications. Almost anything can soon be imbued with local intelligence. A lot of possibilities. Now, of course, this is one of the big pushes right now and one of the major trends that we keep seeing over and over. On the one hand, there is, of course, a continued race for expanded capabilities in the form of

Starting point is 00:14:04 Gemini Ultra and GPT-4.5 or 5. And maybe Amazon's Olympus, who knows? But there is also a push in a different direction, which is to get more capabilities in a much smaller package that can be run on device and without having to access the cloud. Now, certainly our best guess is that that's exactly what Apple is trying to work on. And, of course, their recent M3 chip seems to be pushing in that direction. But as you can see from Ethan's tweet, it's far from just Apple that is thinking in this way. Anyways, TL;DR, the excitement around Mistral is huge, both because of Mistral itself, but also because of the larger implications, which are, in fact, in some ways, smaller implications.

Starting point is 00:14:40 I'm going to actually skip Andrej's next one, Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, for just a moment, and instead stay on this theme of small with his reference to Phi-2, 2.7B, the smallest, most impressive model. Those are his words. Now, Phi-2 is a new Microsoft model that was released in a non-commercial research version. The blog post accompanying it yesterday was called Phi-2: The Surprising Power of Small Language Models. The upshot of this whole post is in the line, on complex benchmarks, Phi-2 matches or outperforms

Starting point is 00:15:10 models up to 25x larger, thanks to new innovations in model scaling and training data curation. So what are the key insights? Microsoft writes, the massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent capabilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection. Our line of work with the

Starting point is 00:15:36 Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale. Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold. First, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on textbook quality data. Our training data mixture contains synthetic datasets specifically created to teach the model common-sense reasoning and general knowledge, including science, daily activities and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality. Secondly, we used innovative

Starting point is 00:16:12 techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence, but shows a clear boost in Phi-2 benchmark scores. So at this point, you're probably starting to see some themes. First, there is this idea of the importance of prompting and the importance of training. Second, there's the output of more advanced small models. Now, when it comes to this question of synthetic data, let's turn to one of the papers that Andrej referenced, Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. Of course, I used ChatGPT to sum this one up using the XPapers plugin. And here's

Starting point is 00:16:50 how it described it. Context: the paper discusses enhancing language models beyond the limitations of human-generated data. This is crucial because the performance of LMs is often restricted by the availability and diversity of high-quality human data. The focus is on tasks where scalar feedback is available, such as math problems where correctness can be verified. This approach aims to explore new methods of training beyond traditional human data. The authors introduce a self-training method that we're going to call ReST for short, which involves expectation maximization. This method includes generating samples from the model, filtering them using binary feedback, fine-tuning the model on these samples, and repeating the process. A key finding is that ReST significantly surpasses

Starting point is 00:17:26 the performance of models fine-tuned only on human data. This suggests a potential shift in how language models can be trained. This method could lead to more efficient and effective ways of training language models, especially in scenarios where human data is scarce or not diverse enough. The approach could be particularly useful in fields like mathematics and coding, where objective correctness can be determined and used for model training. The paper opens up new avenues for research in language model training, suggesting that exploring beyond human data can lead to significant improvements in model performance. So again, we've got these themes of improved and differentiated training approaches and small models. But then another paper that Andrej references is called

Starting point is 00:18:00 LLM360: Towards Fully Transparent Open-Source LLMs. This is a paper which, again, as summed up by XPapers, argues that most existing LLMs only release partial artifacts such as model weights or inference code. Detailed training processes and intermediate results are often not shared, limiting transparency and understanding. Because of that, the authors have introduced LLM360, which is an initiative aimed at fully open sourcing LLMs, including sharing all training code, data, model checkpoints, and intermediate results with the community. So clearly some interest in this push towards open source, and another reference buried in this paper to Mistral. Now, speaking of Mistral, Andrej's honorable mentions also referenced this, including a tweet from the co-founder of Anyscale,

Starting point is 00:18:37 which reads, function calls have been a massive gap in the open source ecosystem and the most common feature request. We benchmarked function calling on a variety of open and proprietary models, and impressively, Mistral 7B performs on par with GPT-3.5. Another honorable mention is a tweet from Mistral CEO Arthur Mensch, where, after someone had noted that Mistral was prohibiting them from using their models to train or improve other models, and that that didn't really go alongside Mistral's open source ideology, Arthur removed that section from the terms of service. Finally, a link to Perplexity, where the new Mixtral model has been added as a default model

Starting point is 00:19:09 for Perplexity Pro users. So again, we're seeing some really common themes: open source, new approaches to training and prompting, those methodologies leading to smaller, more performant models that can be used on devices without access to the internet, and, for what it's worth, a heck of a lot of references to Mistral. Now, what it all means,

Starting point is 00:19:30 and whether there's anything more that we can read into it, I'm not sure. It could just be that the area where Mistral and others like it are innovating, around open source and around small models, is something that Andrej is particularly interested in and perhaps, we're not sure, working on inside of OpenAI. In either case, like I said, given how influential Andrej is among other researchers, it's really interesting to get a snapshot of what he's paying attention to.

Starting point is 00:19:53 That's going to do it for today's AI breakdown. I appreciate you guys listening or watching as always. Until next time, peace.
