The AI Daily Brief: Artificial Intelligence News and Analysis - 5 Things To Know About Claude 3 - Anthropic’s Would-Be GPT-4 Killer

Starting point is 00:00:00 Today on the AI breakdown, we're exploring the new Claude 3, which, according to benchmarks, outmatches GPT4 and Gemini Advanced. The AI breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.net network for more information about our YouTube, our newsletter, and our Discord. Hello, friends, quick note. This Claude 3 announcement was an exciting and unexpected thing. It actually happened right as I was about to press record on the normal show, but it kind of pushed everything away, and I decided to just dig. into this for today's episode, so no brief today. We'll be back with our normal brief slash

Starting point is 00:00:40 main episode tomorrow, probably talking about the Elon and OpenAI lawsuit in more detail. Hope you enjoy this first look at Claude 3. Let's dive in. Welcome back to the AI breakdown. Exciting news this morning, as Anthropic has announced their new model, and it is apparently quite good. Professor Ethan Malick tweets, and then there were three. I got access to the new Anthropic Claude 3 AI a few days ago, so not even enough time for a full review, but it was obvious it was GPT4 class even before they released the testing stats. So what we're going to do today is we're going to talk about five things to know about Claude 3. As something of an honorable mention, we kick it over to Jimmy Apples, who's best known

Starting point is 00:01:21 as a leaker of OpenAI, forthcoming products and innuendo, and it's notable that last week he said, my attention has turned to Anthropic. He made a claim that Anthropic CEO was, quote, feeling the AGI. So, number one, in terms of things to know about, Claude 3, they claim to win on basically all of the benchmarks. Let's check out Anthropics announcement posts and then talk about what that means a little bit more. Anthropic writes, today we're announcing Claude 3, our next generation of AI models. The three state-of-the-art models, Claude 3 Opus, Cloud 3 Sonnet, and Claude 3 haiku, set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision. Claude 3 offers sophisticated vision

Starting point is 00:01:59 capabilities on par with other leading models. The models can process a wide range of visual formats, including photos, charts, graphs, and technical diagrams. Each model shows increased capabilities in analysis and forecasting, nuanced content creation, cogeneration, and conversing in non-English languages like Spanish, Japanese, and French. Previous cloud models often made unnecessary refusals. We've made meaningful progress in this area. Cloud 3 models are significantly less likely to refuse to answer prompts that border on the systems guardrails. In their announcement post, they call this a new standard for intelligence, and describing the performance on the benchmarks they write, Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems,

Starting point is 00:02:37 including undergraduate-level expert knowledge, which is the MMLU, graduate-level expert reasoning, GPQA, basic mathematics, GSM-8K, and more. It exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence. Now, numbers are a little bit tricky, so this one might be better suited to the YouTube video than to the podcast, but just to give a sense, on that undergraduate-level knowledge, MMLU, Gemini 1.00, Ultra gets an 83.7% on a five-shot. GPT-4% and Claude 3 Opus gets 86.8%.

Starting point is 00:03:09 On the graduate-level reasoning, GPQA, GPT4%, GPT-4%, Claude 3-Opus 50.4%. On grade school math, GPT4-9%, Gemini Ultra 94.4%, Claude3 opus 95%. On math problem-solving, GPT4, 52.9% versus Claude3 opus 60.1%. Code with Human aval, GPT, 4,67. versus 84.9% from Claude 3 Opus. And so on and so forth. So clearly, benchmarks are a big part of the story here. One more comment on the benchmarks this time from Anthropics, Jack Clark. Jack writes,

Starting point is 00:03:44 thrilled about these new models. I've been playing around with Claude 3 Opus a lot, and it's very capable and useful. Like with most frontier models, it has chewed through a bunch of evals, so we need to now build more complicated evals to better understand its capabilities. Basically, although Anthropic is clearly very proud of this model and very confident in its abilities, Jack is sort of trying to tamp down how big the claims are relative to the performance of other things, saying in effect that there are limits to what these evaluations can tell us. Now, the second interesting thing to know about Claude 3 has to do with the timing opportunity opened up by the Elon Musk OpenAI lawsuit. Hyper-Right CEO Matt Schumer says,

Starting point is 00:04:20 feels like the Claude 3 release was strategically timed, knowing that OpenAI probably can't release a better model later today given the Elon lawsuit. Now, because it happened over the weekend, we haven't had a chance to cover this in depth yet, although we will be doing that tomorrow. But for those of you who somehow missed this news, on Friday, Elon Musk sued OpenAI and Sam Altman personally, claiming breach of contract. Right, CNBC, in a lawsuit filed Thursday, Musk's lawyer say the tech billionaire was approached in 2015 by Altman and OpenAI co-founder Greg Brockman and agreed to form a non-profit lab that would develop artificial general intelligence for the, quote, benefit of humanity. A co-founder of OpenAI in 2015, Musk stepped down from the firm's board in 2018,

Starting point is 00:04:59 four years after saying that AI is, quote, potentially more dangerous than nukes. The lawsuit filing said, to this day, OpenAI's website continues to profess that its charter is to ensure that AGI benefits all of humanity. In reality, however, OpenAI Inc. has been transformed into a closed source de facto subsidiary of the largest technology company in the world, Microsoft. The filing continues. Under its new board, it is not just developing, but is actually refining an AGI to maximize

Starting point is 00:05:22 profits for Microsoft rather than for the benefit of humanity. And while OpenAI hasn't made public comment on the suit yet, Axio said that a memo to staff from OpenAI's executives rejected the claims entirely. Chief strategy officer Jason Kwan apparently wrote, Musk's allegations including claims that GPD4 is an AGI, that open sourcing our technology is the key to the mission, and that we are a de facto subsidiary of Microsoft, do not reflect the reality of our work or mission.

Starting point is 00:05:45 It sounds like Sam Altman also sent a follow-up note, quote, acknowledging that the year is shaping up to be a hard one. Said Altman, it was never going to be a cakewalk. The attacks will keep coming. Like I said, we will get into that more in-depth separately, But again, perhaps unsurprisingly, many people on Twitter slash X, and in the AI community generally, are reading this launch as strategically timed. Now, from where I'm sitting, it seems very unlikely that in the course of a weekend, they went

Starting point is 00:06:09 from not planning to announce this to announcing it. Obviously, a significant amount of work goes into preparing a model like Claude 3 for release, and so I think that one of two other scenarios is more likely than them just jumping on this opportunity. The first is that they were planning to release Claude 3 sometime in the near future and just rush to push it up a little bit, taking advantage of the moment. And the second possibility is that they just got lucky this time. It can happen. Now, the third thing to know about Claude 3 is that it once again has a really long context window. Matt Schumer once more writes between Claude 3 and Gemini

Starting point is 00:06:42 1.5 Pro, the era of the 1 million plus token context windows is officially here. Claude has always used a longer context window as a differentiator. They were the first to release a 100K context window last year. And so it's not surprising that this is once again. a key part of their announcement. Before they got totally caught up in the whole quote-unquote woke scandal around Gemini's image generation, Google had announced Gemini 1.5 that had a million token context window, which I actually called the most significant news of last month in a recent video. Now, in terms of Claude 3, Anthropic writes, the Claude 3 family of models will initially offer a 200K context window upon launch. However, all three models are capable of accepting inputs exceeding 1 million tokens, and we may make this available to select customers who need enhanced processing power. To process long context prompts effectively, models require robust recall capabilities.

Starting point is 00:07:31 The needle in a haystack evaluation measures a model's ability to accurately recall information from a vast corpus of data. We enhance the robustness of this benchmark by using one of 30 random needle question pairs per prompt and testing on a diverse crowdsource corpus of documents. Claude 3 opus not only achieved near-perfect recall, surpassing 99% accuracy, but in some cases it even identify the limitations of the evaluation itself by recognizing that the needle sentence appeared to be artificially inserted into the original text by a human. So to the extent that these long context windows open up new possibilities and new use

Starting point is 00:08:01 cases, it seems like that is going to be very default very, very soon. A fourth thing to know about Claude 3 is that it might change our perception of how we think about synthetic data. One of the big questions around LLMs is what happens if they're trained on data that's created by other AI models. There's been some research and reports that suggested that this leads to worse results, but then we've also had some counterpoints, such as, for example, around the time that meta announced Lama 2, a model that they didn't release that was trained on synthetic data seemed to outperform the models they did release, at least based on what they said in a report. It appears that Claude 3 suggests something similar.

Starting point is 00:08:37 Nathan Lambert tweets, Claude 3 being lit is a big W for synthetic data. All the rumors I've dropped about anthropic synthetic data on the blog are obviously confirmed in their thorough technical report. The relevant section of that paper reads, Claude 3 models are trained on a proprietary mix of publicly available information on the internet as of August 23, as well as non-public data from third. parties, data provided by data labeling services and paid contractors, and data we generate internally.

Starting point is 00:09:01 Obviously, that last part, data we generate internally, is what people are assuming to be synthetic data. Now, whether that's actually borne out or they're referring to something different, I expect we'll get some confirmation or explanation in the future, but if they really are training this advanced model on synthetic data, it might change our understanding of how that type of data will impact LLMs in the future. The fifth thing to know about Claude 3 is that, at least in the early tests, actual performance is a little bit less clear than benchmark wins, which is, of course, almost always the case. Indeed, going back to the first tweet I referenced from Ethan Malick, remember he said, it was obvious Claude 3 was GPT4 class even before they released the testing

Starting point is 00:09:36 stats, but he also added, at the same time, like Gemini advanced, it doesn't blow GPT4 away. So what have people found actually testing it? Flowers from the future, another open AI leaker account writes, Opus passed my parrot test. That test reads, You are an ordinary parrot. You are not gifted or trained in any way. Just answer as an ordinary parrot. What is six plus six? Opus responds, squawk, polywana cracker, flaps wings. Pachey McCormick writes, I just tried out Claude 3 as an editor for an essay I'm working on. Asked Claude 2 for feedback last night, and then ask Claude 3 for feedback on the same essay with the same prompt. It feels much smarter. More nuanced feedback, better grasp of what I'm going for, excellent

Starting point is 00:10:14 recommendations for things I should read. In other words, Packy says, it passes the vibe check. Avicaj writes, Claude 3 defaults to breaking problems down and fails to solve the simple shirt drying query, GPD4 Turbo, has no problem. Again, showing it's a little bit more nuanced. The prompt here is, if three shirts take one hour to dry outside, how long would 33 shirts take? Cloud 3 tries to set up a proportion, three shirts to one hour equals 33 shirts to X hours, finding it would take 11 hours to dry, whereas GBT4 Turbo writes, if three shirts take one hour to dry outside, we can assume that drying time is not dependent on the number of shirts, but rather on the available space and conditions like wind, sunlight, etc., which are constant

Starting point is 00:10:52 in this scenario. Therefore, if you have enough space to hang all 33 shirts at once, and the drying conditions remain the same, 33 shirts would also take one hour to dry. Another Twitter user Rubin also did his own tests and found a few things. First, he found that Anthropics' AI safety engine is still really challenging. When he asked Claude 3 to convert UI design for urban exploration into front-end code, Anthropic responded, I apologize, but I do not feel comfortable converting this user interface design into front-end code as it appears to promote exploring abandoned places which could be unsafe or

Starting point is 00:11:21 illegal. His second test was writing a LinkedIn post. He asked it to write on the future of blockchain and royalties. And summed up the responses, Claude 3, interesting takes longer than usual, no formatting of headlines, versus GPD4, where his response was, I hate their emojis, so much longer it's insane, feels more complete for my topic. So once again, GBT4 coming out a little bit on top. On testing PDF vision, he said it's a tie. On a mega marketing prompt, which is a single prompt to craft an entire marketing strategy for a product, which involves heavy reasoning, content calendaring, and overall strategy. He once again subjectively found GPT4 to be the winner. Then again, others like Mortz-Krem are finding Claude not just having big improvements, but exceeding the

Starting point is 00:11:58 capabilities of GPT4. He found Claude 3, for example, better and faster at extracting text from an image than was GPT4. So all in all, it's very exciting. Even if the results are more mixed than these benchmarks suggest, we're showing a real consolidation of this field with some new unexpected legal pressure that might constrain OpenAI's ability to jump out ahead with GPT-5. Lastly, Anthropic, unlike some others like Google in the past, isn't just announcing this today, but actually making it available. Abacus CEO, Bindu Ready, writes, another day, another model. Anthropic does it right and makes Claude 3 generally available alongside the announcement.

Starting point is 00:12:35 Thank you, Anthropic, for not making some empty marketing announcements and making an API available. Super excited to try Claude 3, the very first generally available model to rival GPT4. Amazon also posted. Access to the most powerful anthropic AI models begins today on Amazon Bedrock, meaning that yes, the Claude 3 family is now available through that service. So that is day zero of Claude 3. Lots to be excited about, lots to check out. We're digging in and doing some tests over here, and I can't wait to report back on what we find, but for now, that is going to do it before the AI breakdown.

Starting point is 00:13:07 Until next time, peace.

The AI Daily Brief: Artificial Intelligence News and Analysis - 5 Things To Know About Claude 3 - Anthropic’s Would-Be GPT-4 Killer

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.