The AI Daily Brief: Artificial Intelligence News and Analysis - 5 Things To Know About Claude 3 - Anthropic’s Would-Be GPT-4 Killer
Episode Date: March 5, 2024Anthropic has just released Claude 3, and according to the benchmarks, it's GPT-4 level and then some. NLW covers 5 important parts of the discussion surrounding the news. ABOUT THE AI BREAKDOWN The ...AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Discussion (0)
Today on the AI breakdown, we're exploring the new Claude 3, which, according to benchmarks, outmatches GPT4 and Gemini Advanced.
The AI breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.net network for more information about our YouTube, our newsletter, and our Discord.
Hello, friends, quick note.
This Claude 3 announcement was an exciting and unexpected thing.
It actually happened right as I was about to press record on the normal show,
but it kind of pushed everything away, and I decided to just dig.
into this for today's episode, so no brief today. We'll be back with our normal brief slash
main episode tomorrow, probably talking about the Elon and OpenAI lawsuit in more detail.
Hope you enjoy this first look at Claude 3. Let's dive in.
Welcome back to the AI breakdown. Exciting news this morning, as Anthropic has announced their
new model, and it is apparently quite good. Professor Ethan Malick tweets, and then there were
three. I got access to the new Anthropic Claude 3 AI a few days ago, so not even enough time
for a full review, but it was obvious it was GPT4 class even before they released the testing stats.
So what we're going to do today is we're going to talk about five things to know about
Claude 3. As something of an honorable mention, we kick it over to Jimmy Apples, who's best known
as a leaker of OpenAI, forthcoming products and innuendo, and it's notable that last week
he said, my attention has turned to Anthropic. He made a claim that Anthropic CEO was, quote,
feeling the AGI. So, number one, in terms of things to know about,
Claude 3, they claim to win on basically all of the benchmarks. Let's check out Anthropics
announcement posts and then talk about what that means a little bit more. Anthropic writes,
today we're announcing Claude 3, our next generation of AI models. The three state-of-the-art
models, Claude 3 Opus, Cloud 3 Sonnet, and Claude 3 haiku, set new industry benchmarks across
reasoning, math, coding, multilingual understanding, and vision. Claude 3 offers sophisticated vision
capabilities on par with other leading models. The models can process a wide range of visual
formats, including photos, charts, graphs, and technical diagrams. Each model shows increased capabilities
in analysis and forecasting, nuanced content creation, cogeneration, and conversing in non-English
languages like Spanish, Japanese, and French. Previous cloud models often made unnecessary refusals.
We've made meaningful progress in this area. Cloud 3 models are significantly less likely to refuse
to answer prompts that border on the systems guardrails. In their announcement post, they call this
a new standard for intelligence, and describing the performance on the benchmarks they write,
Opus, our most intelligent model, outperforms its peers on most of the common evaluation benchmarks for AI systems,
including undergraduate-level expert knowledge, which is the MMLU,
graduate-level expert reasoning, GPQA, basic mathematics, GSM-8K, and more.
It exhibits near-human levels of comprehension and fluency on complex tasks,
leading the frontier of general intelligence.
Now, numbers are a little bit tricky, so this one might be better suited to the YouTube video than to the podcast,
but just to give a sense, on that undergraduate-level knowledge, MMLU, Gemini 1.00,
Ultra gets an 83.7% on a five-shot.
GPT-4% and Claude 3 Opus gets 86.8%.
On the graduate-level reasoning, GPQA, GPT4%,
GPT-4%, Claude 3-Opus 50.4%.
On grade school math, GPT4-9%,
Gemini Ultra 94.4%, Claude3 opus 95%.
On math problem-solving, GPT4, 52.9% versus Claude3 opus 60.1%.
Code with Human aval, GPT, 4,67.
versus 84.9% from Claude 3 Opus. And so on and so forth. So clearly, benchmarks are a big part
of the story here. One more comment on the benchmarks this time from Anthropics, Jack Clark. Jack writes,
thrilled about these new models. I've been playing around with Claude 3 Opus a lot, and it's
very capable and useful. Like with most frontier models, it has chewed through a bunch of evals,
so we need to now build more complicated evals to better understand its capabilities.
Basically, although Anthropic is clearly very proud of this model and very confident in its
abilities, Jack is sort of trying to tamp down how big the claims are relative to the performance
of other things, saying in effect that there are limits to what these evaluations can tell us.
Now, the second interesting thing to know about Claude 3 has to do with the timing
opportunity opened up by the Elon Musk OpenAI lawsuit. Hyper-Right CEO Matt Schumer says,
feels like the Claude 3 release was strategically timed, knowing that OpenAI probably can't
release a better model later today given the Elon lawsuit. Now, because it happened over the
weekend, we haven't had a chance to cover this in depth yet, although we will be doing that
tomorrow. But for those of you who somehow missed this news, on Friday, Elon Musk sued OpenAI and Sam
Altman personally, claiming breach of contract. Right, CNBC, in a lawsuit filed Thursday, Musk's lawyer
say the tech billionaire was approached in 2015 by Altman and OpenAI co-founder Greg Brockman and
agreed to form a non-profit lab that would develop artificial general intelligence for the, quote,
benefit of humanity. A co-founder of OpenAI in 2015, Musk stepped down from the firm's board in 2018,
four years after saying that AI is, quote, potentially more dangerous than nukes.
The lawsuit filing said,
to this day, OpenAI's website continues to profess that its charter is to ensure that AGI benefits all
of humanity.
In reality, however, OpenAI Inc. has been transformed into a closed source de facto subsidiary
of the largest technology company in the world, Microsoft.
The filing continues.
Under its new board, it is not just developing, but is actually refining an AGI to maximize
profits for Microsoft rather than for the benefit of humanity.
And while OpenAI hasn't made public comment on the suit yet,
Axio said that a memo to staff from OpenAI's executives rejected the claims entirely.
Chief strategy officer Jason Kwan apparently wrote,
Musk's allegations including claims that GPD4 is an AGI,
that open sourcing our technology is the key to the mission,
and that we are a de facto subsidiary of Microsoft,
do not reflect the reality of our work or mission.
It sounds like Sam Altman also sent a follow-up note,
quote, acknowledging that the year is shaping up to be a hard one.
Said Altman, it was never going to be a cakewalk.
The attacks will keep coming.
Like I said, we will get into that more in-depth separately,
But again, perhaps unsurprisingly, many people on Twitter slash X, and in the AI community generally,
are reading this launch as strategically timed.
Now, from where I'm sitting, it seems very unlikely that in the course of a weekend, they went
from not planning to announce this to announcing it.
Obviously, a significant amount of work goes into preparing a model like Claude 3 for release,
and so I think that one of two other scenarios is more likely than them just jumping on this
opportunity.
The first is that they were planning to release Claude 3 sometime in the near future and just
rush to push it up a little bit, taking advantage of the moment. And the second possibility is that
they just got lucky this time. It can happen. Now, the third thing to know about Claude 3 is that it once
again has a really long context window. Matt Schumer once more writes between Claude 3 and Gemini
1.5 Pro, the era of the 1 million plus token context windows is officially here.
Claude has always used a longer context window as a differentiator. They were the first to release a
100K context window last year. And so it's not surprising that this is once again.
a key part of their announcement. Before they got totally caught up in the whole quote-unquote woke scandal around Gemini's image generation,
Google had announced Gemini 1.5 that had a million token context window, which I actually called the most significant news of last month in a recent video.
Now, in terms of Claude 3, Anthropic writes, the Claude 3 family of models will initially offer a 200K context window upon launch.
However, all three models are capable of accepting inputs exceeding 1 million tokens, and we may make this available to select customers who need enhanced processing power.
To process long context prompts effectively, models require robust recall capabilities.
The needle in a haystack evaluation measures a model's ability to accurately recall information
from a vast corpus of data.
We enhance the robustness of this benchmark by using one of 30 random needle question pairs
per prompt and testing on a diverse crowdsource corpus of documents.
Claude 3 opus not only achieved near-perfect recall, surpassing 99% accuracy,
but in some cases it even identify the limitations of the evaluation itself
by recognizing that the needle sentence appeared to be artificially inserted into the original text
by a human. So to the extent that these long context windows open up new possibilities and new use
cases, it seems like that is going to be very default very, very soon. A fourth thing to know about
Claude 3 is that it might change our perception of how we think about synthetic data. One of the
big questions around LLMs is what happens if they're trained on data that's created by other
AI models. There's been some research and reports that suggested that this leads to worse results,
but then we've also had some counterpoints, such as, for example, around the time that
meta announced Lama 2, a model that they didn't release that was trained on synthetic data
seemed to outperform the models they did release, at least based on what they said in a report.
It appears that Claude 3 suggests something similar.
Nathan Lambert tweets,
Claude 3 being lit is a big W for synthetic data.
All the rumors I've dropped about anthropic synthetic data on the blog are obviously confirmed
in their thorough technical report.
The relevant section of that paper reads,
Claude 3 models are trained on a proprietary mix of publicly available information on the internet
as of August 23, as well as non-public data from third.
parties, data provided by data labeling services and paid contractors, and data we generate internally.
Obviously, that last part, data we generate internally, is what people are assuming to be synthetic
data. Now, whether that's actually borne out or they're referring to something different,
I expect we'll get some confirmation or explanation in the future, but if they really are training
this advanced model on synthetic data, it might change our understanding of how that type of data
will impact LLMs in the future. The fifth thing to know about Claude 3 is that, at least in the
early tests, actual performance is a little bit less clear than benchmark wins, which is, of course,
almost always the case. Indeed, going back to the first tweet I referenced from Ethan Malick,
remember he said, it was obvious Claude 3 was GPT4 class even before they released the testing
stats, but he also added, at the same time, like Gemini advanced, it doesn't blow GPT4 away.
So what have people found actually testing it? Flowers from the future, another open AI leaker
account writes, Opus passed my parrot test. That test reads,
You are an ordinary parrot. You are not gifted or trained in any way. Just answer as an ordinary
parrot. What is six plus six? Opus responds, squawk, polywana cracker, flaps wings. Pachey McCormick writes,
I just tried out Claude 3 as an editor for an essay I'm working on. Asked Claude 2 for
feedback last night, and then ask Claude 3 for feedback on the same essay with the same prompt.
It feels much smarter. More nuanced feedback, better grasp of what I'm going for, excellent
recommendations for things I should read. In other words, Packy says, it passes the vibe check.
Avicaj writes, Claude 3 defaults to breaking problems down and fails to solve the simple
shirt drying query, GPD4 Turbo, has no problem. Again, showing it's a little bit more nuanced.
The prompt here is, if three shirts take one hour to dry outside, how long would 33 shirts
take? Cloud 3 tries to set up a proportion, three shirts to one hour equals 33 shirts to
X hours, finding it would take 11 hours to dry, whereas GBT4 Turbo writes, if three shirts
take one hour to dry outside, we can assume that drying time is not dependent on the number of
shirts, but rather on the available space and conditions like wind, sunlight, etc., which are constant
in this scenario.
Therefore, if you have enough space to hang all 33 shirts at once, and the drying conditions
remain the same, 33 shirts would also take one hour to dry.
Another Twitter user Rubin also did his own tests and found a few things.
First, he found that Anthropics' AI safety engine is still really challenging.
When he asked Claude 3 to convert UI design for urban exploration into front-end code,
Anthropic responded, I apologize, but I do not feel comfortable converting this user interface
design into front-end code as it appears to promote exploring abandoned places which could be unsafe or
illegal. His second test was writing a LinkedIn post. He asked it to write on the future of blockchain
and royalties. And summed up the responses, Claude 3, interesting takes longer than usual, no
formatting of headlines, versus GPD4, where his response was, I hate their emojis, so much longer
it's insane, feels more complete for my topic. So once again, GBT4 coming out a little bit on top.
On testing PDF vision, he said it's a tie. On a mega marketing prompt, which is a single prompt to craft
an entire marketing strategy for a product, which involves heavy reasoning, content calendaring,
and overall strategy. He once again subjectively found GPT4 to be the winner. Then again,
others like Mortz-Krem are finding Claude not just having big improvements, but exceeding the
capabilities of GPT4. He found Claude 3, for example, better and faster at extracting text
from an image than was GPT4. So all in all, it's very exciting. Even if the results are more
mixed than these benchmarks suggest, we're showing a real consolidation of this field with some new
unexpected legal pressure that might constrain OpenAI's ability to jump out ahead with GPT-5.
Lastly, Anthropic, unlike some others like Google in the past, isn't just announcing this today,
but actually making it available.
Abacus CEO, Bindu Ready, writes, another day, another model.
Anthropic does it right and makes Claude 3 generally available alongside the announcement.
Thank you, Anthropic, for not making some empty marketing announcements and making an API
available.
Super excited to try Claude 3, the very first generally available model to rival GPT4.
Amazon also posted. Access to the most powerful anthropic AI models begins today on Amazon Bedrock,
meaning that yes, the Claude 3 family is now available through that service.
So that is day zero of Claude 3. Lots to be excited about, lots to check out.
We're digging in and doing some tests over here, and I can't wait to report back on what we find,
but for now, that is going to do it before the AI breakdown.
Until next time, peace.
