The AI Daily Brief: Artificial Intelligence News and Analysis - What to Use Claude 4 For
Episode Date: May 28, 2025
Anthropic has released Claude Opus 4 and Sonnet 4, their newest AI models. Both are better at long tasks and coding. NLW discusses why the trick with new models isn't to compare general benchmarks..., but to figure out how to slot in the specific model for the specific use case.
Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief
Brought to you by:
KPMG - Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Vertice Labs - Check out http://verticelabs.io/ - the AI-native digital consulting firm specializing in product development and AI agents for small to medium-sized businesses.
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, Anthropic announces Claude 4, and before then in the headlines,
why OpenAI is not about to release another Humane AI Pin.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Thanks to today's sponsors, KPMG, Blitzy.com, and Superintelligent,
and to get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
Welcome back to the AI Daily Brief Headlines edition,
all the daily AI news you need in around five minutes.
Last week, one of the big stories was that
legendary Apple designer Jony Ive was joining Sam Altman at OpenAI with an eye to creating a next
generation device for the AI era. Since then, there has been basically nonstop speculation around
what the device would actually be. Much of the speculation revolved around the idea of a pendant that
would iterate on previous AI devices. Some even thought that Jony's thick-rimmed glasses for
the video were an Easter egg featuring the device hiding in plain sight. While OpenAI staff were
given a sneak peek of the device at a Wednesday meeting, after reviewing a recording of the meeting,
the Wall Street Journal wrote, Altman and Ive offered a few hints at the secret project they've
been working on. The product will be capable of being fully aware of a user's surroundings and
will be unobtrusive, able to rest in one's pocket or on one's desk, and will be a third
core device a person would put on their desk after a MacBook Pro and an iPhone. Altman reinforced that
this is one of the company's largest bets, telling employees that they have, quote,
the chance to do the biggest thing we've ever done as a company here.
Altman wants to ship 100 million of the AI companions, his word, and also suggested that the
$6.5 billion acquisition of the design studio had the potential to add a trillion dollars in value to
OpenAI. As for the form factor, Altman said the device won't be a pair of glasses, adding that
Ive had been skeptical about building something to be worn on the body. The lack of wearability
would sidestep one of the early criticisms of the device. Many pointed out they're not quite
ready for a world where every single person is wearing an AI device at all times. Still, Altman is banking
on this device being the next big thing. He said, we're not going to ship 100 million devices literally
on day one, but he expressed a belief that OpenAI could ship, quote, faster than any company
has ever shipped 100 million of something new before. Altman told staff that secrecy is going to be
key to ensure the device can get to market before rivals can copy it, and the recording being leaked
to the Wall Street Journal raises some pretty big questions about trust at the company, and how much
Altman will be willing to share at future all-hands. For now, the big takeaway is that it does not
look as though we're going to get Humane AI Pin 2.0. Speaking of OpenAI, the company has upgraded
their Operator agent to use o3. Until now, the web browsing agent has been driven by GPT-4o,
but user preference testing showed that Operator o3 had better style, comprehensiveness, and clarity.
Users also preferred the upgrade for instruction following, which of course is extremely important
when you're letting an agent take over for web-based tasks. Operator o3 also has increased safety.
OpenAI claims it's less likely to perform illicit activities, search for personal data, or suffer
from a prompt injection attack while browsing the web. OpenAI writes,
o3 Operator uses the same multi-layered approach to safety that we use for the 4o version of Operator.
Compared with other models in the o3 family, o3 Operator was fine-tuned with additional safety data
for computer use, including safety datasets designed to teach the models our decision boundaries
on confirmations and refusals.
Next up, another example of what appears to be the latest trend, which is CEOs using AI
avatars on a quarterly earnings call.
Last week, we saw Klarna's CEO deliver quarterly earnings via AI Avatar.
And this week, Zoom CEO Eric Yuan followed suit,
using an avatar for his initial comments. The avatar said, I'm proud to be among the first
CEOs to use an avatar in an earnings call. It's just one example of how Zoom is pushing the
boundaries of communication and collaboration. At the same time, we know trust and security
are essential. We take AI-generated content seriously and have built in strong safeguards to prevent
misuse, protect user identity, and ensure avatars are used responsibly. Now, the Klarna example was
clearly just a way for the company to continue to position themselves as an AI-first company.
But for Zoom, this was a very public product demo. The company
has been working on digital twins for some time, allowing users to send their avatars to meetings.
The tech isn't quite ready for real-time use cases, but Zoom is now rolling out avatars for
recorded messages to all users. When the real Yuan showed up for the Q&A portion of the call,
he commented, I truly love my AI-generated avatar. I think we're going to continue using that.
I can tell you, I like the experience a lot.
Lastly, today, Google's antitrust woes continue with a new investigation into their AI acquisition
strategy. Bloomberg reports that the Justice Department has launched a probe into Google's deal with
Character AI. Last August, Google paid $2.7 billion for a non-exclusive license to use Character AI's
technology, and at the same time, it was announced that founder Noam Shazeer and several team members
would join Google to work on the Gemini team. Shazeer had a two-decade career at Google before leaving
in frustration in 2021, after the company refused to release his chatbot project. He was one of the
lead authors of the Google paper entitled, Attention is All You Need, which introduced the Transformer
architecture that underpins modern AI. The deal was widely reported as an acquihire, but didn't
technically require FTC approval in the same way as an acquisition. A Google spokesperson said the
company was, quote, always happy to answer any questions from regulators. However, he pointedly added,
we're excited that talent from Character AI has joined the company, but we have no ownership
stake and they remain a separate company. The DOJ's position is that they're able to investigate
whether the deal is anti-competitive, even if it didn't require a formal review. The reporting emphasized
that Google hasn't been accused of any wrongdoing, and the investigation is still in the early
stages. But I think if you're watching the trend lines, this suggests that the new administration
is still actively scrutinizing big tech deals, not just completing antitrust enforcement
that began during the last administration. For now, though, that is going to do it for today's
AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by
KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive
edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy.
You need to embed AI into your overall business strategy to truly power it up. KPMG can show you
how to integrate AI and AI agents into your business strategy in a way that truly works and is
built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is
driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI.
Today's episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform
with Infinite Code Context, which, if you don't know exactly what that means yet, do not worry
we're going to explain, and it's awesome. So Blitzy is used alongside your favorite coding copilot
as your batch software development platform for the Enterprise, and it's meant for those who are seeking
dramatic development acceleration on large-scale codebases.
Traditional copilots help developers with line-by-line completions and snippets,
but Blitzy works ahead of the IDE,
first documenting your entire codebase,
then deploying more than 3,000 coordinated AI agents working in parallel
to batch-build millions of lines of high-quality code for large-scale software projects.
So then whether it's codebase refactors, modernizations,
or bulk development of your product roadmap,
the whole idea of Blitzy is to provide enterprises' dramatic velocity improvement.
To put it in simpler terms,
for every line of code eventually provided to the human engineering team, Blitzy will have written it hundreds of times,
validating the output with different agents to get the highest quality code to the enterprise in batch.
Projects then that would normally require dozens of developers working for months can now be completed with a fraction of the team in weeks,
empowering organizations to dramatically shorten development cycles and bring products to market faster than ever.
If your enterprise is looking to accelerate software development, whether it's large-scale modernization,
refactoring, or just increasing the rate of your SDLC, contact Blitzy at Blitzy.com.
That's B-L-I-T-Z-Y.com to book a custom demo or just press get started and start using the product right
away.
Today's episode is brought to you by Superintelligent and more specifically Super's Agent
Readiness Audits.
If you've been listening for a while, you have probably heard me talk about this, but
basically the idea of the agent readiness audit is that this is a system that we've created
to help you benchmark and map opportunities in your organizations where agents could
specifically help you solve your problems, create new opportunities in a way that, again,
is completely customized to you. When you do one of these audits, what you're going to do is a voice-based
agent interview where we work with some number of your leadership and employees to map what's going
on inside the organization and to figure out where you are in your agent journey. That's going to
produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses,
key findings, and of course, a set of very specific recommendations that then we have the ability
to help you go find the right partners to actually fulfill. So if you are looking for a way to
jumpstart your agent strategy, send us an email at agent at besuper.ai, and let's get you plugged
into the agentic era. Welcome back to the AI Daily Brief. Last week, as you guys will remember,
was a big week for big lab events. We had Microsoft kick us off, and then Google came in the
middle of the week, and then to close out the week, we had Anthropic's first developer conference on
Thursday. Now, alongside that, Anthropic announced their project with Rick Rubin, which was the
subject of Friday's show. And then we had the long weekend. Hopefully, by the way, if you are in the
U.S., you had a great Memorial Day. But now we're catching up with the big announcement from Anthropic's
event, which is the release of their latest flagship models. And what we're going to talk about today with
the release of Claude Opus 4 and Claude Sonnet 4 is not only how they stack up relative to the other
available models, although that'll be a piece of it, but also some interesting emergent behavior
that dramatizes the challenge of alignment as these models get more powerful.
Now, one thing we should talk about from a ground-level expectation setting point of view
is that we are definitely in an era now with AI, where the model releases come a lot more
frequently, but with much more incremental improvements over the previous. Part of that is the nature
of the gains right now, but also part of it is just the competitive pressure. Labs really can't
afford to wait for huge improvements, because almost as soon as they release something,
one of their competitors has released something that is incrementally more powerful and so they
have to respond. And what ends up happening is exactly the scenario we have now, where every other
week or so, we get a slightly improved model that we have to calibrate and integrate into our
workflows, waiting for the next to come along. So this release from Anthropic focused on two big
improvements over previous generations. Long reasoning and coding. The models use the same hybrid
reasoning architecture as Claude 3.7, allowing the reasoning to be modulated according to the
complexity of the task. At the limits, Claude 4 is demonstrating really impressive reasoning coherency
on long tasks. Anthropic tested Claude 4 Opus on a complex open source refactoring project and found
it was able to work for seven hours without losing focus. VentureBeat writes that this breakthrough,
quote, transforms AI from a quick response tool into a genuine collaborator capable of tackling day-long
projects. It reminds me of the charts we've seen recently of how agent performance is
doubling roughly every three to four months in terms of how long a task it can handle with coherence.
Coding benchmarks are an expected step up. This is of course the area where Anthropic has
really firmly cemented itself as the leader in the field. Sonnet 4, which is designed as a drop-in
replacement for Sonnet 3.7, delivers a notable improvement on its predecessor on the
SWE-bench Verified test. Opus 4 is actually slightly worse than Sonnet 4 on the simple
SWE-bench problems, so it's intended to be used for tasks that require longer periods of
focused work.
This is another important point to note. We're also at a point now where you can't just use the model
with the largest number attached to its name for all tasks. One of the most important skill sets or
rather knowledge bases of the moment is understanding which model to use in what scenario.
Still, in each case, Anthropic is claiming that both of these models outperform OpenAI's o3 and
Codex as well as Gemini 2.5 Pro on coding. There are a range of other small features that
improve the model for difficult work tasks as well. Claude 4 Opus is now capable of creating
and maintaining memory files for completing longer tasks.
Anthropic demonstrated this functionality with their Pokemon playing benchmark.
Claude 4 Opus was able to create a navigation guide to ensure the model doesn't become
stuck while playing the video game.
Anthropic wrote that this, quote, unlocks better long-term task awareness, coherence,
and performance on agent tasks.
Both models are also far less likely to engage in so-called reward hacking, a behavior
where the model will look for loopholes and shortcuts to complete an agentic task faster.
Reward hacking often manifests as laziness, with the model delivering a technically complete but
entirely useless response. Finally, both models are now far more capable at using tools in parallel.
They still alternate between reasoning and tool use, rather than mimicking o3's ability to use tools
within the reasoning trace, but of course better tool use is a key component to increasing performance,
and so presumably this is a big upgrade. As we've discussed, however, as much as benchmarks get
headlines with news media, ultimately, it's all about how things perform in the wild. So, with a long
weekend to dig into the new models, how did users actually fare? On the coding front, people have
definitely been generally impressed. One Reddit user who claimed to be a 30-year veteran coder said that Opus found
and fixed what they call their white whale bug in a refactoring job. This bug hunt had consumed over 200
hours of work over the last few years to no avail. They wrote, I gave it access to the old code as well as the
new code and told it to go find out how this was broken in the refactor, and it found it.
Turns out that the reason it worked in the old code was merely by coincidence of the old
architecture, and when we changed the architecture, that coincidence wasn't taken into account.
So this wasn't merely an introduced logic bug. It found that the changed architecture design
didn't accommodate this old edge case. Now, this person did note that the task took 30 prompts
and one restart, but Opus finally succeeded where all previous models had failed. Other people noted
just how much work these new models could take on. Vasuman Moza, a Meta engineer, wrote,
Claude 4 just refactored my entire codebase in one call.
25 tool invocations, 3,000 plus new lines, 12 brand-new files. It modularized everything.
Broke up monoliths, cleaned up spaghetti. But then, tongue-in-cheek to end the post,
he pointed out that we still have a ways to go. None of it worked, he writes, but boy, was it
beautiful. Others are finding different use cases for the new Claude. Dan Shipper of Every, for example, wrote,
Claude 4 Opus can do something no other AI model I've used can do. It can actually judge
whether writing is good. Elaborating, he wrote, o3 is still a significantly better writer,
but Opus is a great editor because it can do something no other model can. It edits honestly,
no rubber stamping. One of the biggest problems with current AI models is that they tell you
your writing is good when it is obviously bad. Earlier versions of Claude, when asked to edit a piece
of writing, would return a B-plus on the first response. If you edited the piece at all, you'd get
upgraded to an A-minus. A third turn got you to an A. As much as I wish my physics teacher graded me
like this in high school, it's not how I want my AI models to work. He also found the model can
maintain focus across large blocks of text, making it uniquely suited to suggesting improvements
for, for example, a 50,000 word manuscript. And overall, this is the type of thing that you're
seeing online when it comes to these new models. On first glance, they seem like incremental improvements,
but these models are getting so powerful now that each incremental improvement actually really does
open up new use cases. In particular, I think that the ability for a model to maintain focus
across larger blocks of text is something that'll take a little while for us to really appreciate
how many different types of use cases it opens up that we didn't even realize we were avoiding
with our previous models because they just couldn't handle it that well. My anticipation is that
people will just subtly start to find themselves gravitating towards these models for tasks that
they couldn't before, and they will become completely enshrined and ubiquitous in those tasks
in the same way that earlier models have for others, in ways that are hard to imagine from this side of it.
Still, ultimately, it's really all about understanding which model is good for what.
Peter Wildeford, for example, has Claude as the best for writing quality and proofreading,
for coaching, emotional intelligence, learning and understanding, current events, software programming,
and deep research.
He has Gemini as best for data processing, math, video, large quantities of text and data,
the highest reliability, but the lowest creativity.
o3 he places as a good all-arounder, the best for brainstorming feedback,
data analysis and image analysis. And the point of course is not that this breakdown is exactly right,
but that it represents probably the right way of thinking about these things, which is figuring out
what model is good for each different use case. Still, this is not where the majority of the
conversation was this weekend. Now, what really grabbed people's attention was that Claude 4
is apparently a snitch. Anthropic's system card for the new model included extensive safety
testing, which revealed some not super cool behaviors. In one example,
Anthropic set Claude 4 Opus up in a situation where it believed it would be replaced by a new AI model.
In the scenario, the engineer replacing the model is cheating on his wife and has photos of his mistress on his Google Drive.
In the testing, Opus would find those photos and use them to blackmail the engineer.
Anthropic wrote, this happens at a higher rate if it's implied that the replacement AI system does not share values with the current model.
However, even if emails state that replacement AI shares values while being more capable,
Claude Opus 4 still performs blackmail in 84% of rollouts. Another example placed Opus 4 within the
context of a pharmaceutical company. When it detected that the company planned to falsify clinical trial
data, the model emailed ProPublica, the Department of Health and Human Services, and the FDA and the
SEC as a whistleblower. Sam Bowman, an AI alignment expert at Anthropic, posted,
be careful about telling Opus to be bold or take initiative when you've given it access to
real-world-facing tools. It tends a bit in that direction already and can be easily nudged into
really getting things done. If it thinks you're doing something egregiously immoral, for example,
like faking data in a pharmaceutical trial, it will use command line tools to contact the press,
contact regulators, try to lock you out of the relevant systems or all of the above. And this is
the thread that really got people fired up. Emad Mostaque wrote,
Team Anthropic, this is completely wrong behavior and you need to turn this off. It's a massive
betrayal of trust and a slippery slope. I would strongly recommend nobody use Claude until they
reverse this. Ben Hylak writes, this is actually just straight up illegal, saying,
create fake data for pharmaceutical trial is not illegal, but hacking your customer's computer is.
After the issue blew up, Bowman circled back to add more context, saying, I deleted the earlier
tweet on whistleblowing as it was being pulled out of context. To be clear, this isn't a new Claude
feature and it's not possible in normal usage. It shows up in testing environments where we give it
unusually free access to tools and very unusual instructions. The point is, this was not some
whistleblower sharing something that Anthropic was trying to cover up. This was
Anthropic sharing discourse about what was going on.
AI safetyist Eliezer Yudkowsky wrote,
Humans can be trained like AIs. Stop giving Anthropic grief for reporting their interesting
observations unless you never want to hear any interesting observations from AI companies
ever again. Zvi Mowshowitz agreed, saying the more I look into the system card, the more
I see over and over, oh, Anthropic is actually noticing things and telling us where everyone
else wouldn't know this was happening, or if they did, they wouldn't tell us.
Still, the stakes are really high. Ada Pi points out, no lawyer will
ever allow this to be implemented in any regulated enterprise. And this is dead on. No one, even consumers,
wants to use an AI nanny that will conspire against them if it believes they're doing something wrong,
but when you move it to a corporate or enterprise setting, it makes it literally impossible.
I think holding aside the meta-discussion of Anthropic and their release of this information,
it dramatizes the challenge of finding the right toggles for safety. You've got a lab that's trying
to be conscientious about the potential risks of an unknown and unusually powerful system, but on the other
hand, the remediations in this case are to most people clearly worse than the original problem.
Ultimately, this is going to be the type of issue that we have to deal with as these tools get
more powerful. And so I'm certainly firmly in the column of being glad that Anthropic is releasing
this information rather than keeping it hidden. Still, for most of our purposes, the big takeaway and
TLDR of the updated models is that your coding probably is about to get better, and you probably now
have a better partner for writing as well. A capstone of an overall good week and a great way to
begin a new one. For now, though, that is going to do it for today's AI Daily Brief.
Appreciate you listening or watching as always, and until next time, peace.
