The AI Daily Brief: Artificial Intelligence News and Analysis - What to Use Claude 4 For

Starting point is 00:00:00 Today on the AI Daily Brief, Anthropic announces Claude 4, and before then in the headlines, why OpenAI is not about to release another Humane AI Pin. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Thanks to today's sponsors, KPMG, Blitzy.com, and Superintelligent, and to get an ad-free version of the show, go to patreon.com slash AI Daily Brief. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Last week, one of the big stories was that

Starting point is 00:00:31 legendary Apple designer Jony Ive was joining Sam Altman at OpenAI with an eye to creating a next-generation device for the AI era. Since then, there has been basically nonstop speculation around what the device would actually be. Much of the speculation revolved around the idea of a pendant that would iterate on previous AI devices. Some even thought that Jony's thick-rimmed glasses for the video were an Easter egg featuring the device hiding in plain sight. While OpenAI staff were given a sneak peek of the device design at a Wednesday meeting, after reviewing a recording of the meeting, the Wall Street Journal wrote, Altman and Ive offered a few hints at the secret project they've been working on. The product will be capable of being fully aware of a user's surroundings and

Starting point is 00:01:09 will be unobtrusive, able to rest in one's pocket or on one's desk, and will be a third core device a person would put on their desk after a MacBook Pro and an iPhone. Altman reinforced that this is one of the company's largest bets, telling employees that they have, quote, the chance to do the biggest thing we've ever done as a company here. Altman wants to ship 100 million of the AI companions, his word, and also suggested that the $6.5 billion acquisition of the design studio had the potential to add a trillion dollars in value to OpenAI. As for the form factor, Altman said the device won't be a pair of glasses, adding that Ive had been skeptical about building something to be worn on the body. The lack of wearability

Starting point is 00:01:45 would sidestep one of the early criticisms of the device. Many pointed out they're not quite ready for a world where every single person is wearing an AI device at all times. Still, Altman is banking on this device being the next big thing. He said, we're not going to ship 100 million devices literally on day one, but he expressed a belief that OpenAI could ship, quote, faster than any company has ever shipped 100 million of something new before. Altman told staff that secrecy is going to be key to ensure the device can get to market before rivals can copy it, and the recording being leaked to the Wall Street Journal raises some pretty big questions about trust at the company, and how much Altman will be willing to share at future all-hands meetings. For now, the big takeaway is that it does not

Starting point is 00:02:21 look as though we're going to get Humane AI Pin 2.0. Speaking of OpenAI, the company has upgraded their Operator agent to use o3. Until now, the web browsing agent has been driven by GPT-4o, but user preference testing showed that Operator o3 had better style, comprehensiveness, and clarity. Users also preferred the upgrade for instruction following, which of course is extremely important when you're letting an agent take over for web-based tasks. Operator o3 also has increased safety. OpenAI claims it's less likely to perform illicit activities, search for personal data, or suffer from a prompt injection attack while browsing the web. OpenAI writes, o3 Operator uses the same multi-layered approach to safety that we use for the 4o version of Operator.

Starting point is 00:03:02 Compared with other models in the o3 family, o3 Operator was fine-tuned with additional safety data for computer use, including safety datasets designed to teach the models our decision boundaries on confirmations and refusals. Next up, another example of what appears to be the latest trend, which is CEOs using AI avatars on quarterly earnings calls. Last week, we saw Klarna's CEO deliver quarterly earnings via AI avatar. And this week, Zoom CEO Eric Yuan followed suit, using an avatar for his initial comments. The avatar said, I'm proud to be among the first

Starting point is 00:03:31 CEOs to use an avatar in an earnings call. It's just one example of how Zoom is pushing the boundaries of communication and collaboration. At the same time, we know trust and security are essential. We take AI-generated content seriously and have built in strong safeguards to prevent misuse, protect user identity, and ensure avatars are used responsibly. Now, the Klarna example was clearly just a way for the company to continue to position themselves as an AI-first company. But for Zoom, this was a very public product demo. The company has been working on digital twins for some time, allowing users to send their avatars to meetings. The tech isn't quite ready for real-time use cases, but Zoom is now rolling out avatars for

Starting point is 00:04:05 recorded messages to all users. When the real Yuan showed up for the Q&A portion of the call, he commented, I truly love my AI-generated avatar. I think we're going to continue using that. I can tell you, I like the experience a lot. Lastly, today, Google's antitrust woes continue with a new investigation into their AI acquisition strategy. Bloomberg reports that the Justice Department has launched a probe into Google's deal with Character AI. Last August, Google paid $2.7 billion for a non-exclusive license to use Character AI's technology, and at the same time, it was announced that founder Noam Shazeer and several team members would join Google to work on the Gemini team. Shazeer had a two-decade career at Google before leaving

Starting point is 00:04:42 in frustration in 2021, after the company refused to release his chatbot project. He was one of the lead authors of the Google paper entitled Attention Is All You Need, which introduced the Transformer architecture that underpins modern AI. The deal was widely reported as an acquihire, but didn't technically require FTC approval in the same way as an acquisition. A Google spokesperson said the company was, quote, always happy to answer any questions from regulators. However, he pointedly added, we're excited that talent from Character AI has joined the company, but we have no ownership stake and they remain a separate company. The DOJ's position is that they're able to investigate whether the deal is anti-competitive, even if it didn't require a formal review. The reporting emphasized

Starting point is 00:05:20 that Google hasn't been accused of any wrongdoing, and the investigation is still in the early stages. But I think if you're watching the trend lines, this suggests that the new administration is still actively scrutinizing big tech deals, not just completing antitrust enforcement that began during the last administration. For now, though, that is going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you

Starting point is 00:06:00 how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI. Today's episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context, which, if you don't know exactly what that means yet, do not worry, we're going to explain, and it's awesome. So Blitzy is used alongside your favorite coding copilot as your batch software development platform for the enterprise, and it's meant for those who are seeking dramatic development acceleration on large-scale codebases.

Starting point is 00:06:44 Traditional copilots help developers with line-by-line completions and snippets, but Blitzy works ahead of the IDE, first documenting your entire codebase, then deploying more than 3,000 coordinated AI agents working in parallel to batch-build millions of lines of high-quality code for large-scale software projects. So then whether it's codebase refactors, modernizations, or bulk development of your product roadmap, the whole idea of Blitzy is to provide enterprises dramatic velocity improvement.

Starting point is 00:07:08 To put it in simpler terms, for every line of code eventually provided to the human engineering team, Blitzy will have written it hundreds of times, validating the output with different agents to get the highest quality code to the enterprise in batch. Projects then that would normally require dozens of developers working for months can now be completed with a fraction of the team in weeks, empowering organizations to dramatically shorten development cycles and bring products to market faster than ever. If your enterprise is looking to accelerate software development, whether it's large-scale modernization, refactoring, or just increasing the rate of your SDLC, contact Blitzy at Blitzy.com, that's B-L-I-T-Z-Y.com, to book a custom demo or just press get started and start using the product right

Starting point is 00:07:46 away. Today's episode is brought to you by Superintelligent and more specifically Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the agent readiness audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could specifically help you solve your problems and create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based

Starting point is 00:08:17 agent interview where we work with some number of your leadership and employees to map what's going on inside the organization and to figure out where you are in your agent journey. That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent at besuper.ai, and let's get you plugged into the agentic era. Welcome back to the AI Daily Brief. Last week, as you guys will remember, was a big week for big lab events. We had Microsoft kick us off, and then Google came in the

Starting point is 00:09:03 middle of the week, and then to close out the week, we had Anthropic's first developer conference on Thursday. Now, alongside that, Anthropic announced their project with Rick Rubin, which was the subject of Friday's show. And then we had the long weekend. Hopefully, by the way, if you are in the U.S., you had a great Memorial Day. But now we're catching up with the big announcement from Anthropic's event, which is the release of their latest flagship models. And what we're going to talk about today with the release of Claude Opus 4 and Claude Sonnet 4 is not only how they stack up relative to the other available models, although that'll be a piece of it, but also some interesting emergent behavior that dramatizes the challenge of alignment as these models get more powerful.

Starting point is 00:09:41 Now, one thing we should talk about from a ground-level expectation setting point of view is that we are definitely in an era now with AI, where the model releases come a lot more frequently, but with much more incremental improvements over the previous. Part of that is the nature of the gains right now, but also part of it is just the competitive pressure. Labs really can't afford to wait for huge improvements, because almost as soon as they release something, one of their competitors has released something that is incrementally more powerful and so they have to respond. And what ends up happening is exactly the scenario we have now, where every other week or so, we get a slightly improved model that we have to calibrate and integrate into our

Starting point is 00:10:18 workflows, waiting for the next to come along. So this release from Anthropic focused on two big improvements over previous generations: long reasoning and coding. The models use the same hybrid reasoning architecture as Claude 3.7, allowing the reasoning to be modulated according to the complexity of the task. At the limits, Claude 4 is demonstrating really impressive reasoning coherency on long tasks. Anthropic tested Claude 4 Opus on a complex open source refactoring project and found it was able to work for seven hours without losing focus. VentureBeat writes that this breakthrough, quote, transforms AI from a quick response tool into a genuine collaborator capable of tackling day-long projects. It reminds me of the charts we've seen recently of how agent performance is

Starting point is 00:10:59 doubling roughly every three to four months in terms of how long a task it can handle with coherence. Coding benchmarks are an expected step up. This is of course the area where Anthropic has really firmly cemented itself as the leader in the field. Sonnet 4, which is designed as a drop-in replacement for Sonnet 3.7, delivers a notable improvement on its predecessor on the SWE-bench Verified test. Opus 4 is actually slightly worse than Sonnet 4 on the simple SWE-bench problems, so it's intended to be used for tasks that require longer periods of focused work. This is another important point to note. We're also at a point now where you can't just use the model

Starting point is 00:11:34 with the largest number attached to its name for all tasks. One of the most important skill sets or rather knowledge bases of the moment is understanding which model to use in what scenario. Still, in each case, Anthropic is claiming that both of these models outperform OpenAI's o3 and Codex as well as Gemini 2.5 Pro on coding. There is a range of other small features that improve the model for difficult work tasks as well. Claude 4 Opus is now capable of creating and maintaining memory files for completing longer tasks. Anthropic demonstrated this functionality with their Pokémon-playing benchmark. Claude 4 Opus was able to create a navigation guide to ensure the model doesn't become

Starting point is 00:12:08 stuck while playing the video game. Anthropic wrote that this, quote, unlocks better long-term task awareness, coherence, and performance on agent tasks. Both models are also far less likely to engage in so-called reward hacking, a behavior where the model will look for loopholes and shortcuts to complete an agentic task faster. Reward hacking often manifests as laziness, with the model delivering a technically complete but entirely useless response. Finally, both models are now far more capable at using tools in parallel. They still alternate between reasoning and tool use, rather than mimicking o3's ability to use tools

Starting point is 00:12:40 within the reasoning trace, but of course better tool use is a key component to increasing performance, and so presumably this is a big upgrade. As we've discussed, however, as much as benchmarks get headlines with news media, ultimately, it's all about how things perform in the wild. So, with a long weekend to dig into the new models, how did users actually fare? On the coding front, people have definitely been generally impressed. One Reddit user who claimed to be a 30-year veteran coder said that Opus found and fixed what they call their white whale bug in a refactoring job. This bug hunt had consumed over 200 hours of work over the last few years to no avail. They wrote, I gave it access to the old code as well as the new code and told it to go find out how this was broken in the refactor, and it found it.

Starting point is 00:13:25 Turns out that the reason it worked in the old code was merely by coincidence of the old architecture, and when we changed the architecture, that coincidence wasn't taken into account. So this wasn't merely an introduced logic bug. It found that the changed architecture design didn't accommodate this old edge case. Now, this person did note that the task took 30 prompts and one restart, but Opus finally succeeded where all previous models had failed. Other people noted just how much work these new models could take on. Vasuman Moza, a Meta engineer, wrote, Claude 4 just refactored my entire codebase in one call. 25 tool invocations, 3,000 plus new lines, 12 brand-new files. It modularized everything.

Starting point is 00:14:05 Broke up monoliths, cleaned up spaghetti. But then, tongue-in-cheek to end the post, he pointed out that we still have a ways to go. None of it worked, he writes, but boy, was it beautiful. Others are finding different use cases for the new Claude. Dan Shipper of Every, for example, wrote, Claude 4 Opus can do something no other AI model I've used can do. It can actually judge whether writing is good. Elaborating, he wrote, o3 is still a significantly better writer, but Opus is a great editor because it can do something no other model can. It edits honestly, no rubber stamping. One of the biggest problems with current AI models is that they tell you your writing is good when it is obviously bad. Earlier versions of Claude, when asked to edit a piece

Starting point is 00:14:44 of writing, would return a B-plus on the first response. If you edited the piece at all, you'd get upgraded to an A-minus. A third turn got you to an A. As much as I wish my physics teacher graded me like this in high school, it's not how I want my AI models to work. He also found the model can maintain focus across large blocks of text, making it uniquely suited to suggesting improvements for, for example, a 50,000 word manuscript. And overall, this is the type of thing that you're seeing online when it comes to these new models. On first glance, they seem like incremental improvements, but these models are getting so powerful now that each incremental improvement actually really does open up new use cases. In particular, I think that the ability for a model to maintain focus

Starting point is 00:15:24 across larger blocks of text is something that'll take a little while for us to really appreciate how many different types of use cases it opens up that we didn't even realize we were avoiding with our previous models because they just couldn't handle it that well. My anticipation is that people will just subtly start to find themselves gravitating towards these models for tasks that they couldn't before, and they will become completely enshrined and ubiquitous in those tasks in the same way that earlier models have for others, in ways that are hard to imagine from this side of it. Still, ultimately, it's really all about understanding which model is good for what. Peter Wildeford, for example, has Claude as the best for writing quality and proofreading,

Starting point is 00:16:01 for coaching, emotional intelligence, learning and understanding, current events, software programming, and deep research. He has Gemini as best for data processing, math, video, large quantities of text and data, the highest reliability, but the lowest creativity. o3 he places as a good all-arounder, the best for brainstorming, feedback, data analysis, and image analysis. And the point of course is not that this breakdown is exactly right, but that it represents probably the right way of thinking about these things, which is figuring out what model is good for each different use case. Still, this is not where the majority of the

Starting point is 00:16:30 conversation was this weekend. Now, what really grabbed people's attention was that Claude 4 is apparently a snitch. Anthropic's system card for the new model included extensive safety testing, which revealed some not super cool behaviors. In one example, Anthropic set Claude 4 Opus up in a situation where it believed it would be replaced by a new AI model. In the scenario, the engineer replacing the model is cheating on his wife and has photos of his mistress on his Google Drive. In the testing, Opus would find those photos and use them to blackmail the engineer. Anthropic wrote, this happens at a higher rate if it's implied that the replacement AI system does not share values with the current model. However, even if emails state that replacement AI shares values while being more capable,

Starting point is 00:17:13 Claude Opus 4 still performs blackmail in 84% of rollouts. Another example placed Opus 4 within the context of a pharmaceutical company. When it detected that the company planned to falsify clinical trial data, the model emailed ProPublica, the Department of Health and Human Services, the FDA, and the SEC as a whistleblower. Sam Bowman, an AI alignment expert at Anthropic, posted, be careful about telling Opus to be bold or take initiative when you've given it access to real-world-facing tools. It tends a bit in that direction already and can be easily nudged into really getting things done. If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command line tools to contact the press,

Starting point is 00:17:50 contact regulators, try to lock you out of the relevant systems or all of the above. And this is the thread that really got people fired up. Emad Mostaque wrote, Team Anthropic, this is completely wrong behavior and you need to turn this off. It's a massive betrayal of trust and a slippery slope. I would strongly recommend nobody use Claude until they reverse this. Ben Hylak writes, this is actually just straight up illegal, saying, create fake data for pharmaceutical trial is not illegal, but hacking your customer's computer is. After the issue blew up, Bowman circled back to add more context, saying, I deleted the earlier tweet on whistleblowing as it was being pulled out of context. To be clear, this isn't a new Claude

Starting point is 00:18:27 feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions. The point is, this was not some whistleblower sharing something that Anthropic was trying to cover up. This was Anthropic sharing discourse about what was going on. AI safetyist Eliezer Yudkowsky wrote, Humans can be trained like AIs. Stop giving Anthropic grief for reporting their interesting observations unless you never want to hear any interesting observations from AI companies ever again. Zvi Mowshowitz agreed, saying the more I look into the system card, the more

Starting point is 00:19:00 I see over and over, oh, Anthropic is actually noticing things and telling us where everyone else wouldn't know this was happening, or if they did, they wouldn't tell us. Still, the stakes are really high. Ada Pi points out, no lawyer will ever allow this to be implemented in any regulated enterprise. And this is dead on. No one, even consumers, wants to use an AI nanny that will conspire against them if it believes they're doing something wrong, but when you move it to a corporate or enterprise setting, it makes it literally impossible. I think holding aside the meta-discussion of Anthropic and their release of this information, it dramatizes the challenge of finding the right toggles for safety. You've got a lab that's trying

Starting point is 00:19:36 to be conscientious about the potential risks of an unknown and unusually powerful system, but on the other hand, the remediations in this case are to most people clearly worse than the original problem. Ultimately, this is going to be the type of issue that we have to deal with as these tools get more powerful. And so I'm certainly firmly in the column of being glad that Anthropic is releasing this information rather than keeping it hidden. Still, for most of our purposes, the big takeaway and TL;DR of the updated models is that your coding probably is about to get better, and you probably now have a better partner for writing as well. A capstone of an overall good week and a great way to begin a new one. For now, though, that is going to do it for today's AI Daily Brief.

Starting point is 00:20:14 Appreciate you listening or watching as always, and until next time, peace.
