The AI Daily Brief: Artificial Intelligence News and Analysis - Is o3 Functionally AGI?

Starting point is 00:00:00 Today on the AI Daily Brief, all the most important stories in AI from this past week while I was traveling. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link on our show notes. Hello, friends, quick note before we dive in today, I'm of course coming off some travel. And so this week, instead of a long reads episode, I decided to do a bit of a catch-up on some of the most important news. There were some interesting things that went down, so let's dive right in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.

Starting point is 00:00:36 As you know, I have been out all week, so we have a lot to catch up on. And we are kicking off with some geopolitics where the Trump administration is reportedly considering a DeepSeek ban. According to the New York Times, further restrictions include banning the startup from purchasing U.S. technology and barring Americans from accessing DeepSeek's models. Now, the report didn't get into how exactly a government could ban open source models, but functionally simply banning cloud providers from offering them is probably close enough. Congress also apparently has DeepSeek in its sights.

Starting point is 00:01:09 The House Select Committee on China called the AI startup a, quote, profound threat to U.S. national security by harvesting American users' data and sending it back to China. Their report states, although it presents itself as just another AI chatbot, offering users a way to generate text and answer questions, closer inspection reveals the app siphons data back to the People's Republic of China, creates security vulnerabilities for its users, and relies on a model that covertly censors and manipulates information pursuant to Chinese law.

Starting point is 00:01:35 Now, whether or not this comes to pass tells you a lot about the state of the conversation as it relates to China and AI right now. Speaking of, another company in trouble in that area is Nvidia. On Wednesday, the Trump administration extended export controls to cover Nvidia's H20 chips, the downrated version of the H100 that is designed to comply with controls introduced in the Biden era. The administration said that the enhanced rules address concerns that, quote, the covered products may be used in or diverted to a supercomputer in China. Nvidia, for their part, warned that they would report $5.5 billion in writedowns associated with inventory and commitments for the chips, which have essentially zero demand outside of China.

Starting point is 00:02:13 In terms of how much this would impact the development of Chinese AI, Biden Commerce Department staff said that the bans would make it around 3 to 6% more costly to develop an AI model in China. Since then, we've had constant reports of Chinese researchers doing more with less, so that figure is very much up for debate. Nvidia had been outspoken about existing export controls and lobbied against them going further. In January, as Biden's outgoing team imposed the last round of tightening, the company said that export controls, quote, will only harm the U.S. economy, set America back and play into the hands of

Starting point is 00:02:43 U.S. adversaries. Now, earlier this month, CEO Jensen Huang made a trip to Mar-a-Lago to petition the president directly. Following his attendance at a million-dollar-a-head dinner, NPR reported that Trump had reversed course on new controls on the H20s. The report stated that bans had been set to go into effect as soon as last week. The quid pro quo had been a dramatic ramp-up of local investment from Nvidia. NPR sources made vague reference to investments in AI data centers, but this week we've seen a wave of reports of Nvidia's commitments to U.S. manufacturing. On Monday, the company announced that they had begun production of Blackwell chips in TSMC's Arizona facility. They also committed to producing AI supercomputers at a pair

Starting point is 00:03:20 of facilities in Texas. In total, Nvidia claimed they would produce half a trillion dollars worth of AI infrastructure in the U.S. over the next four years. They said their manufacturing was, quote, expected to create hundreds of thousands of jobs and drive trillions of dollars in economic security over the coming decades. In the press release, Huang said, the engines of the world's AI infrastructure are being built in the United States for the first time. Adding American manufacturing helps us better meet the incredible and growing demand for AI chips and supercomputers, strengthens our supply chain and boosts our resiliency. Alas, the high-profile announcement doesn't seem to have been enough, with the export controls still going into force two days

Starting point is 00:03:55 later. The Financial Times reports the announcement came as a complete surprise to Nvidia, saying that earlier this month, the company had assured Chinese tech giants that supply of H20s would not be interrupted. And so as of Thursday, Huang is visiting Beijing to meet with political and tech leaders. According to state broadcaster CCTV, the CEO said that China was a, quote, very important market for Nvidia, and that his company would, quote, make a significant effort to optimize our products that are compliant with the regulators and continue to serve the Chinese market. Speaking of DeepSeek, sources said that his itinerary included meeting with DeepSeek founder, Liang Wenfeng, to discuss a new chip design to meet regulatory

Starting point is 00:04:29 requirements set by Washington and Beijing. A public meeting with the China Council for the Promotion of International Trade was televised, and Huang also reportedly met separately with Chinese vice premier, He Lifeng. The press for their part is reading a lot into the deference shown by Huang, who discarded his trademark leather jacket for a suit and tie to attend high-level meetings in Beijing. Speaking to the whirlwind China visit, President Trump told the press, quote, Jensen's an amazing guy. He's become a friend of mine. He's a person that's very proud of our country. He loves our country. I'm not worried about Jensen at all. Back home, big fundraising continues. Ilya Sutskever's Safe Superintelligence has closed a new round of

Starting point is 00:05:05 funding that values the company at a whopping $32 billion. The former OpenAI chief scientist founded the startup less than a year ago. In September, SSI raised a billion dollars at a $5 billion valuation, a price tag that already seemed a little rich to some for a company with no product and little more than a big name founder and a resonant mission statement. But obviously to many people that would be dismissing what SSI actually has. This round, which brought in an additional $2 billion, has marked up the valuation more than 6x. To put it in perspective, Anthropic was valued at $61.5 billion during last month's funding round, meaning that SSI has already achieved roughly half that valuation. There are a couple of potential reads on the situation. The first take is that for those who are

Starting point is 00:05:45 in this game, venture firms are simply not that price sensitive when it comes to getting into the very small handful of companies that actually have a chance to reach AGI first. Reports from February said that SSI was in talks to raise at $20 billion then, so there's been a pretty significant jump in valuation during the negotiations, meaning there's a lot of demand. The second read is that SSI might have made actual progress over the past few months. Certainly everyone is wondering about what the product will look like. James Cham, a partner at venture firm Bloomberg Beta, said, everyone is curious about exactly what Ilya's pushing and exactly what the insight is. It's super high risk, and if it works out, maybe you have the

Starting point is 00:06:20 potential to be part of someone who is changing the world. A couple more before we wrap up. Anthropic is preparing to release their long-awaited voice mode. Bloomberg reports that the feature could be released as soon as this month. Sources said the rollout will feature three different voices for Claude, identified as airy, mellow, and a British-accented version referred to as buttery. CEO Dario Amodei first teased voice mode during a January interview with the Wall Street Journal. One of the reasons he gave for the long delay was a desire to ensure that Claude's voice was comfortable and natural enough for long interactions.

Starting point is 00:06:48 The rollout will also be the first big test of Anthropic's new premium $200 per month subscription. In Microsoft Land, the company has enabled a new computer use feature for Copilot Studio. This feature is similar to offerings from OpenAI and Anthropic and allows Copilot to take over the computer to interact with websites and apps. Charles Lamanna, the VP of Copilot, said, this allows agents to handle tasks even when there is no API available to connect to the system directly. If a person can use the app, the agent can too. Apple, meanwhile, has revealed a convoluted plan to improve their AI without compromising

Starting point is 00:07:20 privacy. In a technical blog post, the company laid out a system that can check their synthetic data against tokenized user data without revealing information. The idea is that the synthetic data that most closely matches real user data can be used as the training set for Apple's next generation of models. This means that the company can technically state that their models aren't trained on user data. Apple says this method can be used to improve the performance of writing assistance, photo editing models, and their generative emoji feature.

Starting point is 00:07:45 Kind of an overcomplicated way to catch up with features that people were excited about back in 2024, but there you are. Separately, the New York Times reports that AI-enhanced Siri would finally arrive this year. Their sources said that current plans are to release the updated assistant in the fall. Explaining the features, they gave the example of being able to edit a photo and send it to a friend. A fall rollout would be well ahead of previous estimates, though, with Bloomberg tech editor Mark Gurman previously stating that he thought that Siri, quote, won't be ready until 2027 at best.

Starting point is 00:08:12 The Information writes, Apple's AI/ML group has been dubbed aimless internally, while employees are said to refer to Siri as a hot potato that is continually passed between different teams with no significant improvements. So I guess at the end of a week away, the more things change, the more they stay the same. That's going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by Vanta. Vanta is a trust management platform that helps businesses automate security and compliance, enabling them to demonstrate strong security practices and scale. In today's business landscape,

Starting point is 00:08:46 businesses can't just claim security, they have to prove it. Achieving compliance with a framework like SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate strong security practices. And we see how much this matters every time we connect enterprises with agent services providers at Superintelligent. Many of these compliance frameworks are simply not negotiable for enterprises. The problem is that navigating security and compliance is time-consuming and complicated. It can take months of work and use up valuable time and resources.

Starting point is 00:09:15 Vanta makes it easier and faster by automating compliance across 35-plus frameworks. It gets you audit ready in weeks instead of months and saves you up to 85% of associated costs. In fact, a recent IDC white paper found that Vanta customers achieved $535,000 per year in benefits, and the platform pays for itself in just three months. The proof is in the numbers. More than 10,000 global companies trust Vanta, including Atlassian, Quora, and more. For a limited time, listeners get $1,000 off at vanta.com/nlw. That's VANTA.com for $1,000 off.

Starting point is 00:09:49 Today's episode is brought to you by Superintelligent and more specifically Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the agent readiness audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could specifically help you solve your problems and create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based agent interview where we work with some number of your leadership and employees to map what's going on inside the organization and to figure out where you are in your agent

Starting point is 00:10:29 journey. That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent@besuper.ai. And let's get you plugged into the agentic era. Welcome back to the AI Daily Brief. It's pretty clear that the big news this week in AI

Starting point is 00:10:59 was the introduction by OpenAI of a set of new reasoning models. On Wednesday, OpenAI released o3 and o4-mini. o3 is their most advanced reasoning model to date, while o4-mini is being pitched as a competitive tradeoff between price, speed, and performance. There's also a high resource version of o4-mini called o4-mini-high. So the trend of OpenAI having completely clear names continues. The new batch of reasoning models introduces some new features to the o-series family.

Starting point is 00:11:27 First, the models can integrate images into their reasoning process. We've seen something along these lines show up as an emergent property of multimodal models like Google's Gemini, but this will be the first time that OpenAI has pushed the limits on what the reasoning modality can do. OpenAI told VentureBeat, these models don't just see an image, they think with it. It unlocks a new class of problem-solving that blends visual and textual reasoning. The other big improvement is tool use, with the new models natively trained on common tools. The company wrote, we've trained them to use tools through reinforcement learning, teaching them not just how to use tools, but to reason about when to use them.

Starting point is 00:12:01 President Greg Brockman commented, they actually use these tools in their chain of thought as they're trying to solve a hard problem. For example, we've seen o3 use like 600 tool calls in a row trying to solve a really hard task. Now, this could represent a big jump in agentic capabilities. For agents, being able to figure out the right tools to use for any given situation is going to be one of their biggest unlocks and is pretty key to enabling ultimately fully autonomous agents. Right now, one of the most common failure states for agents is either failing to recognize when to use a tool or failing to access the tool properly. Now, it wouldn't be a new model release without a whole bunch of benchmarks that you're not exactly sure what they mean or how much to care about, and, in fact, the tool use appears to be showing up here.

Starting point is 00:12:41 o4-mini, for example, managed to score 99.5% on the AIME 2025 mathematics competition, but only when given access to a Python interpreter. More broadly, OpenAI is claiming that o3 benchmarks as state-of-the-art across standard coding, science, and agentic tasks. However, as you all have heard me say before, I think that given the challenges of benchmarks, it's much more relevant to see what people are actually doing with these tools. Kelsey Piper of Vox's Future Perfect said that o4-mini-high is the first model to pass her own, quote, personal secret benchmark for hallucinations and complex reasoning.

Starting point is 00:13:12 Her test involves presenting the inputs of a complex midgame chess board and the prompt, mate in one. The catch is that there is no checkmate in one move. AI models are trained on extensive chess puzzles of this kind, but their training set doesn't necessarily include this kind of trick question. Piper said that her prior testing showed that models reasoned through thousands of possibilities before hallucinating a solution. This generally involves adding extra pieces to the board or illegal moves.

Starting point is 00:13:35 The models will then add lengthy justifications for why their hallucinated solution is correct. She had run this test on every Claude model to date, as well as Gemini 2.5 Pro, o3-mini-high, and Grok 3, with none figuring out that the solution is impossible. Why is this a big deal? I invented this problem because I think it gets at the core of AI's potential and limitations. An AI that can't question its premises will always be limited. An AI that doubles down on its own wrong answers will too. She noted that the reasoning trace was eight minutes long, much longer than any other query she ran, saying, that's a lot of places to potentially make mistakes and hallucinate a solution. Its expectation that there was a solution was very strong, but it overcame it.

Starting point is 00:14:14 She added in conclusion, however: that said, its explanation of why there was no checkmate, in fact, still contains some chess inaccuracies, which I know it knows better than. So certainly don't trust these things, but know they're continually getting better. An even more vociferous endorsement came from economist Tyler Cowen, who wrote, I think it's AGI, seriously. Try asking it lots of questions and then ask yourself, just how much smarter was I expecting AGI to be? I've argued in the past AGI, however you define it, is not much of a social event per se. It will still take us a long time to use it properly. Benchmarks, benchmarks, blah, blah, blah.

Starting point is 00:14:46 Maybe AGI is like porn. I know it when I see it, and I've seen it. Now, I haven't had as many reps as I normally would have this week with o3, given the travel, but I am absolutely 100% in the Tyler Cowen camp here. Not necessarily that o3 is AGI, but that it doesn't matter. These models have so far, to me, been an absolute step-change improvement relative to o1 and what we were using in the past. I've been testing them as a business thought partner,

Starting point is 00:15:12 and the reasoning is so much more thorough, so much more interesting, and just generally better. In fact, I've implored, by which I mean basically demanded, that everyone inside Superintelligent start playing around with o3 as a brainstorming partner for pretty much everything. I genuinely think it's that good. Now, I think it'll still take some time for us to figure out

Starting point is 00:15:32 exactly what the best use cases for these models are, although if enough people like me demand that all their colleagues use it for every business interaction from here on out, I'm sure we'll figure it out more quickly. Still, one use case that people jumped on very fast was that o3 appears to be disturbingly good at geoguessing. Given basically any photo of a landscape or a building, the model can pinpoint its location on a map. Henrion X wrote, 10 years ago, the CIA would have gotten on their knees for this. Every single human has just been handed an intelligence superweapon. It's only getting stranger. I would implore you if you haven't had a chance yet, go play with

Starting point is 00:16:05 this model. Even if you don't have something specific that you're trying to do, try asking it whatever business question you're thinking through at the moment. Use it as a thought and collaboration partner and just see how different it feels as opposed to past models. It is, of course, totally possible that I'm in the first-day glow of a new toy and that it's actually not all that different, but I kind of don't think so. Now, completely overshadowed by the o3 and o4-mini releases, OpenAI also rolled out a new update to their non-reasoning model family earlier in the week on Monday. GPT-4.1 will be the successor to GPT-4o and is now available to developers through the API. The GPT-4.1 family includes three different sizes with a mini and nano variant available alongside

Starting point is 00:16:45 the full-size model. OpenAI says that the nano version will be their smallest, fastest, and cheapest model yet. Another big update, the models have a million token context window matching Google's recently released Gemini 2.5 Pro. As we've discussed before, ultra-long context windows are especially important for coding assistants and agents, allowing users to dump entire codebases into the model, or run longer agentic workflows. And it seems that GPT-4.1 is explicitly aimed at coding use cases. An OpenAI spokesperson said, We've optimized GPT-4.1 for real-world use based on direct feedback

Starting point is 00:17:17 to improve in areas that developers care most about: front-end coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more. These improvements enable developers to build agents that are considerably better at real-world software engineering tasks. If nothing else, this is definitely OpenAI competing on price

Starting point is 00:17:35 in a very aggressive way. Michelle Pokrass, the post-training research lead at OpenAI, said, not all tasks need the most intelligence or top capabilities. Nano is going to be a workhorse model for cases like autocomplete, classification, data extraction, or anything else where speed is the top concern. Entrepreneur Paul Gauthier noted that this week's releases are more than the sum of their parts, posting, using o3 (high) as architect and GPT-4.1 as editor produced a new state-of-the-art of 83% on the Aider polyglot coding benchmark. It also substantially reduced costs compared to o3 (high) alone. Now, speaking of coding, something we've talked a lot about on this show is how for some time,

Starting point is 00:18:12 Anthropic's Claude has been the go-to choice for developers. Well, OpenAI is definitely not giving up that fight, because alongside these new models, they also rolled out a new coding agent. Sam Altman posted, o3 and o4-mini are super good at coding, so we're releasing a new product, Codex CLI, to make them easier to use. This is a coding agent that runs on your computer. It's fully open source and available today. We expect it to rapidly improve.

Starting point is 00:18:35 Now, because it's open source, of course, there are already forks that enable models from outside the OpenAI ecosystem as well. First reactions seem decent. Gooby said, used Codex CLI with o3, used like 150 in tokens in like an hour, switching to o4-mini now, LMAO. That being said, o3 was cooking, fixed a couple of long-standing bugs. Roshab Shravastara wrote, vibes for Codex CLI so far, been a bit meh, Claude Code still much better. Codex with o4-mini has been fantastic for one-shot single-file edits, extremely good at fixing subtle bugs when specifically prompted. Meh at iteration and retaining context and at multi-file edits. Terrible at creating documentation and explaining a codebase. So for now, maybe Claude can breathe

Starting point is 00:19:17 a sigh of relief, but it's pretty clear that OpenAI wants to compete in that space, which is also validated by the fact that on Wednesday, Bloomberg reported that the company is looking to acquire Windsurf. Windsurf is probably the best-known Cursor competitor and was valued at $1.25 billion back in August and was reportedly in talks to raise at a $3 billion valuation earlier this year. The report states that OpenAI is looking to make the acquisition at $3 billion, but sources say the deal hasn't been finalized and could still fall apart. Now, if you're wondering, why not just buy Cursor instead? Sam Altman apparently thought of that as well and made two separate attempts to buy the leading agentic coding platform.

Starting point is 00:19:54 One was late last year and another early this year. In fact, CNBC's sources say that OpenAI has actually met with 20 companies in the AI coding domain before reportedly finding a deal with Windsurf. All in all, it was an extremely busy week in OpenAI land, and I'm ignoring about a half a dozen stories that otherwise might have merited attention. For now, what I will leave you with is my strong instinct to please go try o3, play around with o4-mini as well. These really do feel like a different quality of model and a different quality of experience, and I think are going to open up some different types of use cases. For now, though, that's going to do it for today's AI Daily Brief. Until next time, peace.
