The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5.2 is Here

Starting point is 00:00:00 This podcast is sponsored by Google. Hey folks, I'm Amar, product and design lead at Google DeepMind. Have you ever wanted to build an app for yourself, your friends, or finally launch that side project you've been dreaming about? Now you can bring any idea to life, no coding background required, with Gemini 3 in Google AI Studio. It's called vibe coding and we're making it dead simple. Just describe your app and Gemini will wire up the right models for you

Starting point is 00:00:24 so you can focus on your creative vision. Head to ai.studio/build to create your first app. Today on the AI Daily Brief, GPT-5.2 is here, and OpenAI wants you to know it is for professionals. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Gemini, KPMG, Blitzy, Rovo, and Robots and Pencils. To get an ad-free version of the show, go to patreon.com. And if you are interested in sponsoring the show, lock in those 2025 rates by emailing us at sponsors at AIDailyBrief.

Starting point is 00:01:05 Welcome back to the AI Daily Brief. This is actually the second AI Daily Brief I've recorded today, because of course in the early afternoon we got GPT-5.2, which means the episode I recorded earlier in the day will become tomorrow's episode. Because obviously we have to talk about this new model. Now, there have been indications for the past week that GPT-5.2 was on the way. This is, of course, part and parcel of OpenAI's declared Code Red. In the lead-up to the release of Gemini 3, OpenAI's Sam Altman had sent a memo to his team, basically expecting there to be some rough vibes, in his words, as Google released their best ever model. Then on top of that, we got Opus 4.5, which has just continued to impress, and frankly,

Starting point is 00:01:45 if anything, grow in people's esteem. And yet, chatter over the last week has been that OpenAI's forthcoming response model, codenamed Garlic, was likely to be a capable response. Well, today we got the model, and at least at first glance, it's a banger. In the benchmarks they shared, it represented a significant improvement on the coding benchmark SWE-bench Pro, hitting 55.6% compared to Opus 4.5's 52%. It scored 52.9% on the ARC-AGI-2 exam, ahead of Opus 4.5's 37.6%. And on GDPval, which is OpenAI's internal measure of economically valuable knowledge work tasks, it scored a massive 70.9%, up from 38.8% with GPT-5. GDPval is in some ways the most relevant of the benchmarks, at least in terms of what it seems the goal

Starting point is 00:02:34 of GPT-5.2 is for OpenAI. More so, frankly, than any model release I've seen from them, there is a clear, clear messaging directive. This is a real-world business model to help professionals get more value. In a briefing with reporters, OpenAI's CEO of applications, Fidji Simo, said that 5.2 was about unlocking even more economic value for people. In her announcement tweet, she reiterated this. GPT-5.2 is here and it's the best model out there for everyday professional work. Greg Brockman writes, 5.2 is here, the most advanced frontier model for professional work and

Starting point is 00:03:10 long-running agents. It's a big step forward on enterprise tasks including spreadsheets and slides. Head of ChatGPT, Nick Turley, writes, today we're introducing GPT-5.2, our most advanced model series for professional work. GPT-5.2 Thinking is designed to help with real economically valuable tasks. The kind of work

Starting point is 00:03:27 professionals do every day, building spreadsheets and presentations, writing and reviewing production code, analyzing long documents, coordinating tools, and executing complex projects from start to finish. And indeed, you can tell that of all the benchmarks, the one that they really care about is that GDPval measure of success with professional knowledge work tasks. Simo again writes, on GDPval, the Thinking model beats or ties human experts on 70.9% of common professional tasks like spreadsheets, presentations, and document creation. Noam Brown wrote, in my opinion, GDPval is the

Starting point is 00:03:58 most important result from our 5.2 launch. We outperform in-domain experts and are state-of-the-art among all models on GDPval, which measures performance on self-contained tasks like making spreadsheets and PowerPoint presentations. Truly, you have never seen a company as excited about spreadsheets and PowerPoints as OpenAI is with the launch of 5.2. All of this was the theme of the announcement post as well. Right at the top, OpenAI harkens back to the ChatGPT Enterprise survey that we discussed earlier in the week, quoting that number where enterprise users were saving between 40 and 60 minutes a day. OpenAI writes, we designed 5.2 to unlock even more economic value for people. It's better at creating spreadsheets, building presentations,

Starting point is 00:04:38 writing code, perceiving images, understanding long contexts, using tools, and handling complex multi-step projects. And honestly, when you see the difference between 5.1 Thinking and 5.2 Thinking on some of these economic tasks, the difference could not be more stark. The examples they give include a workforce planning model with headcount, hiring plan, attrition, and budget impact. The spreadsheet is so massively improved, again, in their cherry-picked example. They also give an example of two different cap tables. While the visual is pretty similar, they note that 5.1 incorrectly calculated seed, Series A, and Series B liquidation preferences and left the majority of those rows blank, which led to an incorrect final equity payout calculation. 5.2 got all those calculations

Starting point is 00:05:20 correct. They also gave an example of project management, where 5.2 Thinking produced this really professional-looking Gantt chart to help describe and summarize progress over the course of a month. Now, these broad-based economically valuable tasks are, like I said, the thing that OpenAI chose to put right at the top of this blog post. Even ahead, it is notable, of coding. Yet, as I said, with that 55.6% on SWE-bench Pro, there are definitely coding improvements here as well. And once again, they connected this to professional users. For everyday professional use, they write, this translates into a model that can more reliably debug production code, implement feature requests, refactor large codebases, and ship fixes end to end with less manual intervention. They also note that it's better at

Starting point is 00:05:59 front end, giving examples of an ocean wave simulation, a holiday card builder, and a typing rain game where you have to type the words before they hit the bottom of the screen. A couple things to call out that were a little bit farther down in the announcement post, but were still really interesting. The first is that 5.2 seems to do really well with long context. On a needle-in-a-haystack test, where the performance of 5.1 degraded from about 90% at 8K context to less than 50% at 256K context, 5.2 Thinking barely nudged down, from 100% at 8K to something that appears to be above 90% at 256K context. Now, going back to professional use, this matters, I think, because a lot of the next generation of value is going to be unlocked by being able to handle lots and lots of enterprise

Starting point is 00:06:42 context all at once. Another important change? They found that GPT-5.2 had roughly 30 to 40% fewer hallucinations. Again, when you're thinking about professional business users, one of the great enemies of reliance on AI is hallucinations, so seeing a meaningful decrease in hallucinations, again, means a big difference for professional users. But so far, we've just talked about what OpenAI said about their own model. What about some of the folks who had early access? Medical professor Derya Unutmaz writes, I had early access to GPT-5.2 and tested mostly the Pro version. Let me just say this. Relative to 5.1 Pro, it has stronger abstraction, clearer, more realistic, balanced, and strategic responses, and shows deeper conceptual insights and vibe. And I would say this

Starting point is 00:07:24 represents one theme that I saw in a lot of these initial early responses, that, yeah, this is just a good model that is a meaningful improvement. Ethan Mollick writes, had early access to 5.2. It's an impressive model. He asked it to build him a graph of Humanity's Last Exam scores over time, which, as he points out, involved looking up and cross-referencing a lot of material and then generating something useful in one shot, which it did. When Box began testing 5.2 with their reasoning tests, CEO Aaron Levie writes, we asked the model to perform a series of enterprise tasks that approximate real-world knowledge work that we see in industries ranging from financial services to health care and life sciences. These tasks require a high degree of analytical capabilities, math, reasoning, and more. Aaron noted that with this

Starting point is 00:08:02 expanded task set, with broader and harder tasks than before, 5.2 scored seven points better than 5.1, and performed the majority of the tasks far faster than previous models. The coding first impressions are likewise pretty good. LMArena's Peter Gostev writes, I've spent a lot of time testing this model on the arena, and it's an excellent bump from the 5.1 versions for coding, and a big challenger to Gemini 3 Pro and Opus 4.5. Pietro Schirano writes, 5.2 is a serious leap forward in complex reasoning, math, coding, and simulations. It built a full 3D graphics engine in a single file. Interactive controls, 4K export, one shot. The pace of progress is unreal. He also argued that it's, quote, the best agentic model OpenAI has shipped,

Starting point is 00:08:44 runs tons of tools in a row without issues and is faster than its predecessor. 5.2 calls tools with no preamble and doesn't get lost in long sessions. Flavio Adamo wrote a short post called What Actually Changed, and found that the model was noticeably better at creating presentations, generating spreadsheets, and producing cleaner tables. He also found a significant improvement in visual design and front end. Overall, he writes, 5.2 isn't a revolution, but the upgrades are hard to miss. It's more accurate, more consistent, and a lot more dependable in tasks that actually matter. Now, not everyone was universally positive. In fact, there were a number of early testers who did point out some of the challenges of 5.2. Dan Shipper

Starting point is 00:09:21 from Every said it's not as good a writer as Opus on their internal benchmarks, and that it's in his estimation mostly an incremental upgrade, saying that he hasn't found himself explicitly switching to it for day-to-day tasks. That idea of this being an incremental upgrade is Every's big banner headline. They said while it excels at instruction following and extended tasks, don't expect it to surprise you. Now, one thing that's notable about Every's tests is that they have a more sophisticated test that they built for writing quality than many others, one that uses about 50 requests and scores them on things like reader engagement and AI-ism avoidance. Meaning, in other words, that although they're calling this a vibe check,

Starting point is 00:09:55 they're actually one of the best, if not the best, sources of early feedback when it comes to the quality of writing out of a new model. 5.2 certainly wasn't bad, matching Sonnet 4.5 at 74% on their tests, but it was below Opus 4.5's 80%. One bright spot they pointed to was that it was less prone to tired AI constructions like it's not X, it's Y. So summing up, Every's critique isn't so much a critique.

Starting point is 00:10:17 It's just a cap on how hyped to get, again, calling this an incremental upgrade. Others pointed out things that it did well versus not so well. Simon Smith verified that 5.2 is a lot better for professional deliverables, saying that the biggest leap is in structured business outputs like multi-sheet Excel workbooks with proper formatting, and PowerPoint decks with better structure and concise bullets. He said this is the first time ChatGPT has made spreadsheets and presentations

Starting point is 00:10:39 I'd consider remotely client-ready. He also argued that 5.2 has better concision of thinking. He argues that 5.1 sometimes rambles, producing a spreadsheet sprawl, whereas 5.2 is more deliberate and better calibrated to the task complexity. However, he argues that this isn't universally a good thing. He compares 5.1 Thinking to a brilliant, slightly chaotic freelancer, and 5.2 Thinking to a polished professional. He agrees with Every that 5.2 is less likely to surprise you, whereas the upside of 5.1's slightly chaotic nature is that, while, in his words, you never let it talk to a client, sometimes it surprises you with an outstanding idea or turn of

Starting point is 00:11:15 phrase. Ultimately, Simon comes down on the side of this being a big upgrade. Allie Miller had similar findings. In her test, the thinking and problem solving felt noticeably stronger. She said that it gave her deeper explanations than she's used to seeing. In fact, she writes, at one point it literally wrote code to improve its own OCR in the middle of a task. She also found that idea exploration feels a little bit richer even than what she's seen from Opus 4.5. However, like Simon, she found the tone to be different, and for her, a downside. She said the default voice felt a little bit more rigid, and the length and markdown behavior is extreme. A simple question turned into 58 bullets and numbered points. Ultimately, she argues

Starting point is 00:11:52 that this version is optimized for deeper problem solving, structured analysis, and power users who want to sift through all of those options. 5.2, she says, feels like a step towards AI as a serious analyst and less AI as a friendly companion. Hello, friends. If you've been enjoying what we've been discussing on the show, you'll want to check out another podcast that I have had the privilege to host, which is called You Can With AI from KPMG. Season one was designed to be a set of real stories from real leaders making AI work in their organizations, and now season two is coming and we're back with even bigger conversations. This show is entirely focused on what it's like to actually drive AI change inside your

Starting point is 00:12:35 enterprise, and has case studies, expert panels, and a lot more practical goodness that I hope will be extremely valuable for you as the listener. Search You Can With AI on Apple, Spotify, or YouTube and subscribe today. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale codebases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform,

Starting point is 00:13:05 bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously, while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice to bring an AI-native SDLC into their org. Visit blitzy.com and press get a demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native.

Starting point is 00:13:37 Meet Rovo, your AI-powered teammate. Rovo unleashes the potential of your team with AI-powered search, chat, and agents, or build your own agent with Studio. Rovo is powered by your organization's knowledge and lives on Atlassian's trusted and secure platform, so it's always working in the context of your work. Connect Rovo to your favorite SaaS apps, so no knowledge gets left behind.

Starting point is 00:14:00 Rovo runs on the Teamwork Graph, Atlassian's intelligence layer that unifies data across all of your apps and delivers personalized AI insights from day one. Rovo is already built into Jira, Confluence, and Jira Service Management Standard, Premium, and Enterprise subscriptions. Know the feeling when AI turns from tool to teammate. If you Rovo, you know.

Starting point is 00:14:19 Discover Rovo, your new AI teammate powered by Atlassian. Get started at R-O-V, as in victory, O dot com. AI changes fast. You need a partner built for the long game. Robots and Pencils works side by side with organizations to turn AI ambition into real human impact. As an AWS certified partner,

Starting point is 00:14:39 they modernize infrastructure, design cloud native systems, and apply AI to create business value, and their partnerships don't end at launch. As AI changes, Robots and Pencils stays by your side, so you keep pace. The difference is close partnership that builds value and compounds over time. Plus, with delivery centers across the U.S., Canada, Europe, and Latin America, clients get local expertise and global scale. For AI that delivers progress, not promises, visit robotsandpencils.com/aidailybrief. Now, one person who agrees with all of what has been

Starting point is 00:15:14 said before, but found that 5.2 Pro is so uniquely better at what it does that it has become indispensable for him, is Matt Shumer. Now, interestingly, Matt says that he's had access to these models since November 25th, which is a lot longer than most of these folks, who are sometimes going on days or even just a couple of hours of early access. His overall review of 5.2 was summed up as incredibly impressive, but too slow. He said 5.2 Thinking is a meaningful step forward in instruction following and willingness to attempt hard tasks, code generation is a lot better than 5.1, vision and long context are much improved, but speed is a big downside. And speed can be a big deal. He expands the thought, here's something that affects my daily usage.

Starting point is 00:15:57 Standard 5.2 Thinking is slow. In my experience, it's been very, very slow for most questions, even straightforward ones. I almost never use Instant, Thinking is much better, and Pro is insanely better, but it means I'm usually paying a speed penalty. In practice, this means I barely use GPT-5.2 Thinking. My actual workflow has become, quick questions go to Claude Opus 4.5, and when I need deep reasoning, I go straight to 5.2 Pro. The standard Thinking model sits in an awkward middle ground, slower than Opus, but without the full reasoning benefits of Pro. However, Matt, more than anyone else that I've seen so far, really extols Pro as something fundamentally different. He writes, more than raw intelligence, what sets Pro apart is its willingness to think. It will

Starting point is 00:16:38 spend far longer than previous Pro models working through a problem. For research tasks, it will research an absurdly long time if that's what the task requires. Now, one example he gave to capture what Pro does uniquely among models. He said, I asked it for meal planning help, emphasizing that I have no time to cook. I wanted a seven-day plan with three meals and two snacks per day. Pro came back with amazing recipe plans, but what stood out was the ingredients list. Much simpler than what the other models suggested. It understood that I have no time wasn't just a constraint on cooking time,

Starting point is 00:17:09 it was a constraint on shopping complexity, prep work, and mental overhead. It grasped my mentality, not just my literal request. I had sent the same prompt to all of the other frontier models, and none of them accounted for this. This is the kind of understanding that makes Pro feel different. Indeed, so enthused was he that he wrote another full review, called his 5.2 Pro deep dive, where he said this is undoubtedly the world's best model, I can't live without it. Now again, he warns of the cost of speed, which isn't just that you have to wait around for an answer, but as he points out, every so often it will think for a long time and still make a big mistake,

Starting point is 00:17:43 wasting a lot of time. That means in his estimation that prompting matters more than ever. Be explicit, add constraints, and refine prompts before you send them. Still, ultimately, he writes, after using Pro for two weeks, I can't live without it. It's my go-to for everything I do that requires deep thinking, research, or coding, or almost any prompt I run that doesn't require an instant answer. I think Allie Miller actually has a pretty good rundown of what this amounts to for different user profiles. For general users, she writes, I think they'll be incrementally more pleased. She writes the idea space that 5.2 explores is better than 5.1, so they might like problem-solving

Starting point is 00:18:17 a little bit more. For devs, she wasn't sure. She said that while the models seem to fare well on one-shot asks, she suspected that the max-code models within Codex are still better, and that Claude and Gemini are either right up there or even ahead still. When it came to business users, although she said that she didn't feel all that big of a leap, everything else around the benchmarks suggests a huge jump. Researchers, however, she suggested were going to be the most

Starting point is 00:18:39 pleased group overall, which comports with Matt Shumer's argument that this is a slow genius. Now, going back to that question of coding in direct head-to-head comparison, we do, of course, have some ways to see what people prefer in direct head-to-head ways. LMArena shared that 5.2 was at number six in web dev, and that 5.2 High had jumped all the way up to number two, ahead of Opus 4.5 and Gemini 3 Pro, but behind Opus 4.5 Thinking. On front end in the Design Arena, 5.2 High remains behind Gemini 3 Pro and Opus 4.5, but came in at third. So let's talk now about some of the larger implications of this release. Some pointed out that there are implications for what we believe around training. Ben Pouladian writes, GPT-5.2 is the clearest signal yet that pre-training

Starting point is 00:19:21 scaling isn't slowing down. Bigger corpuses, longer contexts, hotter training runs. Every jump like this means one thing. Nvidia's curve is nowhere near flattening. We're still early in the compute super cycle. Now, if this becomes conventional wisdom, it could have meaningful impacts on the spectrum of boom to bubble, in the same way that Gemini 3 being released pushed people more towards boom, at least for the moment. TDM on Twitter also noted that in the our partners section of the OpenAI announcement, they said that 5.2 was built on Nvidia GPUs including H100s, H200s, and GB200s. Another interesting implication, I think, just has to do with the pace of change. You heard before in the benchmarks that 5.2 scored very high on the ARC-AGI exams, both one and two.

Starting point is 00:20:03 Well, ARC Prize tweeted, a year ago we verified a preview of an unreleased version of OpenAI o3 that scored 88% on ARC-AGI. The catch was, of course, that that version cost four and a half thousand dollars a task. Today, they write, we verified a new 5.2 Pro extra high state-of-the-art score of 90.5% at $11.64 a task. For those quickly doing the math, or asking ChatGPT to do the math, they point out that that represents a roughly 390x efficiency improvement in one year. Now, in terms of what it means in their battle with Anthropic and Google, it's too early to know for

Starting point is 00:20:38 sure how people are going to feel as they really get their hands on it, but it does seem likely to me to stem some of the bleeding. Even though it isn't claiming to be universally better in all ways, and a lot of the first responses that we shared have some caveating or at least nuance as to where and how and in what ways it's good, it's clearly a really good model that is a big step up from what OpenAI offered before, and it is likely going to compete with Gemini 3 Pro and Opus 4.5 on a lot of different use cases. Sam Altman also tweeted just after the release, also we have a few little Christmas presents for you next week, which many are speculating right away means the next version of images, given all the rumors that we've shared and talked about in previous

Starting point is 00:21:14 episodes this week about new image models being tested under pseudonymous names. Summing up and touching on something that we haven't even had a chance to talk about yet, Rohit writes, Code red to the best model and a partnership with Disney in one week, damn. What they're referring to, of course, is a new partnership that was announced this morning, where the Walt Disney Company is not only not going after OpenAI in court, but is instead granting them a three-year license to use something like 200 Disney characters in Sora generations. The details are, one,

Starting point is 00:21:44 it's a three-year licensing agreement, including one-year exclusivity, where Sora users will be able to generate videos that use more than 200 different Disney, Marvel, Pixar, and Star Wars characters. Creating an incentive for people to actually go do that, some number of those Sora videos will then actually stream on Disney Plus, and to top it all off, Disney is also going to become a major customer of OpenAI, both deploying ChatGPT for its employees and using the API to build new products, and finally, Disney's going to make a billion-dollar equity investment into OpenAI as well.

Starting point is 00:22:13 Now, I got to give a big shout out to Andrew Curran, who's one of the best AI news aggregators on Twitter slash X. All the way back in August, when Sam Altman tweeted an image of a faded Death Star, Andrew wrote, Sometimes I read too much into things, it's my nature. However, seeing as I think we're getting a Sora 2 announcement, I'm predicting the mouse has finally made up its mind. And it wasn't just that one prediction.

Starting point is 00:22:35 Back in November, he also wrote, Disney is becoming an AI company. At this point, it's simply a matter of who they choose as a partner. A deal between OpenAI and Disney has seemed close many times over the last year, but it looks like it's coming down to this week. To me, this decision is a huge signal for who will be leading the race a year from now. It's far bigger than the IP.

Starting point is 00:22:53 It's also the fact that as soon as Disney forms this partnership and starts using AI for user-created content, which will begin with video shorts on Disney Plus, it will use its immense media power to broadcast that AI is a legitimate creative tool and will actively encourage its use. To me, this is the biggest decision of the year and whoever wins it will have immense main character energy in 2026.

Starting point is 00:23:12 Adding some validity to that argument, on the same day that they announced this OpenAI deal, Disney sent a cease and desist letter to Google, accusing them of copyright infringement on a massive scale. Now, there will be a lot more to get into on that particular deal, but bringing you back to GPT-5, I was someone who felt like the 5.1 update was very meaningful. For my use cases of iterative brainstorming and business strategy collaboration, 5.2 in general, but 5.2 Pro especially, felt like a major upgrade. In fact, like Matt Shumer, although in the context of a different model,

Starting point is 00:23:43 I found myself, much more than I was before, skipping the Thinking model and going straight to 5.1 Pro. I was finding myself just naturally redesigning my work so that I could let that process take place while I was doing other things and then come back to it when it was ready. However, it feels to me like in most ways, OpenAI sees 5.2 as the big next step post-GPT-5. It almost feels as though 5.1 in personality and capabilities was kind of what they wanted 5 to be, and 5.2, at least at first glance, appears to be what they wanted that next intermediate model to be. 5.2 should be rolling out to all paid subscribers over the next day or so,

Starting point is 00:24:20 so we will have some fun figuring out what it does well. For now, that is going to do it for the AI Daily Brief. Appreciate you listening or watching as always, until next time, peace.
