The AI Daily Brief: Artificial Intelligence News and Analysis - GPT-5.2 is Here
Episode Date: December 11, 2025
Today’s episode breaks down GPT-5.2, OpenAI’s most work-focused model yet, with major gains in reasoning stability, long-context performance, and real professional tasks like coding, spreadsheets, and presentations. The conversation looks at early benchmarks and tester reactions, what OpenAI’s emphasis on economic value signals about its strategy, and how the model’s launch coincides with a blockbuster new Disney partnership that expands OpenAI’s reach across enterprise, media, and IP.
Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Gemini - Build anything with Gemini 3 Pro in Google AI Studio - http://ai.studio/build
Rovo - Unleash the potential of your team with AI-powered Search, Chat and Agents - https://rovo.com/
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
LandfallIP - AI to Navigate the Patent Process - https://landfallip.com/
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai
Transcript
This podcast is sponsored by Google.
Hey folks, I'm Amar, product and design lead at Google DeepMind.
Have you ever wanted to build an app for yourself, your friends,
or finally launch that side project you've been dreaming about?
Now you can bring any idea to life, no coding background required,
with Gemini 3 in Google AI Studio.
It's called vibe coding and we're making it dead simple.
Just describe your app and Gemini will wire up the right models for you
so you can focus on your creative vision.
Head to ai.studio/build to create your first app.
Today on the AI Daily Brief, GPT 5.2 is here, and OpenAI wants you to know it is for professionals.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, Gemini, KPMG, Blitzy, Rovo, and Robots & Pencils.
To get an ad-free version of the show, go to patreon.com.
And if you are interested in sponsoring the show, lock in those 2025 rates by emailing us at sponsors@aidailybrief.ai.
Welcome back to the AI Daily Brief.
This is actually the second AI Daily Brief I've recorded today, because of course in the early afternoon we got GPT 5.2, which means the episode I recorded earlier in the day will become tomorrow's episode.
Because obviously we have to talk about this new model.
Now, there have been indications for the past week that GPT 5.2 was on the way.
This is, of course, part and parcel of OpenAI's declared Code Red.
In the lead-up to the release of Gemini 3, OpenAI's Sam Altman had sent a memo to his team, basically
expecting there to be some "rough vibes," in his words, as Google released their best-ever model.
Then on top of that, we got Opus 4.5, which has just continued to impress and, frankly,
if anything, grow in people's esteem. And yet, the chatter over the last week has been that OpenAI's
forthcoming model, codenamed Garlic, was likely to be a capable response.
Well, today we got the model, and at least at first glance, it's a banger. In the benchmarks they
shared, it represented a significant improvement on the coding benchmark SWE-Bench Pro,
hitting 55.6% compared to Opus 4.5's 52%. It scored 52.9% on the ARC-AGI-2 exam,
ahead of Opus 4.5's 37.6%. And on GDPval, which is OpenAI's internal measure of
economically valuable knowledge work tasks, it scored a massive 70.9%, up from 38.8% with GPT-5.
GDPval is in some ways the most relevant of the benchmarks, at least in terms of what it seems the goal
of GPT-5.2 is for OpenAI. More so, frankly, than any model release I've seen from them,
there is a clear, clear messaging directive. This is a real-world business model to help
professionals get more value. In a briefing with reporters, OpenAI's CEO of Applications, Fidji
Simo, said that 5.2 was about unlocking even more economic value for people. In her announcement
tweet, she reiterated this: GPT-5.2 is here and it's the best model out there for everyday professional
work. Greg Brockman writes, 5.2 is here, the most advanced frontier model for professional work and
long-running agents. It's a big step forward on enterprise tasks including spreadsheets and slides.
Head of ChatGPT Nick Turley writes, today we're introducing GPT-5.2, our most advanced model series
for professional work. GPT-5.2 Thinking is designed to help with real, economically valuable tasks,
the kind of work professionals do every day: building spreadsheets and presentations, writing and
reviewing production code, analyzing long documents, coordinating tools, and executing complex
projects from start to finish.
And indeed, you can tell that of all the benchmarks, the one that they really care about is that
GDPval measure of success with professional knowledge work tasks. Simo again writes, on GDPval,
the thinking model beats or ties human experts on 70.9% of common professional tasks like
spreadsheets, presentations, and document creation. Noam Brown wrote, in my opinion, GDPval is the
most important result from our 5.2 launch. We outperform in-domain experts and are state-of-the-art
among all models on GDPval, which measures performance on self-contained tasks like making
spreadsheets and PowerPoint presentations. Truly, you have never seen a company as excited about
spreadsheets and PowerPoints as OpenAI is with the launch of 5.2. All of this was the theme of the
announcement post as well. Right at the top, OpenAI harkens back to the ChatGPT Enterprise survey
that we discussed earlier in the week, quoting that number where enterprise users were saving
between 40 and 60 minutes a day. OpenAI writes, we designed 5.2 to unlock
even more economic value for people. It's better at creating spreadsheets, building presentations,
writing code, perceiving images, understanding long contexts, using tools and handling complex
multi-step projects. And honestly, when you see the difference between 5.1 Thinking and 5.2 Thinking
on some of these economic tasks, the difference could not be more stark. The examples they give
include a workforce planning model including headcount, hiring plan, attrition, and budget impact.
The spreadsheet is massively improved, again, in their cherry-picked example. They also give
an example of two different cap tables. While the visual is pretty similar, they note that 5.1
incorrectly calculated seed, Series A, and Series B liquidation preferences and left the majority of those
rows blank, which led to an incorrect final equity payout calculation. 5.2 got all those calculations
correct. They also gave an example of project management, where 5.2 Thinking produced this really
professional-looking Gantt chart to help describe and summarize progress over the course of a month.
Now, these broad-based economically valuable tasks are, like I said, the thing that OpenAI
chose to put right at the top of this blog post. Even ahead, it is notable, of coding. Yet, as I said,
with that 55.6% on SWE-Bench Pro, there are definitely coding improvements here as well. And once again,
they connected this to professional users. For everyday professional use, they write, this translates
into a model that can more reliably debug production code, implement feature requests, refactor large
codebases and ship fixes end to end with less manual intervention. They also note that it's better at
front end, giving examples of an ocean wave simulation, a holiday card builder, and a typing rain game
where you have to type the words before they hit the bottom of the screen. A couple things to call
out that were a little bit farther down in the announcement post, but were still really interesting.
The first is that 5.2 seems to do really well with long context. On needle-in-a-haystack tests,
where the performance of 5.1 degraded from about 90% at 8K context to less than 50% at 256K
context, 5.2 Thinking barely nudged down, from 100% at 8K to something that appears to be above 90%
at 256K context. Now, going back to professional use, this matters, I think, because a lot of the next
generation of value is going to be unlocked by being able to handle lots and lots of enterprise
context all at once. Another important change? They found that GPT-5.2 had roughly 30 to 40% fewer
hallucinations. Again, when you're thinking about professional business users, one of the great enemies of
reliance on AI is hallucinations, so seeing a meaningful decrease in hallucinations, again,
means a big difference for professional users. But so far, we've just talked about what OpenAI
said about their own model. What about some of the folks who had early access? Medical professor
Derya Unutmaz writes, I had early access to GPT-5.2 and tested mostly the Pro version. Let me just say
this. Relative to 5.1 Pro, it has stronger abstraction, clearer, more realistic, balanced,
and strategic responses, and shows deeper conceptual insights and vibe. And I would say this
represents one theme that I saw in a lot of these initial early responses, that, yeah, this is just a
good model that is a meaningful improvement. Ethan Mollick writes, had early access to 5.2. It's an impressive
model. He asked it to build him a graph of Humanity's Last Exam scores over time, which, as he points out,
involved looking up and cross-referencing a lot of material and then generating something useful in one
shot, which it did. When Box began testing 5.2 with their reasoning tests, CEO Aaron Levie writes,
we asked the model to perform a series of enterprise tasks that approximate real-world knowledge work that we see in
industries ranging from financial services to health care and life sciences. These tasks require a
high degree of analytical capabilities, math, reasoning, and more. Aaron noted that with this
expanded task set, with broader and harder tasks than before, 5.2 scored seven points better than
5.1, and performed the majority of the tasks far faster than previous models. The coding first
impressions are likewise pretty good. LMArena's Peter Gostev writes, I've spent a lot of time
testing this model on the arena, and it's an excellent bump from the 5.1 versions for coding,
and a big challenger to Gemini 3 Pro and Opus 4.5. Pietro Schirano writes,
5.2 is a serious leap forward in complex reasoning, math, coding, and simulations. It built a full
3D graphics engine in a single file. Interactive controls, 4K export, one shot. The pace of progress
is unreal. He also argued that it's, quote, the best agentic model OpenAI has shipped,
runs tons of tools in a row without issues and is faster than its predecessor.
5.2 calls tools with no preamble and doesn't get lost in long sessions. Flavio Adamo wrote
a short post called What Actually Changed, and found that the model was noticeably better at
creating presentations, generating spreadsheets, producing cleaner tables. He also found a significant
improvement in visual design and front end. Overall, he writes, 5.2 isn't a revolution,
but the upgrades are hard to miss. It's more accurate, more consistent, and a lot more
dependable in tasks that actually matter. Now, not everyone was universally positive. In fact,
there were a number of early testers who did point out some of the challenges of 5.2. Dan Shipper
from Every said it's not as good a writer as Opus on their internal benchmarks, and that it's in his
estimation mostly an incremental upgrade, saying that he hasn't found himself explicitly switching to it
for day-to-day tasks. That idea of this being an incremental upgrade is Every's big banner headline.
They said while it excels at instruction following and extended tasks, don't expect it to surprise
you. Now, one thing that's notable about Every's tests is that they have a more sophisticated test
that they built for writing quality than many others, one that uses about 50 requests and scores them
on things like reader engagement and AI-ism avoidance.
Meaning in other words that although they're calling this a vibe check,
they're actually one of the best, if not the best,
source of early feedback when it comes to the quality of writing out of a new model.
5.2 certainly wasn't bad,
matching Sonnet 4.5 at 74% on their tests,
but it was below Opus 4.5's 80%.
One bright spot they pointed to was that it was less prone
to tired AI constructions like "it's not X, it's Y."
So summing up, Every's critique isn't so much a critique.
It's just a cap on how hyped to get,
again, calling this an incremental upgrade.
Others pointed out things that it did well versus not so well.
Simon Smith verified that 5.2 is a lot better for professional deliverables,
saying that the biggest leap is in structured business outputs like
multi-sheet Excel workbooks with proper formatting,
and PowerPoint decks with better structure and concise bullets.
He said this is the first time ChatGPT has made spreadsheets and presentations
I'd consider remotely client-ready.
He also argued that 5.2 has better concision of thinking.
He argues that 5.1 sometimes rambles, producing a spreadsheet
sprawl, whereas 5.2 is more deliberate and better calibrated to the task complexity. However, he argues
that this isn't universally a good thing. He compares 5.1 Thinking to a brilliant, slightly chaotic
freelancer, and 5.2 Thinking to a polished professional. He agrees with Every that 5.2 is less likely
to surprise you, whereas the upside of 5.1's slightly chaotic nature is that, while, in his words,
you never let it talk to a client, sometimes it surprises you with an outstanding idea or turn of
phrase. Ultimately, Simon comes down on the side of this being a big upgrade.
Allie Miller had similar findings. In her test, the thinking and problem solving felt noticeably stronger.
She said that it gave her deeper explanations than she's used to seeing. In fact, she writes,
at one point it literally wrote code to improve its own OCR in the middle of a task.
She also found that idea exploration feels a little bit richer even than what she's seen from
Opus 4.5. However, like Simon, she found the tone to be different, and for her, a downside. She said
the default voice felt a little bit more rigid, and the length and markdown behavior is
extreme. A simple question turned into 58 bullets and numbered points. Ultimately, she argues
that this version is optimized for deeper problem solving, structured analysis, and power users
who want to sift through all of those options. 5.2, she says, feels like a step towards
AI as a serious analyst and less AI as friendly companion.
Hello, friends. If you've been enjoying
what we've been discussing on the show, you'll want to check out another podcast that I have had
the privilege to host, which is called You Can With AI from KPMG.
Season one was designed to be a set of real stories from real leaders, making AI work in
their organizations, and now season two is coming and we're back with even bigger conversations.
This show is entirely focused on what it's like to actually drive AI change inside your
enterprise, and has case studies, expert panels, and a lot more practical goodness that I hope
will be extremely valuable for you as the listener. Search You Can with AI on Apple, Spotify,
or YouTube and subscribe today.
This episode is brought to you by Blitzy,
the Enterprise Autonomous Software Development Platform with infinite code context.
Blitzy uses thousands of specialized AI agents that think for hours
to understand enterprise-scale code bases with millions of lines of code.
Enterprise engineering leaders start every development sprint with the Blitzy platform,
bringing in their development requirements.
The Blitzy platform provides a plan, then generates and pre-compiles code for each task.
Blitzy delivers 80% plus of the development work autonomously,
while providing a guide for the final 20% of human development work required to complete the sprint.
Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
as their pre-IDE development tool, pairing it with their coding copilot of choice to bring an AI-native
SDLC into their org. Visit blitzy.com and press get a demo to learn how Blitzy transforms your
SDLC from AI-assisted to AI-native.
Meet Rovo, your AI-powered teammate.
Rovo unleashes the potential of your team with AI-powered search, chat,
and agents, or build your own agent with Studio.
Rovo is powered by your organization's knowledge
and lives on Atlassian's trusted and secure platform,
so it's always working in the context of your work.
Connect Rovo to your favorite SaaS apps,
so no knowledge gets left behind.
Rovo runs on the Teamwork Graph,
Atlassian's intelligence layer that unifies data across all of your apps
and delivers personalized AI insights from day one.
Rovo is already built into Jira,
Confluence, and Jira Service Management Standard,
Premium, and Enterprise subscriptions.
Know the feeling when AI turns from tool to teammate.
If you Rovo, you know.
Discover Rovo, your new AI teammate powered by Atlassian.
Get started at rovo.com.
AI changes fast.
You need a partner built for the long game.
Robots & Pencils works side by side
with organizations to turn AI ambition
into real human impact.
As an AWS certified partner,
they modernize infrastructure,
design cloud native systems,
and apply AI to
create business value, and their partnerships don't end at launch. As AI changes, Robots & Pencils
stays by your side, so you keep pace. The difference is close partnership that builds value and
compounds over time. Plus, with delivery centers across the U.S., Canada, Europe, and Latin America,
clients get local expertise and global scale. For AI that delivers progress, not promises,
visit robotsandpencils.com/aidailybrief.
Now, one person who agrees with all of what has been
said before, but found that 5.2 Pro is so uniquely better at what it does that it has become
indispensable for him, is Matt Shumer. Now, interestingly, Matt says that he's had access to these
models since November 25th, which is a lot longer than most of these folks who are sometimes going on
days or even just a couple of hours of early access. His overall review of 5.2 was summed up as
incredibly impressive, but too slow. He said 5.2 Thinking is a meaningful step forward in
instruction following and willingness to attempt hard tasks, code generation
is a lot better than 5.1, vision and long context are much improved, but speed is a big downside.
And speed can be a big deal. He expands the thought, here's something that affects my daily usage.
Standard 5.2 Thinking is slow. In my experience, it's been very, very slow for most questions,
even straightforward ones. I almost never use Instant, Thinking is much better, and Pro is
insanely better, but it means I'm usually paying a speed penalty. In practice, this means I barely use
GPT-5.2 Thinking. My actual workflow has become: quick questions go to Claude Opus 4.5, and when I need
deep reasoning, I go straight to 5.2 Pro. The standard Thinking model sits in an awkward middle ground,
slower than Opus, but without the full reasoning benefits of Pro. However, Matt, more than
anyone else that I've seen so far, really extols Pro as something fundamentally different.
He writes, more than raw intelligence, what sets Pro apart is its willingness to think. It will
spend far longer than previous pro models working through a problem. For research tasks, it will research
an absurdly long time if that's what the task requires.
Now, one example he gave to capture what Pro does uniquely among models.
He said, I asked it for meal planning help emphasizing that I have no time to cook.
I wanted a seven-day plan with three meals and two snacks per day.
Pro came back with amazing recipe plans, but what stood out was the ingredients list.
Much simpler than what the other models suggested.
It understood that I have no time wasn't just a constraint on cooking time,
it was a constraint on shopping complexity, prep work, and mental overhead.
It grasped my mentality, not just my literal request.
I had sent the same prompt to all of the other frontier models, and none of them accounted for this.
This is the kind of understanding that makes Pro feel different.
Indeed, so enthused was he that he wrote another full review, called his 5.2 Pro deep dive,
where he said this is undoubtedly the world's best model, I can't live without it.
Now again, he warns of the cost of speed, which isn't just that you have to wait around for an answer,
but as he points out, every so often it will think for a long time and still make a big mistake,
wasting a lot of time. That means in his estimation that prompting matters more than ever.
Be explicit, add constraints, and refine prompts before you send them.
Still ultimately, he writes, after using Pro for two weeks, I can't live without it.
It's my go-to for everything I do that requires deep thinking, research, or coding,
or almost any prompt I run that doesn't require an instant answer.
I think Allie Miller actually has a pretty good rundown of what this amounts to for different user profiles.
For general users, she writes, I think they'll be incrementally more pleased.
She writes the idea space that 5.2 explores is better than 5.1, so they might like problem-solving
a little bit more.
For devs, she wasn't sure.
She said that while the models seem to fare well on one-shot asks, she suspected that the
max-code models within Codex are still better, and that Claude and Gemini are either right
up there or even ahead still.
When it came to business users, although she said that she didn't feel all that big of a leap,
everything else around the benchmarks suggests a huge jump.
Researchers, however, she suggested were going to be the most
pleased group overall, which comports with Matt Shumer's argument that this is a slow genius.
Now, going back to that question of coding, we do, of course, have some
ways to see what people prefer in direct head-to-head comparisons. LMArena shared that 5.2 was at number
6 in web dev, and that 5.2 High had jumped all the way up to number two, ahead of Opus 4.5 and
Gemini 3 Pro, but behind Opus 4.5 Thinking. On front end in the Design Arena, 5.2 High remains behind
Gemini 3 Pro and Opus 4.5, but came in at third. So let's talk now about some of the
larger implications of this release. Some pointed out that there are implications for what we believe
around training. Ben Pouladian writes, GPT-5.2 is the clearest signal yet that pre-training
scaling isn't slowing down. Bigger corpuses, longer contexts, hotter training runs. Every jump like this
means one thing. Nvidia's curve is nowhere near flattening. We're still early in the compute
super cycle. Now, if this becomes conventional wisdom, it could have meaningful impacts on the
spectrum of boom to bubble in the same way that GPT3 being released pushed people more towards
boom, at least for the moment. TDM on Twitter also noted that in the partners section of the
OpenAI announcement, they said that 5.2 was built on Nvidia GPUs including H100s, H200s,
and GB200s. Another interesting implication, I think, just has to do with the pace of change.
You heard before in the benchmarks that 5.2 scored very high on the ARC-AGI exams, both one and two.
Well, ARC Prize tweeted, a year ago we verified a preview of an unreleased version of OpenAI
o3 that scored 88% on ARC-AGI.
The catch was, of course, that that version cost four and a half thousand dollars a task.
Today, they write, we verified a new 5.2 Pro extra high state-of-the-art score of 90.5%
at $11.64 a task.
For those quickly doing the math, or asking ChatGPT
to do the math, they point out that that represents a 390x efficiency improvement in one year.
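For anyone who does want to check that math, here is a quick back-of-the-envelope sketch in Python. It assumes "efficiency" here simply means cost per task, and it takes the two quoted figures at face value: roughly $4,500 per task a year ago and $11.64 per task today. The variable names are just illustrative.

```python
# Back-of-the-envelope check of the quoted efficiency claim.
# Assumption: "efficiency" is read as cost per task, using the two figures quoted in the episode.
cost_per_task_last_year = 4500.00  # unreleased o3 preview, dollars per task
cost_per_task_today = 11.64        # 5.2 Pro (extra high), dollars per task

improvement = cost_per_task_last_year / cost_per_task_today
print(f"{improvement:.1f}x cheaper per task")  # prints "386.6x cheaper per task"
```

So the raw ratio comes out a little under 390, which is consistent with the rounded 390x figure.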
Now, in terms of what it means in their battle with Anthropic and Google, it's too early to know for
sure how people are going to feel as they really get their hands on it, but it does seem likely to
me to stem some of the bleeding. Even though it isn't claiming to be universally better in all ways,
and a lot of the first responses that we shared have some caveating or at least nuance to where
and how and in what ways it's good, it's clearly a really good model that is a big step up from what
OpenAI offered before, and it is likely going to compete with Gemini 3 Pro and Opus 4.5
on a lot of different use cases. Sam Altman also tweeted just after the release, also we have
a few little Christmas presents for you next week, which many are speculating right away means
the next version of Images, given all the rumors that we've shared and talked about in previous
episodes this week about new image models being tested under pseudonymous names. Summing up and touching
on something that we haven't even had a chance to talk about yet, Rohit writes,
Code red to the best model and a partnership with Disney in one week, damn.
What they're referring to, of course, is a new partnership that was announced this morning,
where the Walt Disney Company is not only not going after OpenAI in court,
but is instead granting them a three-year license to use something like 200 Disney characters
in Sora generations.
The details are, one,
it's a three-year licensing agreement, including one year of exclusivity,
where Sora users will be able to generate videos that use more than 200 different Disney,
Marvel, Pixar, and Star Wars characters.
Creating an incentive for people to actually go do that,
some number of those Sora videos will then actually stream on Disney Plus.
And to top it all off, Disney is also going to become a major customer of OpenAI,
both deploying ChatGPT for its employees and using the API to build new products,
and finally, Disney's going to make a billion-dollar equity investment into OpenAI as well.
Now, I got to give a big shout out to Andrew Curran,
who's one of the best AI news aggregators on Twitter/X.
All the way back in August, when Sam Altman tweeted an image of a faded Death Star,
Andrew wrote,
Sometimes I read too much into things, it's my nature.
However, seeing as I think we're getting a Sora 2 announcement,
I'm predicting the mouse has finally made up its mind.
And it wasn't just that one prediction.
Back in November, he also wrote,
Disney is becoming an AI company.
At this point, it's simply a matter of who they choose as a partner.
A deal between OpenAI and Disney has seemed close many times over the last year,
but it looks like it's coming down to this week.
To me, this decision is a huge signal for
who will be leading the race a year from now.
It's far bigger than the IP.
It's also the fact that as soon as Disney forms this partnership
and starts using AI for user-created content,
which will begin with video shorts on Disney Plus,
it will use its immense media power to broadcast
that AI is a legitimate creative tool
and will actively encourage its use.
To me, this is the biggest decision of the year
and whoever wins it will have immense main character energy in 2026.
Adding some validity to that argument,
on the same day that they announced this OpenAI deal,
Disney sent a cease-and-desist letter to Google, accusing them of copyright infringement on a massive scale.
Now, there will be a lot more to get into on that particular deal, but bringing you back to GPT-5.2,
I was someone who felt like the 5.1 update was very meaningful.
For my use cases of iterative brainstorming and business strategy collaboration, 5.1 in general,
but 5.1 Pro especially, felt like a major upgrade.
In fact, like Matt Shumer, although in the context of a different model,
I found myself, much more than I was before, skipping the thinking model and going straight to
5.1 Pro. I was finding myself just naturally redesigning my work so that I could let that process
take place while I was doing other things and then come back to it when it was ready.
However, it feels to me like in most ways, OpenAI sees 5.2 as the big next step post-GPT-5.
It almost feels as though 5.1 in personality and capabilities was kind of what they wanted 5 to be,
and 5.2, at least at first glance,
appears to be what they wanted that next intermediate model to be.
5.2 should be rolling out to all paid subscribers over the next day or so,
so we will have some fun figuring out what it does well.
For now, that is going to do it for the AI Daily Brief.
Appreciate you listening or watching as always, until next time, peace.
