The AI Daily Brief: Artificial Intelligence News and Analysis - AGI for Christmas
Episode Date: December 24, 2024
Explore OpenAI's latest achievements with O3, the reasoning model that sparked conversations about its proximity to AGI. This episode unpacks its groundbreaking performance on benchmarks like ARC, Codeforces, and math challenges while addressing the implications for jobs, coding, and society. Hear expert insights on whether O3 signals the dawn of AGI or a significant milestone in AI’s evolution. Brought to you by: Vanta - Simplify compliance - https://vanta.com/nlw The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, did we just get AGI for Christmas?
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Hello, friends.
For our last regular AI Daily Brief episode of the year, we are skipping the headlines.
It's mostly small things, a couple of new AI appointments to the White House, stuff like that.
And instead, we are going to spend all of our time on the big discussion from the last three days,
which is whether OpenAI just gave us AGI.
What's going on, guys? It has been a very interesting 12 Days of Shipmas from OpenAI.
We kicked it off with a full version of O1. Maybe the biggest announcement was Sora, but then by
the end of last week, it seemed like just maybe we were actually going to get an entire new model.
If you listen to my Friday episode, you heard all of the evidence that we were going to get
O3, and indeed, that is what happened. Specifically on Friday, OpenAI announced their second generation
of reasoning models, O3 and O3 Mini. Now, if you're wondering what the hell the name is about,
the company skipped O2 in order to avoid an intellectual property dispute with the large British
telco. Sam Altman said that the company was simply upholding its tradition of being truly bad at
names. And just to cut to the chase, while the announcement itself was relatively muted,
the conversation that has followed has been all about whether this actually represents
something close to AGI. So today we're going to explore all of those arguments and what we
should actually think about this. Now, certainly from the numbers they shared, the model seems very
good. On a standard coding benchmark, O3 bettered O1 by almost 23 percentage points. It also bested
the company's chief scientist on the competitive coding platform Codeforces. In fact, right now,
there are fewer than 200 people in the world with a better score on Codeforces, 174 to be exact.
In somewhat understated fashion, then, Altman said that the model is, quote, incredible at coding.
The model also achieved a near-perfect score on the AIME math exam, missing only a single question.
It achieved an 87.7% on the expert level science benchmark GPQA Diamond, far exceeding top human performance.
Still, while those benchmark results are practical and important, a significant amount of focus, at least in the announcement, was on how O3 performed on the ARC-AGI test.
That test attempts to measure a model's ability to deal with novel problems that are difficult to pre-train for.
It's viewed as testing reasoning capability at the bare minimum and is one plausible benchmark
for when AGI has been achieved.
O3 crossed the 85% human performance threshold for AGI, tripling the score achieved by O1.
This year's ARC Prize winner scored 53.5% using a fine-tuned model of a novel design
and only a handful of attempts have managed to score higher than 30%, just to give a sense of how
far the bar was raised.
One of the interesting things about the test is that it's relatively easy to solve for humans
using basic logic and reasoning, but has so far stumped AI models. You might have seen these
grids of red and blue boxes over the past few days, and this is one of the hardest problems on the
test. François Chollet, a legend in machine learning and the creator of the test, wrote,
Today, OpenAI announced O3, its next-gen reasoning model. We've worked with OpenAI to test it
on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel
tasks. It scores 75.7% on the semi-private eval in low compute mode for $20 per task in compute,
and 87.5% in high compute mode, which costs thousands of dollars per task.
It's very expensive, but it's not just brute.
These capabilities are new territory and they demand serious scientific attention.
Now, when it comes to the question of AGI, OpenAI is not claiming that title.
They are using big language.
OpenAI co-founder and president Greg Brockman wrote,
O3 is a breakthrough with a step function improvement on our hardest benchmarks.
But that's different, of course, than claiming AGI.
One of the first noteworthy opinions on whether this is
AGI came from Chollet himself. In the thread announcing the test results, he commented, So is this
AGI? While the new model is very impressive and represents a big milestone on the way towards AGI,
I don't believe this is AGI. There's still a fair number of very easy ARC-AGI-1 tasks that O3
can't solve. And we have early indicators that ARC-AGI-2 will remain extremely challenging for
O3. This shows that it's still feasible to create unsaturated, interesting benchmarks that are
easy for humans yet impossible for AI, without involving specialist knowledge.
We will have AGI when creating such evals becomes outright impossible.
As a total aside on the ARC Prize itself, that test is run on a fully private set of questions
and must be completed using just 10 cents of compute per task.
The team is committed to keeping those parameters until someone releases an open source
model that can achieve an 85% score.
Chollet believes version 1 of this test is now saturated and no longer a useful benchmark,
but expects version 2 to present a much greater challenge.
He added, what does this mean for the future of AGI research?
For me, the main open question is where the scaling bottlenecks for the techniques behind O3 are going to be.
If human annotated training data is a major bottleneck, for instance, capabilities would start
to plateau quickly like they did for LLMs until the next architecture.
If the only bottleneck is test-time search, we will see continued scaling in the future.
I think 2025 will be the year of the open-source reproduction of these techniques.
The ARC Prize 2025 leaderboards will be the best place to monitor reproduction attempts.
Today's episode is brought to you by Vanta.
Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI Risk Management Framework, saving you time and money while helping you build customer trust.
Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI.
Over 8,000 global companies like LangChain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time.
Learn more at vanta.com/nlw. That's vanta.com/nlw.
If there is one thing that's clear about AI in 2025, it's that the agents are coming.
Vertical agents by industry, horizontal agent platforms, agents per function.
If you are running a large enterprise, you will be experimenting with agents
next year. And given how new this is, all of us are going to be back in pilot mode. That's why
Superintelligent is offering a new product for the beginning of this year. It's an agent readiness
and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to
understand what type of agents make sense for you to test, what type of infrastructure support
you need to be ready, and to ultimately come away with a set of actionable recommendations that get you
prepared to figure out how agents can transform your business. If you are interested in the agent readiness
and opportunity audit, reach out directly to me, nlw@besuper.ai. Put the word agent in the subject
line so I know what you're talking about. And let's have you be a leader in the most dynamic part
of the AI market. Now, of course, the question of whether a thing is or isn't AGI is ultimately
less relevant than how good it is at doing things that people currently do, and what that's
going to mean for jobs, the economy, and society. The big place where this conversation was taking
place was around developers. Florian Mai writes, O3 is better than 99.95% of programmers.
The public needs to wake up to what's happening so we can act responsibly.
For that to happen, we first need the scientific community to acknowledge the evidence.
This is the most important problem of our time.
AI entrepreneur Sully writes, Yeah, it's over for coding with O3. This is mind-boggling.
Looks like the first big jump since GPT-4, because these numbers make zero sense.
Still, some are pointing out that coding competitions don't necessarily
translate to real-life problems. Machine learning instructor Santiago wrote,
O3 is better than 99.95% of programmers solving Codeforces problems. 99.99% of professional
programmers don't need to do Codeforces problems to make a living. There's absolutely no
proof that O3 is capable of doing what those professional programmers do to make money.
He continued, I'm not downplaying how much the world is changing. My argument is about what exactly
performing well on software engineering benchmarks tells us and how it's related to the current
work of software engineers. The other benchmarks were no less impressive or
paradigm shifting. Deedy Das, a VC at Menlo Ventures, tried to describe just how wild the math
benchmark is, commenting, 99.99% of people cannot comprehend how insane FrontierMath is.
The problems are created by math professors and not in any training data.
Math legend Terry Tao said these are extremely challenging. I think they will resist
AIs for several years at least. OpenAI's O3 did 25% on this. At this stage, no other model has
completed more than a single question. And this is where we started to see some big think implication
type conversations. Stability AI co-founder Emad Mostaque wrote,
My take on O3, the global economy is cooked. We need a new economic and societal framework.
Any work that could be done on the other side of a computer screen, AI will be able to do at a
fraction of the price. Harry Law, a Google DeepMind and Cambridge University alum, wrote,
at $3,000 per task, O3 is already a more cost-effective solution than hiring McKinsey.
And while I think the analogy is not even close to perfect, there is an important point here
that a lot of these numbers only seem expensive when they're placed in the context of software,
not so much when they're labor replacement.
Nick Cammarata writes,
I set my AI expectations to unrealistically high bonkers AI world,
and I still underestimated recent progress.
And while I don't want to go deep into the debates around what this means for the singularity
and hard takeoff and all these sort of theories, at least not in this particular episode,
what's important relative to O3 is that they're a part of the conversation.
One thing that was notable was how big the gap is between the internal AI conversation and what's being reported.
Adam D'Angelo writes, Wild that the O3 results are public and yet the market still isn't pricing in AGI.
Bloomberg reported it as just another leg of the race between OpenAI and Google.
The Wall Street Journal ran a feature story about delays to GPT-5 with the headline,
The Next Great Leap in AI Is Behind Schedule and Crazy Expensive.
And yet, for all the discussion around how cooked people are and all this sort of stuff,
I think it's really important to have some perspective here as well.
Replit CEO Amjad Masad said,
the idea that O3 will automate software engineers is silly.
Object Zero writes,
Reminds me of the time we mechanized agriculture
and made 80% of all jobs obsolete,
showing a chart, of course,
that showed how the invention set the stage
for humanity's population to grow exponentially
over the following centuries,
as food supply was no longer a constraint.
Matt Griswold pointed out that the replacement of developers
is progressing at a much slower pace
than the advancement of the technology,
commenting,
Hot take, there are millions of GPT-4-level human software engineers working.
If O3 scares you, you're probably fine.
Professor Ethan Mollick makes a point that I make all the time.
The reason everything will not change quickly, he writes,
even if AI generally exceeds human capabilities across fields,
is in large part the nature of systems.
Organizational and societal change is much slower than technological change,
even when the incentives to change quickly are there.
Human social and organizational inertia is going to be a slowdown force
that helps us have time to adapt.
Julian McCoy flips it around and says,
Hype about O3 misses the plot.
This isn't about AI getting smarter.
It's about humans getting freer.
No more data entry, no more mundane tasks, no more trading time for money.
Cushy has a similar point.
If you view the O3 launch as anything less than irrefutable evidence that this is the most exciting time to be alive,
you may need to take a deep breath and rekindle your optimism.
And for those trying to figure out where they spend their time now that this exists,
Bojan Tunguz writes,
I've been telling you for a while not even to try to compete with machines at being
a better machine. Instead, try competing with humans at being a better human. Look, call me optimistic,
but at the end of the day, I just think that the entire history of human experience points
towards the output of this explosion of intelligence being a massive increase in human creation.
We're going to make more stuff. We're going to make more code. We're going to make more products.
We're going to make more entertainment. None of which is to say that the disruptions along the way
won't be painful and we do need to deal with them. But I continue to think that the future is
going to be even more exciting than the present. And that seems to me to be a pretty good way to
close out 2024. Now, this will be the last regular AI Daily Brief episode of the year. From here on,
I've got a number of end of year episodes, which I'm really excited about. We've got the 15 most
important AI products of the year, 25 predictions for agents, and a bunch more. For now, though,
can't tell you how much I appreciate you guys watching, listening, hanging out with me here every day.
I hope that you are headed into a wonderful holiday season. Until next time, peace.
Peace.
