The AI Daily Brief: Artificial Intelligence News and Analysis - AGI for Christmas

Starting point is 00:00:00 Today on the AI Daily Brief, did we just get AGI for Christmas? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Hello, friends. For our last regular AI Daily Brief episode of the year, we are skipping the headlines. It's mostly small things, a couple of new AI appointments to the White House, stuff like that. And instead, we are going to spend all of our time on the big discussion from the last three days, which is whether OpenAI just gave us AGI.

Starting point is 00:00:36 What's going on, guys? It has been a very interesting 12 Days of Shipmas from OpenAI. We kicked it off with a full version of O1, maybe the biggest announcement was Sora, but then by the end of last week, it seemed like just maybe we were actually going to get an entire new model. If you listened to my Friday episode, you heard all of the evidence that we were going to get O3, and indeed, that is what happened. Specifically on Friday, OpenAI announced their second generation of reasoning models, O3 and O3 Mini. Now, if you're wondering what the hell the name is about, the company skipped O2 in order to avoid an intellectual property dispute with the large British telco. Sam Altman said that the company was simply upholding its tradition of being truly bad at

Starting point is 00:01:16 names. And just to cut to the chase, while the announcement itself was relatively muted, the conversation that has followed has been all about whether this actually represents something close to AGI. So today we're going to explore all of those arguments and what we should actually think about this. Now, certainly from the numbers they shared, the model seems very good. On a standard coding benchmark, O3 bettered O1 by almost 23 percentage points. It also bested the company's chief scientist on the competitive coding platform Codeforces. In fact, right now, there are fewer than 200 people in the world with a better score on Codeforces, 174 to be exact. In somewhat understated fashion, then, Altman said that the model is, quote, incredible at coding.

Starting point is 00:02:01 The model also achieved a near-perfect score on the AIME math exam, missing only a single question. It achieved an 87.7% on the expert-level science benchmark GPQA Diamond, far exceeding top human performance. Still, while those benchmark results are practical and important, a significant amount of focus, at least in the announcement, was on how O3 performed on the ARC-AGI test. That test attempts to measure a model's ability to deal with novel problems that are difficult to pre-train for. It's viewed as testing reasoning capability at the bare minimum and is one plausible benchmark for when AGI has been achieved. O3 crossed the 85% human performance threshold for AGI, tripling the score achieved by O1. This year's ARC-AGI prize winner scored 53.5% using a fine-tuned model of a novel design

Starting point is 00:02:47 and only a handful of attempts have managed to score higher than 30%, just to give a sense of how far the bar was raised. One of the interesting things about the test is that it's relatively easy to solve for humans using basic logic and reasoning, but has so far stumped AI models. You might have seen these grids of red and blue boxes over the past few days, and this is one of the hardest problems on the test. François Chollet, a legend in machine learning and the creator of the test, wrote, Today, OpenAI announced O3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel

Starting point is 00:03:20 tasks. It scores 75.7% on the semi-private eval in low compute mode for $20 per task in compute, and 87.5% in high compute mode, which costs thousands of dollars per task. It's very expensive, but it's not just brute force. These capabilities are new territory and they demand serious scientific attention. Now, when it comes to the question of AGI, OpenAI is not claiming that title. They are using big language. OpenAI co-founder and president Greg Brockman wrote, O3 is a breakthrough with a step function improvement on our hardest benchmarks.

Starting point is 00:03:51 But that's different, of course, than claiming AGI. One of the first noteworthy opinions on whether this is AGI came from Chollet himself. In the thread announcing the test results, he commented, So is this AGI? While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI. There's still a fair number of very easy ARC-AGI-1 tasks that O3 can't solve. And we have early indicators that ARC-AGI-2 will remain extremely challenging for O3. This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge.

Starting point is 00:04:26 We will have AGI when creating such evals becomes outright impossible. As a total aside on the ARC Prize itself, that test is run on a fully private set of questions and must be completed using just 10 cents of compute per task. The team is committed to keeping those parameters until someone releases an open source model that can achieve an 85% score. Chollet believes version one of this test is now saturated and no longer a useful benchmark, but expects version two to present a much greater challenge. He added, What does this mean for the future of AGI research?

Starting point is 00:04:55 For me, the main open question is where the scaling bottlenecks for the techniques behind O3 are going to be. If human annotated training data is a major bottleneck, for instance, capabilities would start to plateau quickly like they did for LLMs until the next architecture. If the only bottleneck is test-time search, we will see continued scaling in the future. I think 2025 will be the year of the open-source reproduction of these techniques. The ARC Prize 2025 leaderboards will be the best place to monitor reproduction attempts. Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.

Starting point is 00:05:31 Vanta automates compliance for SOC 2, GDPR, and leading AI frameworks like ISO 42001 and the NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI. Over 8,000 global companies like LangChain, Lila AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at Vanta.com slash NLW. That's Vanta.com slash NLW. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode. That's why

Starting point is 00:06:27 Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business. If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw@besuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Now, of course, the question of whether a thing is or isn't AGI is ultimately

Starting point is 00:07:06 less relevant than how good it is at doing things that people currently do, and what that's going to mean for jobs, the economy, and society. The main place where this conversation was happening was among developers. Florian Mai writes, O3 is better than 99.95% of programmers. The public needs to wake up to what's happening so we can act responsibly. For that to happen, we first need the scientific community to acknowledge the evidence. This is the most important problem of our time. AI entrepreneur Sully writes, Yeah, it's over for coding with O3. This is mind-boggling. Looks like the first big jump since GPT-4, because these numbers make zero sense.

Starting point is 00:07:41 Still, some are pointing out that coding competitions don't necessarily translate to real-life problems. Machine learning instructor Santiago wrote, O3 is better than 99.95% of programmers solving Codeforces problems. 99.99% of professional programmers don't need to do Codeforces problems to make a living. There's absolutely no proof that O3 is capable of doing what those professional programmers do to make money. He continued, I'm not downplaying how much the world is changing. My argument is about what exactly performing well on software engineering benchmarks tells us and how it's related to the current work of software engineers. The other benchmarks were no less impressive

Starting point is 00:08:15 or paradigm shifting. Deedy Das, a VC at Menlo Ventures, tried to describe just how wild the math benchmark is, commenting, 99.99% of people cannot comprehend how insane FrontierMath is. The problems are created by math professors and not in any training data. Math legend Terry Tao said these are extremely challenging. I think they will resist AIs for several years at least. OpenAI's O3 did 25% on this. At this stage, no other model has completed more than a single question. And this is where we started to see some big think implication type conversations. Stability AI co-founder Emad Mostaque wrote, My take on O3, the global economy is cooked. We need a new economic and societal framework.

Starting point is 00:08:53 Any work that could be done on the other side of a computer screen, AI will be able to do at a fraction of the price. Harry Law, a Google DeepMind and Cambridge University alumnus, wrote, At $3,000 per task, O3 is already a more cost-effective solution than hiring McKinsey. And while I think the analogy is not even close to perfect, there is an important point here that a lot of these numbers only seem expensive when they're placed in the context of software, not so much when they're labor replacement. Nick Cammarata writes, I set my AI expectations to unrealistically high bonkers AI world,

Starting point is 00:09:23 and I still underestimated recent progress. And while I don't want to go deep into the debates around what this means for the singularity and hard takeoff and all these sort of theories, at least not in this particular episode, what's important relative to O3 is that they're a part of the conversation. One thing that was notable was how big the gap is between the internal AI conversation and what's being reported. Adam D'Angelo writes, Wild that the O3 results are public and yet the market still isn't pricing in AGI. Bloomberg reported it as just another leg of the race between OpenAI and Google. The Wall Street Journal ran a feature story about delays to GPT-5 with the headline,

Starting point is 00:09:59 The Next Great Leap in AI is behind schedule and crazy expensive. And yet, for all the discussion around how cooked people are and all this sort of stuff, I think it's really important to have some perspective here as well. Replit CEO Amjad Masad said, The idea that O3 will automate software engineers is silly. Object Zero writes, Reminds me of the time we mechanized agriculture and made 80% of all jobs obsolete,

Starting point is 00:10:21 showing a chart, of course, that showed how the invention set the stage for humanity's population to grow exponentially over the following centuries, as food supply was no longer a constraint. Matt Griswold pointed out that the replacement of developers is progressing at a much slower pace than the advancement of the technology,

Starting point is 00:10:35 commenting, Hot take, there are millions of GPT-4-level human software engineers working. If O3 scares you, you're probably fine. Professor Ethan Mollick makes a point that I do all the time. The reason everything will not change quickly, he writes, even if AI generally exceeds human capabilities across fields, is in large part the nature of systems. Organizational and societal change is much slower than technological change,

Starting point is 00:10:55 even when the incentives to change quickly are there. Human social and organizational inertia is going to be a slowdown force that helps us have time to adapt. Julian McCoy flips it around and says, Hype about O3 misses the plot. This isn't about AI getting smarter. It's about humans getting freer. No more data entry, no more mundane tasks, no more trading time for money.

Starting point is 00:11:16 Cushy has a similar point. If you view the O3 launch as anything less than irrefutable evidence that this is the most exciting time to be alive, you may need to take a deep breath and rekindle your optimism. And for those trying to figure out where they spend their time now that this exists, Bojan Tunguz writes, I've been telling you for a while not even to try to compete with machines at being a better machine. Instead, try competing with humans at being a better human. Look, call me optimistic, but at the end of the day, I just think that the entire history of human experience points

Starting point is 00:11:44 towards the output of this explosion of intelligence being a massive increase in human creation. We're going to make more stuff. We're going to make more code. We're going to make more products. We're going to make more entertainment. None of which is to say that the disruptions along the way won't be painful and we do need to deal with them. But I continue to think that the future is going to be even more exciting than the present. And that seems to me to be a pretty good way to close out 2024. Now, this will be the last regular AI Daily Brief episode of the year. From here on, I've got a number of end of year episodes, which I'm really excited about. We've got the 15 most important AI products of the year, 25 predictions for agents, and a bunch more. For now, though,

Starting point is 00:12:23 can't tell you how much I appreciate you guys watching, listening, hanging out with me here every day. I hope that you are headed into a wonderful holiday season. Until next time, peace. Peace.
