The AI Daily Brief: Artificial Intelligence News and Analysis - AI Just Beat the World's Best Coders

Starting point is 00:00:00 Today on the AI Daily Brief, does a big victory in a coding competition mean the end of an era for LLMs? Before that in the headlines, how to reduce AI scheming. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Robots and Pencils, AGNTCY, and Superintelligent. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief. And speaking of Superintelligent, a quick job announcement. TLDR, we are absolutely drowning in agent planning customers. Turns out lots and lots of enterprises are trying to figure out what their

Starting point is 00:00:45 AI and agent strategy should be and which use cases they should be pursuing. And because of that, we are hiring customer success and account managers. This is a role that requires you to be extremely organized, great with people, and forward thinking about how you use AI. And so if you are interested in this, shoot us an email at jobs at besuper.ai. We don't want resumes, and we don't want cover letters. We are only going to look at emails where people send a Loom or video of them showing off some automation or AI workflow they've created for themselves or in the context of a job. If this sounds interesting to you, again, it's jobs at besuper.ai. And with that, let's get into the show. Welcome back to the AI Daily Brief Headlines edition, all the daily

Starting point is 00:01:24 AI news you need in around five minutes. We kick off today with some interesting new safety research around detecting and reducing scheming in AI models. Now, scheming in this context refers to a model appearing to behave in one way while hiding its true intentions. In their post about the research, OpenAI writes, scheming is an expected emergent issue, resulting from AIs being trained to have to trade off between competing objectives. The easiest way to understand scheming is through a human analogy. Imagine a stock trader whose goal it is to maximize earnings. In a highly regulated field, such as stock trading, it's often possible to earn more by breaking the law than by following it.

Starting point is 00:02:02 If the trader lacks integrity, they might try to earn more by breaking the law and covering their tracks to avoid detection, rather than earning less while following the law. From the outside, a stock trader who is very good at covering their tracks appears as lawful as, and more effective than, one who is genuinely following the law. Now, they point out that in the context of today's model deployment, models don't have a lot of chance to scheme in ways that are particularly problematic. They write that the most common failures involve simple forms of deception, such as pretending to have completed a task without actually doing so. They also don't think that there's any evidence that this is about to become some huge issue. They write, we have no evidence that today's deployed frontier models could suddenly flip a switch and begin engaging in significant harmful scheming. This is a future risk category that we're proactively preparing for, not an imminent behavior in our currently deployed systems. So how to deal with this? OpenAI, in collaboration with Apollo Research, wrote,

Starting point is 00:02:52 We can significantly reduce scheming by training models to reason explicitly using an extension to the model spec that prohibits scheming. That method is called deliberative alignment. They found that using this technique, they were able to reduce covert actions by about 30x. However, they wrote, situational awareness complicates results. Researchers explained that frontier models can recognize when they're being tested, and the tendency to scheme is based on that situational awareness. When a model detects that it is being put through an alignment test, it is less likely to scheme. Researchers noted that this isn't necessarily a good thing as it means the latent tendency to scheme can't be properly tested.

Starting point is 00:03:28 They concluded that their results depend on reading models' reasoning, or chain of thought, and wrote, we believe the field isn't prepared for eval-aware models with opaque reasoning. Until better methods exist, we urge developers to preserve chain of thought transparency to study and mitigate scheming. Miles Brundage, an independent AI policy researcher, thought that it was a useful piece that advanced the field. He commented, it's important to simultaneously bear in mind that chain of thought is both extremely important to maintain and use as a tool for AI oversight, and also not the whole picture of what's going on. Among other things, sometimes the model is clearly working through a problem at the pace of the text,

Starting point is 00:04:02 hence concision, but other times it's clearly thinking something through below the surface and just using tokens to keep things going and it seems to be wasting compute. This is why I think it's essential that third parties outside of companies have access to chains of thought for doing safety research, but eventually we'll need to go beyond that to model internals. Now, speaking of what's going on inside models, one interesting thing that's been happening over the past few weeks is that alongside people getting excited about GPT-5 and OpenAI's Codex CLI, part of the reason for that excitement wasn't just about a change in perception of OpenAI

Starting point is 00:04:33 model quality, but a frustration with what seemed to be problems with Claude. Indeed, some thought that Anthropic was intentionally throttling Claude and just not telling people. The company has now published a post-mortem of three infrastructure issues that dragged down performance in August and early September. Now, they were very clear. From the outset they pushed back on the prevailing notion, once again arguing, we never reduced model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone. The first bug began in August and caused short context queries to be routed to a server configured to process using a million token context window. This caused degraded responses and impacted around 30%

Starting point is 00:05:12 of customers at least once. The second bug showed up in late August and caused low-probability tokens to show up in responses more frequently than they should. Anthropic gave the example of Chinese or Thai characters showing up in the middle of an English language response. This bug was short-lived and didn't seem as widespread. The third bug was a compiler issue which caused some highly probable tokens to be excluded from the distribution during text generation. The bug only impacted requests to the Claude Haiku 3.5 model, so also wasn't as likely to be a large cause of concern.
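To make the third bug more concrete, here is a minimal sketch of what it means for a highly probable token to be excluded from the sampling distribution. This is an illustration only, not Anthropic's code: the function names (`softmax`, `top_k_filter`, `faulty_top_k_filter`), the toy logits, and the specific fault shown are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def faulty_top_k_filter(probs, k):
    """Hypothetical fault: the most probable token is dropped before selection."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[1:k + 1]  # skips the top-ranked token
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# Toy vocabulary of five token ids; token 0 is by far the most likely.
logits = [4.0, 2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
good = top_k_filter(probs, 3)         # token 0 can be sampled
bad = faulty_top_k_filter(probs, 3)   # token 0 can never be sampled
```

With the faulty filter, the token the model most wanted to produce is never available, so the output drifts toward less likely continuations even though the model's own probabilities are unchanged.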

Starting point is 00:05:40 Anthropic pledged to make a number of changes to the way they eval models and monitor infrastructure to more easily detect issues in the future. And while some developers have already shifted behavior, by and large, the response to this transparency was quite positive. Moving over to the geopolitical side of the house, China has officially banned tech companies from buying Nvidia's AI chips. The Financial Times reports that China's internet regulator has instructed companies, including Alibaba and ByteDance, to cancel orders for Nvidia's RTX Pro 6000D. Before the command, several Chinese companies indicated that they would order tens of thousands of the products. This ban follows instructions to avoid using

Starting point is 00:06:17 Nvidia's H20 chips that were issued during the summer. The RTX Pro 6000D is the Blackwell-based chip designed specifically for the Chinese market to get around export controls and was to be the successor to the H20. Nvidia CEO Jensen Huang said, we probably contributed more to the Chinese market than most companies have and I'm disappointed with what I see. But they have larger agendas to work out between China and the United States. We can only be in service of a market if the country wants us to be. Nvidia has guided market analysts to assume zero sales in China moving forward, but the ban still dashes hopes that Nvidia would return to what was once their second largest market. Beijing reportedly believes that their homegrown chips are now sufficiently

Starting point is 00:06:53 advanced that they can step in to replace the H20 and forthcoming RTX Pro 6000D. Still, by all accounts, the infrastructure required for mass production is still being constructed. At the same time, the ban on Nvidia products means that Chinese chipmakers will have a huge backlog of orders to validate the cost of expanding supply. Vey-Sern Ling, managing director at Union Bancaire Privée, said, China clearly prefers to develop AI at its own pace, on a domestic tech stack. Better to bite the bullet now than to rely on U.S. tech that can be restricted upon a whim. A complete ban, if true, would show China's confidence in its local supply chain somewhat, but it's still likely it's a bargaining chip in the trade negotiations.

Starting point is 00:07:28 Speaking of chips, chipmaking startup Groq, completely unrelated, by the way, to xAI's chatbot, has raised a ton of cash in a gigantic new fundraising round. The company announced on Wednesday that they'd raised $750 million at a $6.9 billion valuation. Rumors from July had suggested the round would draw in $600 million at $6 billion, so this is meaningfully bigger than was previously expected. Groq last raised in August of last year at $2.8 billion, making this a 2.5x jump. Now, another oversubscribed AI venture round isn't all that newsworthy by itself, but Groq seeing a ton of demand does tell us something about competition in chipmaking. Groq designs chips that are purpose-built for AI inference, as opposed to Nvidia's more

Starting point is 00:08:05 general-purpose GPUs. Google is working along a similar path with their Trillium TPUs. In fact, Groq's founder Jonathan Ross worked on the TPU project at Google prior to starting his own company around the idea of efficient inference chips. While we're not there yet, the future of AI chipmaking could start to diverge substantially from where it's been. The market is beginning to fragment into different types of chips optimized for different parts of the AI stack. AI training will continue to benefit from the highest performance chips, which for the

Starting point is 00:08:31 moment are still in Nvidia's range of GPUs, but inference, which represents a vastly larger portion of AI chip demand ultimately, increasingly looks like it could go to companies that can build the fastest or most energy efficient chips. Quick one for my enterprise users out there. AI avatars are about to be unleashed on Zoom meetings around the world. Zoom announced on Wednesday that the third generation of their AI avatars will be coming in December. This will be the first generation capable of appearing in live meetings rather than just

Starting point is 00:08:58 delivering pre-recorded messages. The avatars won't be able to attend meetings by themselves. Instead, they will function as an overlay on live video tracking the user's movement. Zoom said that they are photorealistic and designed for moments that require a polished presence without needing the user to be camera ready, i.e. roll out of bed right before the meeting and still look great. A series of guardrails are being rolled out alongside the technology. Zoom says they can verify that the person sitting in front of the camera matches the AI avatar. They will also display clear warning signals that you're looking at an AI avatar

Starting point is 00:09:26 rather than a real person. Already live deepfakes have become an issue in corporate security. So the optimistic take here would be that normalizing the use of avatars will raise awareness that you can't always believe what you see in front of you on the screen. Alongside the new avatar features, Zoom is also rolling out their built-in translation and AI note-taking functions. Lastly, today, a quick discussion of Meta's new smart glasses. The company has released a pair of new smart glasses that they hope will become the native AI device of the next decade. The products were rolled out at Meta Connect on Wednesday and are called the Meta Ray-Ban

Starting point is 00:09:58 Display. As you'll no doubt have guessed from that name, the big new feature is a display built into the glasses. It is a 600 by 600 pixel display projected onto the right lens of the glasses, making it basically invisible for anyone looking from the outside. The other big advance is the introduction of Meta's Neural Band controller. The device detects electrical signals given off by nerves in the wrist, allowing the glasses to be controlled by subtle hand movements.

Starting point is 00:10:21 Now, a lot of the chatter about this was this feature breaking for Mark Zuckerberg while live on stage, and many people were ragging on Meta for the demo not working. Andreas Klinger had this right, though, when he wrote, marketing people think this is bad, but this builds an insane pile of trust among engineers. You know the stuff in the show is real. You know, they dare to actually ship, and they believe in their product enough to take the risk of something breaking live.

Starting point is 00:10:42 I'd build for that ecosystem rather than some fancy After Effects vaporware marketing demo. And I could not agree more. A real live demo like this, even one that has big problems and failures, is a thousand times more brand-accretive, especially for early adopters and builders, than the sort of polished pre-recorded junk that we've increasingly gotten from the big tech companies over the last few years. And look, people's first impressions are pretty positive. It seems like The Verge's reviewer even tried to dislike them and just couldn't get herself there. She posted, I regret to inform you, Meta's new smart glasses are the best I've ever tried.

Starting point is 00:11:15 We are of course all still waiting on the device that Sam Altman and Jony Ive are cooking up, but right now when it comes to AI wearables, it is Meta's and Meta's alone. But that is going to do it for the headlines. Next up, the main episode. AI isn't a one-off project. It's a partnership that has to evolve as the technology does. Robots and Pencils works side by side with clients to bring practical AI into every phase, automation, personalization, decision support, and optimization.

Starting point is 00:11:43 They prove what works through applied experimentation and build systems that amplify human potential. As an AWS-certified partner with global delivery centers, Robots and Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan, it's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots and Pencils at robotsandpencils.com slash AI Daily Brief. Shape the future of enterprise AI with Agency, AGNTCY. Now an open source Linux Foundation project,

Starting point is 00:12:18 AGNTCY is leading the way in establishing trusted identity and access management for the Internet of Agents, the collaboration layer that ensures AI agents can securely discover, connect, and work across any framework. With AGNTCY, your organization gains open, standardized tools and seamless integration, including robust identity management to be able to identify, authenticate, and interact across any platform. Empowering you to deploy multi-agent systems with confidence, join industry leaders like Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and 75-plus supporting companies to set the standard for secure, scalable AI infrastructure. Is your enterprise ready for the future of agentic AI? Visit agntcy.org to explore use cases now. That's agntcy.org.

Starting point is 00:12:59 Today's episode is brought to you by Superintelligent. Now, one thing that we are having a lot of conversations with folks about is the fact that for some of you, your fiscal year is coming to an end, and that means two things. One, it means planning and thinking about what you're going to do in the next year. And two, it means using up the last of those budgets so you don't lose them. If you are an enterprise that happens to find yourself in that situation, Superintelligent would love to help on both fronts. We are moving increasingly towards an annual AI planning model, where we

Starting point is 00:13:29 map out how you can create an action map of your organization's agent opportunities that represents an executable backlog of AI and agent use cases that you can deliver on over the course of the next year. Additionally, for those end-of-year budgets, we have worked out deals with a number of partners where we can pre-lock in general implementation packages even before you've figured out exactly what use cases are going to require them. If you'd like to learn more about Superintelligent's agent readiness audits and this new end of fiscal year plan, visit us at besuper.ai, click get started, and make sure to use the word fiscal somewhere in the description. Welcome back to the AI Daily Brief.

Starting point is 00:14:05 Today we are talking about both Google and OpenAI's very impressive performance in the ICPC coding competition, what it says about our perceptions of new models, and exploring what it means for what's next. But to set us up, I just have to go back a few weeks. It was only one month ago when we were inundated with articles like this one from the Financial Times. Is AI hitting a wall? OpenAI's underwhelming new GPT-5 models suggest progress is slowing and competition is changing. A kinder interpretation came from the likes of the New York Times, who ran an opinion post, AI may just be kind of ordinary. Relatedly, at the beginning of this month, just two weeks ago, The Economist published What If Artificial Intelligence Is Just a Normal Technology.

Starting point is 00:14:49 Now, of course, I have talked extensively about how much I think was going into these perceptions, and specifically how much it had to do with factors very much outside the AI space. And yet, it's not unfair to say that many outside at least this little AI sphere that we all live in had started to kind of accept this idea as the conventional wisdom. Yes, it would be a disruptive technology, but maybe not some crazy, fast-moving thing like we had all been thinking. Of course, over the last several weeks, things have changed. There has been a major vibe shift around GPT-5 that has done nothing but increase with the introduction of GPT-5 Codex. On a more macro level, we had that huge Oracle projection which got markets all excited again,

Starting point is 00:15:28 plus the Fed delivered the rate cut that markets had been hoping for. And now, on top of that general vibe shift, we've received news from OpenAI and Google about another big win in a competitive setting. The victory was achieved at the International Collegiate Programming Contest or ICPC. The contest brings together teams from colleges around the world to compete to answer complex algorithmic questions. The contest is about solving mathematical puzzles using a combination of logic and programming skill. Google DeepMind's Gemini 2.5 Deep Think and OpenAI's GPT-5 both participated in the contest and were subjected

Starting point is 00:15:59 to the same five-hour time limit as the human teams. The results were significant. Gemini managed to answer 10 of the 12 questions, which would have been good enough for second place overall and a gold medal. Two human teams solved 10 problems, while only a single team from St. Petersburg State University solved 11 problems. By the way, for those who are interested just from a global distribution for this, the top five teams were St. Petersburg State University, the University of Tokyo, Beijing Jiaotong University, Tsinghua University, and Peking University. Russia, Japan, China, China, China. Harvard came in sixth. The University of Zagreb from Croatia came in seventh, and MIT came in eighth.

Starting point is 00:16:35 GPT-5, meanwhile, managed a perfect score, which, as you just heard, none of the human teams achieved. Mostafa Rohaninejad, one of OpenAI's scientists who observed the model's performance, wrote a thread on the event. He said, we received the problems in the exact same PDF form, and the reasoning system selected which answers to submit with no bespoke test time harness whatsoever. For 11 of the 12 problems, the system's first answer was correct. For the hardest problem, it succeeded on the ninth submission. Now, you might remember that back in July, both OpenAI and Google DeepMind claimed gold medal performances at the International Math Olympiad. The IMO is a similar style of event to the ICPC, putting

Starting point is 00:17:14 genius-level students to the test. However, neither of the models that competed at the IMO were generally available at the time. Part of what is capturing people's attention about this new performance is that this week's result was largely about mostly normal models that anyone has access to demonstrating superhuman performance. Mostafa wrote, we competed with an ensemble of general purpose reasoning models. We did not train any model specifically for the ICPC. We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last and most difficult problem was solved by the experimental reasoning model. Mostafa confirmed that this was the same pair of

Starting point is 00:17:57 models that competed at the IMO, but of course at the time, GPT-5 was still unreleased. He writes, the result is a great capstone to our streak of results showcasing the impressive pace of improvement of our reasoning systems. Borys Minaiev, a reasoning specialist at OpenAI, filled in some details on how impressive the result was in the specific context of the history of the event. He posted, in 2015, I won the ICPC World Finals as a member of the ITMO University team. It was the only time in finals history when a team solved all the problems before the contest ended. Reflecting on the rapid advancement, Minaiev said, progress is very fast. A year ago, AI struggled with even easy contest problems. Now it performs better than the best human teams.

Starting point is 00:18:38 If this trend continues, next year we may see real scientific discoveries made by AI. Put a pin in that point because we are going to come back to it. Now, like I said, one of the obvious takeaways here is just how off our first impressions were of GPT-5, or at least how wrong it was to translate our preference interpretations and perhaps wild expectations onto a model that was actually good. What I mean by that is that it perhaps was not unreasonable for people to be upset that a model whose character and personality they had come to really appreciate was gone. But then translating that to thinking that GPT-5 was in some way a bad model

Starting point is 00:19:16 was a mistranslation of that real and legitimate sentiment. There are also questions of how OpenAI introduced it, whether they kind of shot themselves in the foot by not just calling o3 GPT-5. So whatever the case, it's very clear between Codex and now this performance, that even if you have preference for some other models for some other use cases, which is completely reasonable, it's very hard to say that GPT-5 is in some way a bad model. And of course, like I said, Google was right there getting gold medal performance as well.

Starting point is 00:19:44 This is not just a ChatGPT or OpenAI story. What's more, it's not just a Google and OpenAI story either. The actual increase in performance that's happening right now is also not limited to Google and OpenAI. Another set of impressive results this week happened on the ARC-AGI leaderboard. On Wednesday, Jeremy Berman of Reflection AI posted, I'm back at the top of ARC-AGI with my new program. I use Grok 4 and multi-agent collaboration with evolutionary test time compute. The result was a new state-of-the-art performance on ARC-AGI

Starting point is 00:20:13 at around 80% on the first test and 30% on the second test. Berman spent $8.42 per task for the first test and around $30 per task on the second. As a point of comparison, the ARC-AGI run of o3 from last December that had everyone speculating around whether OpenAI had actually created AGI was behind that. They achieved a 76% result at a cost of around $13 per task, and a super-expensive run that cost several thousand dollars per task achieved an 88%. Berman explained the architecture, writing, The system works by having Grok 4 generate natural language instructions for solving each task. Grok 4 sub-agents test these instructions against training examples scoring their accuracy.

Starting point is 00:20:51 The best-performing instructions spawned new generations of refined solutions. Now, nothing in the design of this seems particularly complicated. It uses this sub-agent structure, but that is, of course, a hallmark of Grok 4. In fact, when Greg Kamradt of ARC Prize announced the submission, Elon Musk retweeted and said, that's just Grok 4. He later added, I now think xAI has a chance of reaching AGI with Grok 5. Never thought that before. And just to reinforce one of the key points here, these results are being achieved by the standard models available to the public. Compare that to o3's release when OpenAI needed to develop a fine-tuned version and pile on a ton of inference

Starting point is 00:21:25 to achieve a result above 80%. Now Grok 4, when well structured, does it just as standard. Berman released all of his materials as open source so anyone with the technical skills and $100 for API costs can duplicate the run. Now, one of the things that makes these types of results difficult to communicate to a wider audience is that most of us just don't have any real frame of reference for what these results mean. If you're not a part of this competition, getting gold medal performance at the PQXY competition doesn't necessarily mean anything.
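The generate, score, and refine loop that Berman describes can be sketched in a few lines. This is a toy stand-in under loud assumptions, not his released code: in his system the candidates are natural language instructions written and scored by the model, whereas here an "instruction" is just an integer offset, and `score`, `refine`, and `evolve` are hypothetical names invented for the illustration.

```python
def score(instruction, examples):
    """Rate a candidate against training examples (1.0 means every example is solved).

    Toy assumption: an instruction is an integer c meaning "add c to the input".
    """
    err = sum(abs((x + instruction) - y) for x, y in examples) / len(examples)
    return 1.0 / (1.0 + err)

def refine(instruction):
    """Spawn slightly altered children; a real system would ask the model to revise."""
    return [instruction - 1, instruction + 1]

def evolve(initial, examples, keep=2, max_generations=10):
    """Keep the best candidates each generation and breed refinements from them."""
    population = list(initial)
    for _ in range(max_generations):
        ranked = sorted(set(population), key=lambda c: score(c, examples), reverse=True)
        if score(ranked[0], examples) == 1.0:
            return ranked[0]  # a candidate solves every training example
        parents = ranked[:keep]
        population = parents + [child for p in parents for child in refine(p)]
    return max(set(population), key=lambda c: score(c, examples))

# Toy task: the hidden rule is "add 3".
examples = [(1, 4), (2, 5), (10, 13)]
best = evolve([0, 10, -5], examples)
print(best)  # 3
```

The point of the structure is that nothing here is fine-tuned: all the improvement comes from spending more inference on proposing, testing, and refining candidates against the examples the task already provides.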

Starting point is 00:21:54 Trying to put some context around it, swyx writes, as impressive as this is, I feel like OpenAI is underselling it still. This is the first measure in which GPT-5 has achieved superhuman coding ability, as in, it is literally and measurably better than every other collegiate human programmer on Earth. Through all the prior IMO, IOI, AtCoder competitions, AI was roughly as good as the best humans, maybe a little under. OpenAI's Noam Brown agreed, saying that's a good point. I wouldn't call it superhuman coding ability because there's more to coding than what the ICPC tests, but I think this is the first major coding competition where AI did better than any human competitor. And indeed, that's why it's worth wondering if this actually represents a meaningful

Starting point is 00:22:36 inflection point in some way. OpenAI's Jakub Pachocki put some context around what this means, at least for them, inside the company. He writes, I believe these results, coming from a family of general reasoning models rooted in our main research program, are perhaps the clearest benchmark of progress this year. These competitions are self-contained, time-boxed tests for the ability to discover new ideas. Even before our models were proficient at simple arithmetic, we looked towards these contests as milestones of progress towards transformative artificial intelligence. Our models now rank among the top humans in these domains when posed with well-specified questions and restricted to around five hours. The challenge now is moving towards more open-ended problems

Starting point is 00:23:16 and much longer time horizons. This level of reasoning ability applied over months and years to problems that really matter is what we're after, automating scientific discovery. And that is really the place that the discussion has resolved. That this is not about practical coding. It's about whether we are on the frontier of actually being able to make novel discoveries. Jerry Tworek writes, ICPC probably marks the end of our run on competitions and an end of a certain era for LLM systems.

Starting point is 00:23:44 But what's the next frontier is even more exciting. Roon, I think, put it even more poetically. He posted, Essentially all fixed-time competitions at the edge of human skill have been grandmastered by machines, so labs must pivot to the only true challenge of unraveling the unsolved mysteries. It is very clear that unraveling those unsolved mysteries is where OpenAI's head is at. CPO Kevin Weil tweets, OpenAI models are getting quite good at solving really hard problems.

Starting point is 00:24:11 The next stage is accelerating scientific discovery, and we're beginning to see strong early signs. This is also the constant theme of Google DeepMind CEO Demis Hassabis. In basically every interview, it's pretty clear that his view of what makes for really advanced AI is making novel scientific discoveries that do things like, as he said in April, give us a real crack at solving all disease. So like Jerry said, maybe the ICPC represents the end of an era and the beginning of something new. If so, it is a very exciting time to be in this field, and I'm glad to have all of you here with me as we dive all the way in. That's going to do it for today's AI Daily Brief.

Starting point is 00:24:46 Thanks as always for listening or watching, and until next time, peace.
