The AI Daily Brief: Artificial Intelligence News and Analysis - Can AI Predict the Future?

Episode Date: August 20, 2025

A breakthrough benchmark is testing whether AI can actually predict future events by analyzing real-world data. Researchers at the University of Chicago just launched Prophet Arena, a new AI evaluation platform that measures "predictive intelligence" by having models forecast outcomes on live prediction markets like Kalshi and Polymarket. Early results show AI models like GPT-4 and Claude are already performing as well as or better than human forecasters, with some models finding real market edges - like one AI that correctly predicted a Toronto FC soccer win when the market only gave it 11% odds. This represents a major shift from traditional saturated benchmarks toward dynamic, real-world testing that could reshape how we measure AI progress.

Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Interested in sponsoring the show? nlw@breakdown.network

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, a new benchmark all about how well AI can predict the future. Before that, in the headlines, the U.S. government is investing in our AI future, sort of. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, welcome back to another AI Daily Brief. Quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent. To get an ad-free version of the show, go to Patreon.com slash AI Daily Brief. Monthly ad-free starts at just $3 a month. And if you are interested in sponsoring the show, shoot us a note at sponsors at AIDailyBrief.
Starting point is 00:00:42 It sounds like we are very close to sold out for the year. So if you have interest and urgency on timing, now is a good time. With that, let's get into today's topics. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Well, it appears that the U.S. is investing in our AI future, at least kind of. Plans for the U.S. government to take a stake in Intel appear to be moving forward. Bloomberg reports that the Trump administration plans to acquire a 10% stake in Intel, a move which would make the government the largest single shareholder of the company.
Starting point is 00:01:16 Sources said that the rationale for the deal was that Intel is slated to receive around $10.9 billion in grants under the CHIPS Act. The government plans to take equity of equivalent size, which amounts to a roughly 10% stake. The reporting did note that the exact size of the stake as well as the final decision to move forward are still to be finalized, and Bloomberg's White House source also floated the idea of converting other CHIPS Act awards into equity stakes, which could impact firms like TSMC, Samsung, and Texas Instruments. Now, the entire premise of this deal was to rescue Intel's troubled manufacturing project in Ohio. The project was enabled by the CHIPS Act, but the actual distribution of funds was tied to hitting particular
Starting point is 00:01:53 benchmarks. So far, Intel has received $2.2 billion in grant disbursements, and progress on the Ohio facility has stalled. The reporting pondered whether the deal would allow an acceleration of funding in exchange for the equity stake, but there is a whole lot of uncertainty around how the deal will function. The only thing that's clear is that the administration apparently wants equity in return for distributing CHIPS Act grants. Alongside the new reporting on the government's dealmaking, Bloomberg reports that SoftBank has taken a surprise interest in Intel. The company has agreed to buy $2 billion worth of Intel stock in a private fundraising round. CEO Masa Son said in a statement, For more than 50 years, Intel has been a trusted leader in innovation.
Starting point is 00:02:31 This strategic investment reflects our belief that advanced semiconductor manufacturing and supply will further expand in the United States with Intel playing a critical role. The deal comes as SoftBank doubles down on its commitment to building AI infrastructure in the United States. Last week, SoftBank acquired the Foxconn electric vehicle plant in Ohio for $375 million, with a view to converting it into a manufacturing facility for AI server equipment. Now, many commentators were already alarmed at the idea of the government taking a stake in Intel in exchange for a bailout.
Starting point is 00:02:59 But this new reporting casts the deal in a totally different light. This does not sound like the government aiding a struggling company with a more generous package. It sounds like taking a pound of flesh in exchange for following through with existing commitments. Reason editor Nick Gillespie writes, if Nippon/U.S. Steel, Nvidia, and Intel deals mean anything, it's that Trump's industrial policy is pay to play. Terrible for people who want to live in a world of permissionless innovation, rule of law, and limited government.
Starting point is 00:03:25 The editorial board of the Wall Street Journal is also not optimistic about the chance of success, writing, This is corporate statism, and rarely does it end well. Political control hamstrings innovation and investment, as managers look to their government overlords for approval. To the extent anyone thinks this will work, the logic is that the government will be able to influence other U.S. companies to buy from Intel, giving them the revenue they need to justify the Ohio facility, although that, frankly, to me is not all that appealing as an outcome either. And there are many for whom it's not just the idea of the U.S. government being involved in the sector,
Starting point is 00:03:56 it's the particular horse they're betting on. Aidan Gold writes, Intel is a sinking ship that will not secure America's chip dominance for decades to come. USG buying 10% is poor capital allocation. We need to deploy capital to innovative chip companies, not bureaucracies. Ultimately, the big question is whether Intel already has the capital required to finish construction of a cutting-edge chip fab or not. And if not, we'll quickly find out if the government is interested in making the necessary investments or just focused on taking a cut. Now, staying on the theme of big nationally relevant infrastructure, the Tennessee Valley Authority has agreed to buy electricity from a small nuclear power plant
Starting point is 00:04:32 that's expected to be completed by 2030. Kairos Power is building a demonstration reactor in Tennessee that will operate at 50 megawatts. The reactor will be one of the first Gen 4 small modular reactors to begin operation in the U.S. It's part of a broader partnership with Google that aims to deploy 500 megawatts of small nuclear reactors across a longer time frame. The TVA agreeing to buy the power allows the project to go ahead, with excess power not used by Google's data center feeding into the grid. The contract is the first of its kind and a major milestone in the process of bringing new nuclear power generation online in the U.S. Don Moul, the CEO of the TVA, said, nuclear is the bedrock of the future of energy security.
Starting point is 00:05:09 Google stepping in and helping shoulder the burden of the cost and risk of first-of-a-kind nuclear projects not only helps Google get to those solutions, but it keeps us from having to burden our customers with development of that technology. So it's not just good for Google, it's good for TVA's 10 million customers. It's good for the United States. And it couldn't come at a better time because there is a lot of increased chatter around the issue of data center energy usage. A wave of reporting this summer has highlighted that the existing U.S. energy supply is not keeping up with the accelerated AI buildout. Last week, the energy watchdog for the Northeast grid run by PJM Interconnection warned that the grid is already tapped out and recommended that new data centers should be, quote, required to bring
Starting point is 00:05:46 their own generation. In that light, of course, rapidly moving ahead with experiments in small modular reactors is a timely decision. While this type of reactor is unproven in the U.S., China has multiple test reactors and is about to open its first commercial reactor. We're still a long way from solving the issue of AI energy demand, but the contract is a promising early step towards bringing nuclear power back to the U.S. Next, in the headlines, we move over to a product update. Shishir Mehrotra, the CEO of Grammarly, writes, our Grammarly AI agents are here. Today, we're launching eight new AI agents designed for students and professionals. These agents help with everything from finding credible sources to predicting reader reactions. We created many of these agents with students in mind because
Starting point is 00:06:24 they're the first generation entering a job market where employers expect both subject expertise and AI fluency. So TLDR, after merging with Coda last year, Grammarly has now released its new document-based interface to flesh the service out into a full productivity app. Users can now complete drafting and layout tasks in a new interface that Grammarly is calling its AI-native writing surface. Speaking to some of the challenges we were talking about on this week's Long Read Sunday, Jenny Maxwell, the head of Grammarly for Education, wrote, students today need AI that enhances their capabilities without undermining their learning. Grammarly's new agents fill this gap, acting as real partners that guide students to produce
Starting point is 00:06:58 better work while ensuring they develop real skills that will serve them throughout their careers. By teaching students how to work effectively with AI now, we're preparing them for a workplace where AI literacy will be essential. The new agents include AI Grader, which can provide feedback based on an instructor's guidelines, Reader Reactions, which allows the user to define a reader persona and get feedback from that perspective, Expert Review, which offers subject matter expertise and topic-specific feedback to elevate writing in a particular field, Citation Finder, to ensure the references section is correct, Proofreader, which offers in-line suggestions to improve clarity, and Paraphraser,
Starting point is 00:07:30 which can change the tone of the writing overall. I think it's good that the companies in this space are trying to explore how to deal with the ways AI negatively impacts some of the foundations of education while also helping students be AI literate, although I think ultimately it's going to take more than even a well-thought-out software suite to get there. Lastly, today, an update from the open source field. OpenAI's open source model has been out for a few weeks now, and the tinkering has begun in earnest. One of the more interesting modifications came from Jack Morris, an AI researcher at Meta. He claimed to have stripped out all the reasoning to recreate the base model. In other words,
Starting point is 00:08:04 a base model that just predicts the next token in a string of text based on pre-training alone. Now, in the process of stripping away reasoning, Morris seems to have removed all of the alignment training as well, resulting in a much less censored, less constrained version of the model. He wrote, turning GPT-OSS back into a base model appears to have trivially reversed its alignment. It will now tell us how to build a bomb. It will list all the curse words it knows. It will plan a robbery for me. Now, at this stage, this is less about having some practical use case and more about just exploring what can be done with an open weights model. But to the extent that this is something that OpenAI or other US labs are going to prioritize, these sorts of experiments are going to be important to understand
Starting point is 00:08:41 what really happens when you release an open weights model into the world. For now, though, that's going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, the seven-part series delivers real-world insights from leaders who are scaling AI with purpose. From aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front row seat
Starting point is 00:09:19 to the future of enterprise AI. So go check it out at www.kpmg.us slash AI podcasts or search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts. This episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates
Starting point is 00:09:55 and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring an AI-native SDLC into their org. Blitzy is providing a limited-time, 30-day free proof of concept for qualifying enterprises. The team will provide a 5x velocity increase on a real development project in your org. Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. That's B-L-I-T-Z-Y dot com.
Starting point is 00:10:34 If you are a regular listener, you will have heard about Superintelligent's agent readiness audits at this point. But I wanted to tell you today about the full suite of agent readiness products that go beyond just the initial readiness report. Over the last six months, Superintelligent has built out an entire agent planning suite. We help you move from discovery to planning to implementation. After you've completed your agent readiness audits, we help you double click on your most important use cases with what we call our use case planning reports. These reports are going to help you understand what sort of technical preparation you need to do
Starting point is 00:11:09 to be ready for a use case, what challenges you might face in implementation, and whether you should be thinking about building, buying, partnering, or some combination. After that, you can even get a spec document in what we call our technical blueprint that gives either your developers or the developers of the partner you work with
Starting point is 00:11:24 what they need to build exactly the agent that you're looking for. If you want to learn more about Superintelligent's agent planning suite, we've built a custom GPT to answer your questions. Just go to bit.ly slash super super agent. That's bit.ly slash super super agent, all one word. And if you have any questions, the agent can even help you book an appointment with our team. Welcome back to the AI Daily Brief. Today we are exploring a new way of trying to understand the frontier of AI capabilities. Right now, there are a lot of questions around AI progress in the future. Now, as I've mentioned before, I think this is a summer phenomenon. Each year we get a different sort of AI-skeptic narrative.
Starting point is 00:12:07 This year, the initial perceived lack of progress of GPT-5 opened up the space for a whole set of usually mainstream media articles like this one from the New York Times. Companies are pouring billions into AI. It has yet to pay off. There was another one in the New Yorker. What if AI doesn't get much better than this? Now, part of the challenge with this, and if you are a regular listener, you will have heard me talk about this before, is actually not a problem of model capacity, but a problem of the saturation of benchmarks. In other words, as we get closer to the upper bounds of performance on all of our standard benchmarking tests, our perception of increased performance is impacted by the fact that we only went from a 92 to a 93 versus previous iterations of AI that
Starting point is 00:12:48 might have gone from a 70 to an 80. And this is a real problem, not from a beauty contest kind of standpoint, but because it's hard for us to understand progress if we don't have good ways to measure it. Now, luckily, there are lots of interesting efforts to create new benchmarks that are not yet saturated. The ARC-AGI prize, for example, is one we talk about a lot here, and ARC-AGI itself now actually has three different versions. The most recent, ARC-AGI-3, was just announced a couple of months ago and is what they call an interactive reasoning benchmark. They write, traditionally, to measure intelligence, static benchmarks have been the yardstick, but they do not have the bandwidth required to measure the full spectrum of intelligence.
Starting point is 00:13:26 Interactive reasoning benchmarks, or IRBs, test for a much broader scope of capabilities: exploration, percept-plan-action, memory, goal acquisition, and alignment. And basically, ARC-AGI leverages game environments to test this sort of interactivity. Okay, so the point here is that there is finally some interesting work being done on new benchmarks, but it's still really nascent. Now, meanwhile, there's been this other interesting thing going on. I'm sure that if you've spent any time in and around any sort of media over the last year, at least going back to before last year's U.S. presidential election, you've noticed that an important social phenomenon has been the rise of prediction platforms. Two of the best known are Kalshi and Polymarket, and these provide platforms for
Starting point is 00:14:07 people to bet on a huge range of different phenomena. For example, if you want to get a pulse of what people are thinking about the AI race, go check out the tech section of Polymarket. With $6 million in volume, for example, on the market for which company has the best AI model at the end of August, Google is at 94%. And what's interesting, of course, about prediction markets is that they often tell a different story than the conventional wisdom. Now, the people who are very bullish on prediction markets like the fact that this is, in fact, a market, that people are putting up real money behind their predictions, basically thinking that that is a purer expression of actual belief than just what you say on Twitter or Instagram or whatever. The Wall Street Journal has recently noted
Starting point is 00:14:47 that the prediction markets are very interested in the future of AI. In an article over the weekend, they wrote, Gamblers now bet on AI models like racehorses. Prediction platforms are turning the AI arms race into a high-stakes game. The article begins, now that AI developers are getting paid like pro athletes, it's fitting that fans are placing big bets on how well they're doing their jobs. Now, it's important to note that while the trend may be rising enough to capture the Wall Street Journal's attention,
Starting point is 00:15:12 the market is still fairly small. Trading volume across AI prediction markets so far in August is around $20 million. Not nothing, but also not breaking the bank. And yet still, the volume on AI-related trades is up 1,000% since the beginning of the year. The article also explains why people actually like this as a source of signal beyond just the fact that people are putting their money on it. Basically, the fact that people are putting their money on these markets
Starting point is 00:15:37 tends to lead them to go down rabbit holes looking for interesting information that might influence their decision. In other words, the prediction market bulls basically argue that it's not just collective wisdom. It's the collective wisdom of a group who has a financial incentive to really, really research and try to understand what's going on. Okay, so all of this is the interesting background to today's story. We've got on the one hand, saturation of benchmarks,
Starting point is 00:16:02 and the emergence of some new ones, and on the other hand, the rise of prediction markets. Well then, friends, why not combine the two? Just a few days ago, a new project out of the University of Chicago called Prophet Arena launched. They tweeted, introducing Prophet Arena, the AI benchmark for general predictive intelligence. That is, can AI truly predict the future by connecting today's dots? In their introductory blog post, they write, forecasting is one of humanity's most original and most powerful intellectual pursuits, the spark that gave rise to science and the engine behind modern economics and finance. While today's AI models can ace bar exams and outperform humans in math competitions, a deeper question remains poorly understood. Can AI systems reliably
Starting point is 00:16:46 predict the future by connecting the dots across existing real-world information? And that is the goal of Prophet Arena. It's a new benchmark that evaluates this sort of predictive intelligence through, as they call it, live updated real-world forecasting tasks. Now, the team behind Prophet Arena point out that forecasting has for some time been one of the main points of machine learning. Obviously, these systems are ubiquitous across the enterprise in these very small and discrete sorts of ways. So they write, what's new, and what makes the challenge different this time? The new frontier, they say, is building general-purpose AI systems that make accurate forecasts across a wide range of domains, potentially without domain-specific fine-tuning
Starting point is 00:17:25 or access to specialized data sets. They argue that this kind of open domain forecasting requires capabilities that are on the very edge of what today's AI systems can do, including probabilistic reasoning, i.e. quantifying uncertainty, maintaining calibration and performing statistical thinking, causality, i.e. causal reasoning and modeling of how events unfold and influence one another, and critical thinking, i.e. curating relevant information and assessing the credibility of sources. So how does the platform actually work in practice? Well, Prophet Arena is taking advantage of these new prediction platforms like Kalshi. First, they curate events from
Starting point is 00:18:01 those platforms, selecting for events that are popular, i.e., there are a lot of people participating, so they're going to give better signal around how AI does against humans, because there are enough humans doing it to actually get that information. They're also looking for a certain diversity of different events that are balanced across domains, including politics, economics, sports, science, and entertainment, and they're looking for events that are recurring, things like weekly price movements or earnings announcements, to, as they put it, support consistency and comparability. Then for each event, the AI models are allowed to go gather relevant information, and with the same context, each AI model then submits a structured forecast, which they describe
Starting point is 00:18:37 as a probability distribution over all possible outcomes accompanied by a detailed rationale. Quote, these rationales are made visible to users who can assess their value, share feedback on the usefulness of news sources, and contribute alternative information to observe how forecasts shift in response. Those two steps are then repeated for each event over time until the outcome actually happens. Now, when it comes to performance, they have two sets of evaluation metrics, absolute metrics and relative metrics. The absolute metrics utilize the Brier score, which they describe as a widely adopted proper scoring rule that measures how accurately and confidently AI models predict probabilistic outcomes. And then they also have this concept of average return. They write,
Starting point is 00:19:14 to bridge predictions with real-world actionability, we also introduce an innovative class of average return metrics derived from utility theory. These metrics simulate a scenario where practitioners consistently use AI-generated probabilities to inform their betting decisions in real prediction markets. Users can flexibly adjust risk preferences to explore various betting strategies, offering a practical insight into the economic value generated by LLM-driven forecasts. So basically, the absolute metrics, like the Brier score, are just about how accurate the models were, and the relative metrics, or average return, are all about how well people would be able to use those forecasts to make money in the markets. They also have a couple other supplementary metrics,
Starting point is 00:19:51 and go in depth in a second blog post on their approach to scoring for those of you who are interested. Interestingly, they point out that the Brier score, or the measure of statistical accuracy and calibration, does not always match real-world betting performance. For example, Grok's models score much higher on statistical accuracy than they do on the average return assessment. So what are some of their early findings? Overall, right now, o3-mini ranks highest in average return, above second-place GPT-5, while GPT-5 ranks highest on the Brier score. They've found that models show distinct personalities, with some being more aggressive and others being more conservative, and one of the areas where there's a big gradation is that different
Starting point is 00:20:31 models show differences in how they handle uncertainty in their sources. One example that they shared was around a prediction for Major League Soccer. In that particular example, while the market was pricing in the chance of a Toronto FC win at 11 percent, o3-mini saw a 30 percent win chance. Prophet Arena writes, the model bet on Toronto because of positive expected value and earned a 9x return when Toronto won. This is AI finding a real edge over human crowds. And by the way, I think this also demonstrates why the average return profile could look different than just the raw accuracy. If a model is not just good in pure terms, but good at figuring out when the existing conventional wisdom is off, there's more room to
Starting point is 00:21:11 arbitrage that into expected value when it comes to these prediction markets. I think one thing that's super interesting about this to me is that given how saturated all of these models feel right near the top of all the benchmarks, they really have wildly divergent approaches to their reasoning. For example, on the question of whether AI regulation would become federal law this year, Qwen 3 saw a 75% chance with its aggressive interpretation, Llama 4 Maverick had only a 35% chance, a conservative approach that cited the complexity, and GPT-4.1 gave it a 60% chance that was basically a balanced middle ground sort of call.
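To make the two kinds of metric discussed above concrete, here is a minimal Python sketch, not Prophet Arena's actual scoring code. It assumes the standard multi-class Brier score (the episode does not say which variant the benchmark uses), simplifies the Toronto FC match to a two-outcome market, and assumes a contract that costs the market price and pays $1 if the event happens, with fees ignored. The function names are hypothetical and chosen only for illustration.

```python
def brier_score(forecast, outcome):
    """Standard multi-class Brier score: squared error between the forecast
    distribution and the realized outcome. Lower is better; 0 is perfect."""
    return sum(
        (prob - (1.0 if label == outcome else 0.0)) ** 2
        for label, prob in forecast.items()
    )


def expected_profit_per_dollar(believed_prob, market_price):
    """Expected profit from spending $1 on contracts that cost `market_price`
    each and pay $1 if the event happens (assumed payout rule, fees ignored)."""
    payout_per_dollar = 1.0 / market_price
    return believed_prob * payout_per_dollar - 1.0


# Numbers from the Toronto FC example; the two-outcome split is a simplification.
market = {"toronto_win": 0.11, "no_toronto_win": 0.89}
model = {"toronto_win": 0.30, "no_toronto_win": 0.70}

print(round(brier_score(model, "toronto_win"), 4))       # 0.98
print(round(brier_score(market, "toronto_win"), 4))      # 1.5842
print(round(1.0 / 0.11, 2))                              # 9.09, the roughly 9x payout
print(round(expected_profit_per_dollar(0.30, 0.11), 2))  # 1.73
```

On this single event the model's 30 percent forecast scores better than the market's 11 percent, and an 11 percent price implies a payout of about 9x, so the bet has positive expected value for anyone who believes the true chance is 30 percent. One event proves little on its own, which is why both kinds of metric are averaged over many events.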
Starting point is 00:21:47 And again, all of this is using the same data, just really different approaches to reasoning. The response to this has been really positive. Tonk Yaljan summed up about 1,000 versions of this tweet when he said, the kind of benchmark that's actually useful, very cool. And indeed, if there was just one takeaway from the AI community so far,
Starting point is 00:22:05 it's that this is exactly the sort of interesting new type of benchmark we really need right now. Neon Blue CEO Stephen L. Hodge writes, The reason recent model releases are disappointing is because of benchmark hacking. Company optimizes their model for the benchmark and says this is 50% smarter than the others. Prophet Arena is cool because I think we can all agree we have AGI when AI can predict the future. Simon Smith writes, Love this new AI benchmark based on predictions because it's one, practical, two, always up to date,
Starting point is 00:22:31 and three, impossible to game. Also interesting to see that it's not always the smartest models that are most profitable. Model personality matters too. Some of the folks in the AI safety community used this as a moment to decry the strand of denialists who think that AI isn't real. Quoting a theoretical skeptic, the AI Safety Memes account says, AI just memorizes and regurgitates. Okay, buddy, then explain to me how it can predict the literal future better than humans. Dan Hendrycks, meanwhile, points out, quote, a new benchmark shows that AIs out of the box can perform similarly to or better than prediction markets at forecasting
Starting point is 00:23:04 future world events. Interestingly, some forecasters will say AIs are not yet human level, despite AIs doing better than the typical forecaster or prediction market for a long time. Now, I thought OpenAI's Noam Brown had a really interesting comment. I recently chatted with a VC who believed AGI was coming and would disrupt a lot of jobs, but not their job. Of course AIs could write code and review contracts, but making accurate, calibrated predictions about the future, that's uniquely human. And this is why I always say on this show that I really believe that AI is in fact
Starting point is 00:23:37 coming for all of our jobs, not that it will replace us, but I do think that all of our jobs will look different in the future. Now, the one other strand of conversation, which does seem important and I think will get larger in the future, is the relationship between predictions and self-fulfilling prophecy. Alan Zhao writes, the decisions made by humans will increasingly get influenced by AI. Instead of asking, did the model predict correctly, we may need to ask, did the model's prediction cause the outcome? Termer Glick also writes, Prophet Arena is a strong benchmark, yet predictions influence outcomes.
Starting point is 00:24:08 When models publish probabilities, incentives shift, traders react, and dynamics change. The feedback loop persists until the signal is arbitraged away and performance saturates. In other words, the presence of the benchmark and the approach itself actually creates a new set of challenges. Still, overall, I am very excited to see this sort of exploration and experimentation in the benchmark space. I think it's extremely important for us to actually understand not only how models are progressing, but how they work. So great job to the team at Prophet Arena. I will be very much looking forward to seeing how the project evolves and how new models perform.
Starting point is 00:24:40 For now, though, that's going to do it for today's AI Daily Brief. Until next time, peace.
