The AI Daily Brief: Artificial Intelligence News and Analysis - Can AI Predict the Future?
Episode Date: August 20, 2025
A breakthrough benchmark is testing whether AI can actually predict future events by analyzing real-world data. Researchers at the University of Chicago just launched Prophet Arena, a new AI evaluation platform that measures "predictive intelligence" by having models forecast outcomes on live prediction markets like Kalshi and Polymarket. Early results show AI models like GPT-4 and Claude are already performing as well as or better than human forecasters, with some models finding real market edges - like one AI that correctly predicted a Toronto FC soccer win when the market only gave it 11% odds. This represents a major shift from traditional saturated benchmarks toward dynamic, real-world testing that could reshape how we measure AI progress.
Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, a new benchmark all about how well AI can predict the future.
Before that, in the headlines, the U.S. government is investing in our AI future, sort of.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, welcome back to another AI Daily Brief. Quick announcements before we dive in.
First of all, thank you to today's sponsors, KPMG, Blitzy, and Superintelligent.
To get an ad-free version of the show, go to Patreon.com slash AI Daily Brief.
Monthly ad-free starts at just $3 a month.
And if you are interested in sponsoring the show, shoot us a note at sponsors at AIDailyBrief.
It sounds like we are very close to sold out for the year.
So if you have interest and urgency on timing, now is a good time.
With that, let's get into today's topics.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.
Well, it appears that the U.S. is investing in our AI future, at least kind of.
Plans for the U.S. government to take a stake in Intel appear to be moving forward.
Bloomberg reports that the Trump administration plans to acquire a 10% stake in Intel,
a move which would make the government the largest single shareholder of the company.
Sources said that the rationale for the deal was that Intel is slated to receive around
$10.9 billion in grants under the CHIPS Act. The government plans to take equity of
equivalent size, which amounts to a roughly 10% stake. The reporting did note that the exact size
of the stake as well as the final decision to move forward are still to be finalized, and Bloomberg's
White House source also floated the idea of converting other CHIPS Act awards into equity stakes,
which could impact firms like TSMC, Samsung, and Texas Instruments. Now, the entire premise of
this deal was to rescue Intel's troubled manufacturing project in Ohio. The project was
enabled by the CHIPS Act, but the actual distribution of funds was tied to hitting particular
benchmarks. So far, Intel has received $2.2 billion in grant disbursements, and progress on the Ohio
facility has stalled. The reporting pondered whether the deal would allow an acceleration of funding
in exchange for the equity stake, but there is a whole lot of uncertainty around how the deal will
function. The only thing that's clear is that the administration apparently wants equity in return
for distributing CHIPS Act grants. Alongside the new reporting on the government's dealmaking,
Bloomberg reports that SoftBank has taken a surprise interest in Intel. The company has agreed to buy
$2 billion worth of Intel stock in a private fundraising round. CEO Masa Son said in a statement,
For more than 50 years, Intel has been a trusted leader in innovation.
This strategic investment reflects our belief that advanced semiconductor manufacturing and supply
will further expand in the United States with Intel playing a critical role.
The deal comes as SoftBank doubles down on their commitment to building AI infrastructure in the United
States.
Last week, SoftBank acquired the Foxconn electric vehicle plant in Ohio for $375 million,
with a view to converting it into a manufacturing facility for AI server equipment.
Now, many commentators were already alarmed at the idea of the government
taking a stake in Intel in exchange for a bailout.
But this new reporting casts the deal in a totally different light.
This does not sound like the government aiding a struggling company with a more generous package.
It sounds like taking a pound of flesh in exchange for following through with existing commitments.
Reason editor Nick Gillespie writes,
If the Nippon/U.S. Steel, Nvidia, and Intel deals mean anything,
it's that Trump's industrial policy is pay to play.
Terrible for people who want to live in a world of permissionless innovation,
rule of law, and limited government.
The editorial board of the Wall Street Journal is also not optimistic for the chance of success,
writing,
This is corporate statism and rarely does it end well. Political control hamstrings innovation and investment,
as managers look to their government overlords for approval.
To the extent anyone thinks this will work, the logic is that the government will be able to
influence other U.S. companies to buy from Intel, giving them the revenue they need to justify
the Ohio facility, although that, frankly to me, is not all that appealing as an outcome either.
And there are many for whom it's not just the idea of the U.S. government being involved in the sector,
it's the particular horse they're betting on. Aidan Gold writes, Intel is a sinking ship that will not
secure America's chip dominance for decades to come. USG buying 10% is poor capital allocation.
We need to deploy capital to innovative chip companies, not bureaucracies.
Ultimately, the big question is whether Intel already has the capital required to finish
construction of a cutting-edge chip fab or not. And if not, we'll quickly find out if the government
is interested in making the necessary investments or just focused on taking a cut.
Now, staying on the theme of big nationally relevant infrastructure,
the Tennessee Valley Authority has agreed to buy electricity from a small nuclear power plant
that's expected to be completed by 2030.
Kairos Power is building a demonstration reactor in Tennessee that will operate at 50 megawatts.
The reactor will be one of the first Gen 4 small modular reactors to begin operation in the U.S.
It's part of a broader partnership with Google that aims to deploy 500 megawatts of small nuclear
reactors across a longer time frame. The TVA agreeing to buy the power allows the project to go ahead,
with excess power not used by Google's data center feeding into the grid. The contract is the first of
its kind and a major milestone in the process of bringing new nuclear power generation online in the U.S.
Don Moul, the CEO of the TVA, said, nuclear is the bedrock of the future of energy security.
Google stepping in and helping shoulder the burden of the cost and risk of a first-of-a-kind nuclear
project not only helps Google get to those solutions, but it keeps us from having to burden our
customers with development of that technology. So it's not just good for Google, it's good for TVA's 10 million
customers. It's good for the United States. And it couldn't come at a better time because there is a
lot of increased chatter around the issue of data center energy usage. A wave of reporting this summer
has highlighted that the existing U.S. energy supply is not keeping up with the accelerated AI buildout.
Last week, the energy watchdog for the Northeast grid run by PJM Interconnection warned that the
grid is already tapped out and recommended that new data centers should be, quote, required to bring
their own generation. In that light, of course, rapidly moving ahead with experiments in small modular
reactors is a timely decision. While this type of reactor is unproven in the U.S., China has multiple
test reactors and is about to open their first commercial reactor. We're still a long way from solving
the issue of AI energy demand, but the contract is a promising early step towards bringing nuclear power
back to the U.S. Next, in the headlines, we move over to a product update. Shishir, the CEO of
Grammarly, writes, our Grammarly AI agents are here. Today, we're launching eight new AI agents
designed for students and professionals. These agents help with everything from finding credible
sources to predicting reader reactions. We created many of these agents with students in mind because
they're the first generation entering a job market where employers expect both subject
expertise and AI fluency. So TLDR, after merging with Coda last year, Grammarly has now
released their new document-based interface to flesh the service out into a full productivity app.
Users can now complete drafting and layout tasks in a new interface that Grammarly is calling
their AI-native writing surface. Speaking to some of the challenges we were talking
about on this week's Long Read Sunday, Jenny Maxwell, the head of Grammarly for Education,
wrote, students today need AI that enhances their capabilities without undermining their learning.
Grammarly's new agents fill this gap, acting as real partners that guide students to produce
better work while ensuring they develop real skills that will serve them throughout their careers.
By teaching students how to work effectively with AI now, we're preparing them for a workplace
where AI literacy will be essential.
The new agents include Grader, which can provide feedback based on an instructor's guidelines,
reader reactions, which allows the user to define a reader persona and get feedback from that
perspective, expert review, which offers subject matter expertise and topic-specific feedback
to elevate writing in a particular field, citation finder to ensure the references section is
correct, proofreader which offers in-line suggestions to improve clarity, and paraphraser,
which can change the tone of the writing overall.
I think it's good that the companies in this space are trying to explore this line of how
to deal with how AI negatively impacts some of the foundations of education while also
helping students be AI literate, although I think ultimately it's going to take more than even
a well-thought-out software suite to get there. Lastly, today, an update from the open source field.
OpenAI's open source model has been out for a few weeks now, and the tinkering has begun in earnest.
One of the more interesting modifications came from Jack Morris, an AI researcher at Meta.
He claimed to have stripped out all the reasoning to recreate the base model. In other words,
a base model that just predicts the next token in a string of text based on pre-training alone.
Now, in the process of stripping away reasoning, Morris seems to have removed all of the alignment
training as well, resulting in a much less censored and unconstrained version of the model.
He wrote, turning GPT-OSS back into a base model appears to have trivially reversed its alignment.
It will now tell us how to build a bomb. It will list all the curse words it knows. It will plan a robbery for me.
Now, at this stage, this is less about having some practical use case and more about just exploring
what can be done with an open weights model. But to the extent that this is something that OpenAI or other
US labs are going to prioritize, these sorts of experiments are going to be important to understand
what really happens when you release an open weights model into the world. For now, though, that's
going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode.
What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you
inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises.
Hosted by me, Nathaniel Whittemore, and powered by KPMG, the seven-part series delivers real-world insights
from leaders who are scaling AI with purpose.
From aligning culture and leadership to building trust, data readiness, and deploying AI agents.
Whether you're a C-suite executive, strategist, or innovator, this podcast is your front row seat
to the future of Enterprise AI.
So go check it out at www.kpmg.us slash AI podcasts or search You Can with AI on Spotify,
Apple Podcasts, or wherever you get your podcasts.
This episode is brought to you by Blitzy, the enterprise autonomous
software development platform with infinite code context. Blitzy uses thousands of specialized AI
agents that think for hours to understand enterprise-scale code bases with millions of lines of code.
Enterprise engineering leaders start every development sprint with the Blitzy platform,
bringing in their development requirements. The Blitzy platform provides a plan, then generates
and pre-compiles code for each task. Blitzy delivers 80% plus of the development work
autonomously while providing a guide for the final 20% of human development work required to complete
the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
as their pre-IDE development tool, pairing it with their coding co-pilot of choice to bring an
SDLC into their org. Blitzy is providing a limited time, 30-day free proof of concept for qualifying
enterprises. The team will provide a 5x velocity increase on a real development project in your
org. Visit blitzy.com and press book demo to learn how Blitzy transforms your SDLC from AI-assisted
to AI-native. That's B-L-I-T-Z-Y.
If you are a regular listener, you will have heard about Superintelligent's agent readiness audits at this point.
But I wanted to tell you today about the full suite of agent readiness products that go beyond just the
initial readiness report. Over the last six months, Superintelligent has built out an entire
agent planning suite. We help you move from discovery to planning to implementation. After you've
completed your agent readiness audits, we help you double click on your most important use cases
with what we call our use case planning reports.
These reports are going to help you understand
what sort of technical preparation you need to do
to be ready for a use case,
what challenges you might face in implementation,
and whether you should be thinking about building,
buying, partnering, or some combination.
After that, you can even get a spec document
in what we call our technical blueprint
that gives either your developers
or the developers of the partner you work with
what they need to build exactly the agent that you're looking for.
If you want to learn more about Superintelligent's agent planning suite,
we've built a custom GPT to answer your questions.
Just go to bit.ly slash super super agent. That's bit.ly slash super super agent, all one word. And if you have any
questions, the agent can even help you book an appointment with our team.
Welcome back to the AI Daily Brief.
Today we are exploring a new way of trying to understand the frontier of AI capabilities.
Right now, there are a lot of questions around AI progress in the future. Now, as I've mentioned before,
I think this is a summer phenomenon. Each year we get a different sort of AI-skeptic narrative.
This year, the initial perceived lack of progress of GPT-5 opened up the space for a whole set of
usually mainstream media articles like this one from the New York Times: Companies Are Pouring
Billions Into AI. It Has Yet to Pay Off. There was another one in the New Yorker: What If AI Doesn't
Get Much Better Than This? Now, part of the challenge with this, if you are a regular listener,
you will have heard me talk about this before, is actually not a problem of model capacity,
but a problem of the saturation of benchmarks. In other words, as we get closer to the upper bounds of
performance on all of our standard benchmarking tests, our perception of increased performance
is impacted by the fact that we only went from a 92 to a 93 versus previous iterations of AI that
might have gone from a 70 to an 80. And this is a real problem, not from a beauty contest kind of
standpoint, but because it's hard for us to understand progress if we don't have good ways to measure
it. Now, luckily, there are lots of interesting efforts to create new
benchmarks that are not yet saturated. The ARC-AGI Prize, for example, is one we talk about a lot here,
and ARC-AGI itself now actually has three different versions. The most recent, ARC-AGI-3, was just
announced a couple of months ago and is what they call an interactive reasoning benchmark.
They write, traditionally, to measure intelligence, static benchmarks have been the yardstick,
but they do not have the bandwidth required to measure the full spectrum of intelligence.
Interactive reasoning benchmarks or IRBs test for a much broader scope of capabilities: exploration,
percept-plan-action, memory, goal acquisition, and alignment. And basically, ARC-AGI leverages
game environments to test this sort of interactivity. Okay, so the point here is that there is finally
some interesting work being done on new benchmarks, but it's still really nascent. Now,
meanwhile, there's been this other interesting thing going on. I'm sure that if you've spent
any time in and around any sort of media recently for the last year, at least going back to before
last year's U.S. presidential election, an important social phenomenon has been the rise of
prediction platforms. Two of the best known are Kalshi and Polymarket, and these provide platforms for
people to bet on a huge range of different phenomena. For example, if you want to get a pulse of
what people are thinking about the AI race, go check out the tech section of Polymarket. In the market
on which company has the best AI model at the end of August, which has $6 million in volume,
Google is at 94%. And what's interesting, of course, about prediction markets is that they often
tell a different story than the conventional wisdom. Now, the people who are very bullish on prediction
markets like the fact that this is, in fact, a market, that people are putting up real money
behind their predictions, basically thinking that that is a purer expression of actual belief than
just what you say on Twitter or Instagram or whatever. The Wall Street Journal has recently noted
that the prediction markets are very interested in the future of AI. In an article over the
weekend, they wrote, Gamblers Now Bet on AI Models Like Racehorses. Prediction platforms
are turning the AI arms race into a high-stakes game.
The article begins,
now that AI developers are getting paid like pro athletes,
it's fitting that fans are placing big bets on how well they're doing their jobs.
Now, it's important to note that while the trend may be rising
enough to capture the Wall Street Journal's attention,
the market is still fairly small.
Trading volume across AI prediction markets,
so far in August, is around $20 million.
Not nothing, but also not breaking the bank.
And yet still, the volume on AI-related trades is up 1,000% since the beginning of the year.
The article also explains why people actually like this as a source of signal
outside of just the fact that people are putting their money on it.
Basically, the fact that people are putting their money on these markets
tends to lead them to go down rabbit holes looking for interesting information
that might influence their decision.
In other words, the prediction market bulls basically argue that it's not just collective
wisdom.
It's the collective wisdom of a group who has a financial incentive to really, really
research and try to understand what's going on.
Okay, so all of this is the
interesting background to today's story. We've got, on the one hand, saturation of benchmarks,
and the emergence of some new ones, and on the other hand, the rise of prediction markets.
Well then, friends, why not combine the two? Just a few days ago, a new project out of the University
of Chicago called Prophet Arena launched. They tweeted, introducing Prophet Arena, the AI benchmark
for general predictive intelligence. That is, can AI truly predict the future by connecting
today's dots. In their introductory blog post, they write, forecasting is one of humanity's most
original and most powerful intellectual pursuits, the spark that gave rise to science and the engine
behind modern economics and finance. While today's AI models can ace bar exams and outperform
humans in math competitions, a deeper question remains poorly understood. Can AI systems reliably
predict the future by connecting the dots across existing real-world information? And that is the
goal of Prophet Arena. It's a new benchmark that evaluates this sort of
predictive intelligence through, as they call it, live-updated real-world forecasting tasks.
Now, the team behind Prophet Arena points out that forecasting has for some time been one of the
main points of machine learning. Obviously, these systems are ubiquitous across the enterprise
in these very small and discrete sort of ways. So they ask what's new and what makes the
challenge different this time. The new frontier, they say, is building general-purpose AI systems that
make accurate forecasts across a wide range of domains, potentially without domain-specific fine-tuning
or access to specialized data sets.
They argue that this kind of open-domain forecasting requires capabilities that are on the
very edge of what today's AI systems can do, including probabilistic reasoning, i.e.,
quantifying uncertainty, maintaining calibration, and performing statistical thinking; causality, i.e.,
causal reasoning and modeling of how events unfold and influence one another; and critical
thinking, i.e., curating relevant information and assessing the credibility of sources.
So how does the platform actually work in practice? Well, Prophet Arena is
taking advantage of these new prediction platforms like Kalshi. First, they curate events from
those platforms, selecting for events that are popular, i.e., there are a lot of people participating,
so they're going to give better signal around how AI does against humans, because there are enough
humans doing it to actually get that information. They're also looking for a certain diversity
of different events that are balanced across domains, including politics, economics, sports,
science, entertainment, and they're looking for events that are recurring, things like weekly
price movements or earnings announcements, to, as they put it, support consistency
and comparability. Then for each event, the AI models are allowed to go gather relevant information,
and with the same context, each AI model then submits a structured forecast, which they describe
as a probability distribution over all possible outcomes accompanied by a detailed rationale.
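To make the idea of a structured forecast concrete, here is a minimal Python sketch of what such a record could look like. This is an illustration only, not Prophet Arena's actual schema: the class name, field names, outcome labels, and rationale text are all hypothetical, and only the 30 percent figure comes from the example discussed later in the episode.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Forecast:
    """Hypothetical structured forecast: a distribution over outcomes plus a rationale."""
    event: str
    probabilities: Dict[str, float]  # outcome label -> probability
    rationale: str

    def __post_init__(self) -> None:
        # Every probability must lie in [0, 1].
        if any(p < 0.0 or p > 1.0 for p in self.probabilities.values()):
            raise ValueError("each probability must be between 0 and 1")
        # The probabilities must cover all outcomes, so they should sum to 1.
        total = sum(self.probabilities.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"probabilities sum to {total}, expected 1.0")


# Illustrative values only; the outcome labels and rationale are made up.
forecast = Forecast(
    event="Toronto FC match result",
    probabilities={"Toronto FC wins": 0.30, "Toronto FC does not win": 0.70},
    rationale="Placeholder text standing in for the model's detailed reasoning.",
)
```

The validation step is the design point: a forecast that does not form a proper probability distribution cannot be scored consistently, so it is rejected up front.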
Quote, these rationales are made visible to users who can assess their value, share feedback
on the usefulness of news sources, and contribute alternative information to observe how forecasts
shift in response. Those two steps are then repeated for each event over time until the outcome actually
happens. Now, when it comes to performance, they have two sets of evaluation metrics, absolute
metrics and relative metrics. The absolute metrics use the Brier score, which they describe as a
widely adopted proper scoring rule that measures how accurately and confidently AI models predict
probabilistic outcomes. And then they also have this concept of average return. They write,
to bridge predictions with real-world actionability, we also introduce an innovative class of
average return metrics derived from utility theory. These metrics simulate a scenario where
practitioners consistently use AI-generated probabilities to inform their betting decisions in real
prediction markets. Users can flexibly adjust risk preferences to explore various betting strategies,
offering a practical insight into the economic value generated by LLM-driven forecasts.
So basically, the absolute metrics like the Brier score are just about how accurate they were,
and the relative metrics or average return are all about how well people would be able to use
those forecasts to make money in the markets. They also have a couple of other supplementary metrics,
and go deep in a secondary blog post on their approach to scoring for those of you who are interested.
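For anyone who wants to see the accuracy side in code, here is a minimal Python sketch of the standard binary Brier score: the mean squared difference between the forecast probability and what actually happened (1 if the event occurred, 0 if not). Under this textbook definition, lower is better. The benchmark's own implementation and reporting convention may differ, and the forecast values below are made up for illustration.

```python
def brier_score(forecasts):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where outcome is
    1 if the event happened and 0 if it did not. Lower scores are better.
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


# Hypothetical forecasts, not data from the benchmark.
confident_and_right = [(0.9, 1), (0.1, 0)]
hedged = [(0.5, 1), (0.5, 0)]
confident_and_wrong = [(0.9, 0), (0.1, 1)]

for name, data in [
    ("confident and right", confident_and_right),
    ("hedged", hedged),
    ("confident and wrong", confident_and_wrong),
]:
    print(f"{name}: {brier_score(data):.2f}")
```

The three cases show why this counts as a proper scoring rule in practice: confidence is rewarded only when it turns out to be right, and always answering 50 percent lands in the middle.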
Interestingly, they point out that the Brier score, the measure of statistical accuracy and
calibration, does not always match real-world betting performance. For example, Grok's models
score much higher on statistical accuracy than they do in the average return assessment.
So what are some of their early findings? Overall, right now, o3-mini ranks highest in average
return above second place GPT-5, while GPT-5 ranks highest on the Brier score.
They've found that models show distinct personalities, with some being more aggressive and others
being more conservative, and one of the areas where there's a big gradation is that different
models show differences in how they handle uncertainty in their sources.
One example that they shared was around a prediction for Major League Soccer.
In that particular example, while the market was pricing in the chance of a Toronto FC win at 11
percent, o3-mini saw a 30 percent win chance. Prophet Arena writes, the model bet on Toronto because of
positive expected value and earned a 9x return when Toronto won. This is AI finding a real edge over
human crowds. And by the way, I think this also demonstrates why the average return profile
could look different than just the raw accuracy. If a model is not just good in pure terms,
but good at figuring out when the existing conventional wisdom is off, there's more room to
arbitrage that into expected value when it comes to these prediction markets.
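The arithmetic behind that Toronto example can be sketched in a few lines of Python. This assumes a simple binary contract that costs the market price and pays $1 if the event happens, with no fees or slippage. That is a simplification for illustration, not a description of how the benchmark computes its average return metric, and the function names are hypothetical.

```python
def expected_profit_per_dollar(model_prob, market_price):
    """Expected net profit per $1 staked, if the model's probability is correct.

    Assumes a binary contract bought at `market_price` that pays $1 on a win.
    """
    if not 0.0 < market_price < 1.0:
        raise ValueError("market_price must be strictly between 0 and 1")
    return model_prob / market_price - 1.0


def payout_multiple(market_price):
    """Gross payout per $1 staked when the event actually happens."""
    if not 0.0 < market_price < 1.0:
        raise ValueError("market_price must be strictly between 0 and 1")
    return 1.0 / market_price


market_price = 0.11  # market-implied chance of a Toronto FC win
model_prob = 0.30    # the model's estimate from the episode

edge = expected_profit_per_dollar(model_prob, market_price)
multiple = payout_multiple(market_price)
print(f"expected profit per $1: {edge:.2f}")
print(f"payout multiple on a win: {multiple:.1f}x")
```

Under these assumptions the payout on a win is about 9 times the stake, which lines up with the 9x figure quoted above. And if the model had simply agreed with the market at 11 percent, the expected profit would be zero, which is why disagreeing with the crowd and being right is what drives returns.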
I think one thing that's super interesting about this to me is that given how saturated all of these
other models feel right near the top of all the benchmarks, they really have wildly divergent
approaches to their reasoning. For example, on the question of whether AI regulation would become
federal law this year, Qwen 3 saw a 75% chance with an aggressive interpretation,
Llama 4 Maverick had only a 35% chance, a conservative approach that cited the complexity,
and GPT-4.1 gave it a 60% chance
that was basically a balanced middle ground sort of call.
And again, all of this is using the same data
just really different approaches to reasoning.
The response to this has been really positive.
Tonk Yaljan summed up about 1,000 versions of this tweet
when he said,
the kind of benchmark that's actually useful, very cool.
And indeed, if there was just one takeaway
from the AI community so far,
it's that this is exactly the sort of interesting
new type of benchmark we really need right now.
Neon Blue CEO Stephen L. Hodge writes,
The reason recent model releases are disappointing is because of benchmark hacking.
Company optimizes their model for the benchmark and says this is 50% smarter than the others.
Prophet Arena is cool because I think we can all agree we have AGI when AI can predict the future.
Simon Smith writes,
Love this new AI benchmark based on predictions because it's one, practical, two, always up to date,
and three, impossible to game.
Also interesting to see that it's not always the smartest models that are most profitable.
Model personality matters too.
Some of the folks in the AI safety community used this as a moment to decry the strand of denialists
who think that AI isn't real. Quoting a theoretical skeptic, the AI safety memes account says,
AI just memorizes and regurgitates. Okay, buddy, then explain to me how it can predict the literal
future better than humans. Dan Hendrycks, meanwhile, points out, quote, a new benchmark shows
that AIs out of the box can perform similarly to or better than prediction markets at forecasting
future world events. Interestingly, some forecasters will say AIs are not yet human level,
despite AIs doing better than the typical forecaster or prediction market for a long time.
Now, I thought OpenAI's Noam Brown had a really interesting comment.
I recently chatted with the VC who believed AGI was coming and would disrupt a lot of jobs,
but not their job.
Of course AIs could write code and review contracts, but making accurate, calibrated
predictions about the future, that's uniquely human.
And this is why I always say on this show that I really believe that AI is in fact
coming for all of our jobs,
not that it will replace us, but I do think that all of our jobs will look different in the future.
Now, the one other strand of conversation, which does seem important and I think will get
larger in the future, is the relationship between predictions and self-fulfilling prophecy.
Alan Zhao writes, the decisions made by humans will increasingly get influenced by AI.
Instead of asking, did the model predict correctly, we may need to ask, did the model's prediction
cause the outcome. Termer Glick also writes,
Prophet Arena is a strong benchmark, yet predictions influence outcomes.
When models publish probabilities, incentives shift, traders react, and dynamics change.
The feedback loop persists until the signal is arbitraged away and performance saturates.
In other words, the presence of the benchmark and the approach itself actually creates a new set of challenges.
Still, overall, I am very excited to see this sort of exploration and experimentation in the benchmark space.
I think it's extremely important for us to actually understand not only how models are progressing, but how they work.
So great job to the team at Prophet Arena.
I will be very much looking forward to seeing how the project evolves
and how new models perform.
For now, though, that's going to do it for today's AI Daily Brief.
Until next time, peace.
