The AI Daily Brief: Artificial Intelligence News and Analysis - So Far, AI Can Only Automate 2.5% of Jobs
Episode Date: November 1, 2025
That is, according to a new study designed to measure how well AI agents can perform real-world freelance projects. The results show just how far off full automation remains. NLW breaks down the new “Remote Lab...or Index,” how it compares to OpenAI’s GDP-V metric, and what it reveals about the difference between automating tasks versus entire jobs. Plus: Amazon’s strong AI-driven earnings, Meta’s record-breaking bond sale for data centers, YouTube’s leadership shakeup, and Nvidia’s big bet on Poolside.
Learn more: https://www.remotelabor.ai/
Brought to you by:
KPMG – Discover how AI is transforming possibility into reality. Tune into the new KPMG 'You Can with AI' podcast and unlock insights that will inform smarter decisions inside your enterprise. Listen now and start shaping your future with every episode. https://www.kpmg.us/AIpodcasts
AssemblyAI - The best way to build Voice AI apps - https://www.assemblyai.com/brief
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
Robots & Pencils - Cloud-native AI solutions that power results https://robotsandpencils.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Interested in sponsoring the show? sponsors@aidailybrief.ai
Transcript
Today on the AI Daily Brief, a new measure for how much work AI can actually automate.
Before that in the headlines, Amazon's cloud group outperforms.
The AI Daily Brief is a daily podcast and video about the most important news and discussions
in AI.
All right, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, Robots & Pencils, Blitzy, KPMG, and
Airia.
To get an ad-free version of the show, go to patreon.com slash AI Daily Brief or sign up on Apple
Podcasts.
To learn about sponsoring the show, shoot us a note at sponsors@aidailybrief.ai.
To learn about speaking and jobs, go to aidailybrief.ai,
and to participate in our ROI benchmarking survey
and get access to a report around
how much value everyone is getting out of the now hundreds
and hopefully thousands of different use cases
that people have contributed, go to roisurvey.ai.
And now with all those dot AIs out of the way,
let's get to the show.
Welcome back to the AI Daily Brief Headlines edition,
all the daily AI news you need in around five minutes.
We kick off today with an upside surprise from Amazon.
Now, I think at this point,
it's fair to say that 2025 has been a bit of a rough one for Amazon.
Alexa finally made it to market, but it was met with pretty underwhelming reviews.
Google appeared to swoop in on their big Anthropic partnership in a 10-figure deal that highlighted Google's custom silicon.
While there was initial excitement around Amazon Nova, that buzz has petered out pretty significantly,
and of course for the past week, they have become the poster boy for AI layoffs.
Turns out that is very much not the full story.
AWS posted quarterly revenues of $33 billion, up 20% from last year, and recording their
highest growth rate since 2022.
Analysts had only forecast 18% growth, and there's nothing analysts love more than underestimating
a fast-growing tech stock.
E-commerce sales were also a slight outperformance, coming in at $180 billion for 13% growth.
As we saw with Google's earnings on Wednesday, ROI justifies CapEx, and Amazon is the current
leader in AI spending.
They bumped up the forecast for this year's CapEx to $125 billion,
exceeding the analyst forecast of $119 billion.
That's up 55% year-over-year with the expectation it will continue increasing next year as well.
CEO Andy Jassy said,
"We continue to see strong demand in AI and core infrastructure,
and we've been focused on accelerating capacity,
adding more than 3.8 gigawatts in the past 12 months."
He also had some fighting words for the other hyperscalers, commenting,
"Percentage growth is a relative term.
It's very different having 20% year-over-year growth on a $132 billion annualized run rate
than to have a higher percentage growth rate on a meaningfully smaller annual revenue,
which is the case with our competitors."
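Jassy's point about percentage growth versus absolute growth can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope illustration, not anything from the earnings call. It assumes the $132 billion run rate is the current figure after 20% growth, and the competitor numbers (`60` and `0.40`) are invented purely for contrast.

```python
def added_revenue(current_run_rate: float, yoy_growth: float) -> float:
    """Dollars added over the past year, given the current run rate and YoY growth.

    Assumes current_run_rate already includes the growth, so the
    year-ago base is current_run_rate / (1 + yoy_growth).
    """
    year_ago = current_run_rate / (1 + yoy_growth)
    return current_run_rate - year_ago


# AWS figures as quoted in the episode (billions of dollars).
aws_added = added_revenue(132, 0.20)

# Hypothetical competitor: the 60 and 0.40 are made-up numbers for contrast.
rival_added = added_revenue(60, 0.40)

print(f"AWS added about ${aws_added:.1f}B; hypothetical rival added about ${rival_added:.1f}B")
```

Under these assumptions, the larger business adds more absolute revenue even though its growth rate is half that of the made-up rival.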
The markets liked the spunk and they liked the news,
and the stock was up 13% in after-hours trading,
making Amazon surprisingly the best performer during this big week of tech earnings.
Now, one important note that came from the earnings call
was Jassy disavowing the notion that Amazon's 14,000 white-collar layoffs
had anything to do with AI.
He told investors the layoffs were, quote,
"not really financially driven, and it's not even really AI-driven,
not right now. It's culture." He explained that Amazon had added a ton of headcount, a lot of locations,
and a lot of business lines over recent years. So he said, "You end up with a lot more people than you
had before, and you end up with a lot more layers. Sometimes without realizing it, you can weaken
the ownership of the people that you have who are doing the actual work." He continued,
"It can lead to slowing you down as a leadership team. We are committed to operating like the
world's largest startup, and that means removing layers." Now, AI certainly is a part of this story,
but it's a much more nuanced part than has been reported.
I think it's clear that they're positioning for a future
where you don't need as many people to do the same amount of work.
At the same time, it's very clearly not AI replacing people right now.
In fact, if you just had to pin it down to one thing,
Jassy is basically saying this is a reversal of overhiring after the pandemic.
Pretty much, period, full stop.
Now, one of the substories of the way that analysts talk about this
that's so funny to me is that they're just having such a hard time
understanding that there might be a change in how companies want to spend their money.
Jack Farley, who hosts the great Monetary Matters podcast if you're interested in macro,
writes, "Very strong quarter from Amazon, no doubt.
But at the same time, Amazon's free cash flow is collapsing.
AI CapEx is consuming so much capital."
But also, they're spending free cash flows rather than raising debt.
It is such a foreign concept right now that there might be a better way to invest profits
to get a better return than buybacks.
I don't know, man, for my money, companies investing in a future that they're very convinced of
seems a lot better than just artificially juicing their stock price.
But what do I know?
Now, speaking of funding structures, while Amazon might be using free cash flow,
data center funding markets are on fire as Meta closes a record corporate bond deal.
Meta closed a $30 billion bond sale this week to fund their data center buildout.
The deal was the largest high-grade corporate debt issuance this year.
Sources said the deal attracted orders for $125 billion, showing
massively outsized demand from investors to fund AI CapEx.
According to Bloomberg data, this is the largest demand ever for a corporate bond sale.
Now, these bonds aren't paying wildly excessive returns.
They range between 5 and 40 years in maturity and are expected to pay 1.4 percentage points
over treasury yields for the longer dated instruments.
Interestingly, the sale was reported shortly after Meta's earnings report,
which saw the stock tumble on fears of ramping CapEx with no clear connection to ROI.
Turns out that the bond market has a much longer-term horizon.
And again, if you are trying to be a nuanced watcher of AI bubbles, keep an eye on whether
we start to see a shift in how much demand for corporate debt around this stuff there is.
If all of a sudden we see these sort of bond sales starting to scrape the bottom of the barrel,
that's when things might be looking a little tricky.
Next up, the latest in the layoff theme, YouTube has offered staff voluntary buyouts as they
restructure their organization around, you guessed it, AI.
Tech reporter Alex Heath broke the news on his Sources blog, obtaining a copy of an internal
memo from YouTube CEO Neal Mohan. Mohan wrote,
"Looking to the future, the next frontier for YouTube is AI, which has the potential to transform
every part of the platform. We need to set ourselves up to make the most of this opportunity."
He referred to this as the first update to YouTube's core leadership structure in a decade.
The structure will shift three product groups to report directly to Mohan, along with additional
reshuffling of divisions. YouTube said that no layoffs would take place as part of the shakeup.
Instead, U.S.-based staff who are, quote, ready for a new challenge, will be offered severance as part
of a voluntary exit program.
While it would be very easy to be cynical about the corporate speak here, I think it's actually
part and parcel of the same story we saw with Jassy's comments about Amazon layoffs, that when
we say AI layoffs, we might be talking about a couple of very different things.
On the one hand are the just straight-up "we don't need you anymore because AI can do your job"
sort of layoffs.
On the other hand are, we anticipate that 10 years down the line, because of a new digital
hybridized workforce that includes both agents and humans,
we're going to need fewer people than we do now for the current set of activities.
That doesn't mean that we won't need more people for new things that haven't come around yet,
but for the stuff that we offer now,
we need to start moving towards a different organizational structure.
I think we're going to see a lot of these types of preemptive and proactive restructurings
over the course of the coming year.
A little bit of investment news.
Nvidia plans to invest a billion dollars into AI coding company Poolside.
Poolside is primarily focused on building foundation models that are used for programming,
rather than IDEs and applications to support the use case. They released their first product last year
after being founded in 2023 by former GitHub CTO Jason Warner. Bloomberg reports that they're
currently seeking $2 billion in fundraising that would quadruple the company's valuation to $12 billion.
Nvidia already invested $500 million into Poolside during their last fundraising round in October of 2024.
Poolside plans to use a portion of their fresh capital to fund the purchase of Nvidia GB300
chips according to sources, and for Poolside, this fundraising lines up with plans to build out their own
compute ambitions. Earlier this month, Bloomberg reported on a deal with CoreWeave to build a two-gigawatt
capacity data center in West Texas to support their model training and inference needs.
On the feature side of the house, Canva has added new AI tools in a push to keep up with the
rapid evolution of design. The new tools leverage Gen AI to create and edit posters,
short videos, and presentations using natural language prompts. The release comes just a few months
after a major product overhaul in April, and notably within days of Adobe unveiling their
big AI update. The product update seeks to reframe Canva as
an AI-powered creative operating system, which integrates AI into every layer of content creation.
Essentially, they're looking to move away from template-based creation into a more freeform
workflow driven by Gen AI. One example of how this might work is their approach to semi-automated
brand advertising. Co-founder Cameron Adams said, "Canva automatically scans your website,
figures out who your audience is, what assets you use to promote your products, the message
it needs to send out, the formats you want to send it out in, makes it creative for you,
and you can deploy it directly to the platform without having to leave Canva."
This, I think, is part and parcel of the productization era of AI that I think we are quickly
moving into. Canva's updates are following the theme closely, with a real emphasis on integrating
existing AI capabilities into full product suites that allow people to get bigger chunks of work done.
That's a theme that we are going to very much keep an eye on, but for now, that's going to do it for
today's headlines. Next up, the main episode.
Small, nimble teams beat bloated consulting every time. Robots & Pencils partners
with organizations on intelligent, cloud-native systems powered by AI. They uncover human needs,
design AI solutions, and cut through complexity to deliver meaningful impact without the layers of bureaucracy.
As an AWS-certified partner, Robots & Pencils combines the reach of a large firm with the focus
of a trusted partner. With teams across the U.S., Canada, Europe, and Latin America, clients gain
local expertise and global scale. As AI evolves, they ensure you keep pace with change. And that
means faster results, measurable outcomes, and a partnership built to last. The right partner makes
progress inevitable. Partner with Robots & Pencils at robotsandpencils.com slash AI Daily Brief.
This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with
Infinite Code Context. Blitzy uses thousands of specialized AI agents that think for hours to
understand Enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders
start every development sprint with the Blitzy platform, bringing in their development
requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task.
Blitzy delivers 80% plus of the development work autonomously, while providing a guide for the final 20%
of human development work required to complete the sprint.
Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy
as their pre-IDE development tool, pairing it with their coding copilot of choice to bring
an AI-native SDLC into their org.
Visit blitzy.com and press get a demo to learn how Blitzy transforms your SDLC from AI-assisted to
AI-native.
What if AI wasn't just a buzzword, but a business imperative? On You Can with
AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking
enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers
real-world insights from leaders who are scaling AI with purpose, from aligning culture and
leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite
executive, strategist, or innovator, this podcast is your
front-row seat to the future of Enterprise AI. So go check it out at www.kpmg.us/AIpodcasts or
search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts.
There's a reason most Enterprise AI initiatives never make it to production. You can't find a
platform that's both powerful and secure enough. The result? AI budgets burn with zero business
impact. But not anymore. Airia is the Enterprise AI platform that delivers speed without
compromise. Unlike other platforms that force you to choose between fast deployment or secure operations,
Airia brings speed and security together. Launch AI quickly without cutting corners on compliance.
Scale rapidly without sacrificing governance. Move at the speed of business without moving past
your security requirements. Fortune 500 companies across finance, healthcare, retail,
legal, and more choose Airia because they deliver what seemed impossible: Enterprise AI that's fast enough
to beat the competition and secure enough to protect your most sensitive data. Ready for AI at full
speed with zero compromise? Visit airia.com to see the platform in action. That's AIRIA.com. Simplify
Enterprise AI.
Welcome back to the Daily Brief. Today we are talking about an interesting new way of measuring
agent performance. A big theme that we've talked about throughout this year is the need for better
benchmarks and evaluations. Right now, the benchmarks and evaluation suites that we get alongside
new models are highly academic. They're not really connected to the real world.
And that, I think, is part of why people are so hungry for alternatives, like the METR scale,
which tests the ability of AI to complete tasks of different durations. And yet, even metrics like this one
are still pretty theoretical and academic. For example, the default METR scale tests the duration
of task that a model can complete at a 50% or 80% success rate, but in the real world,
80%, and certainly 50%, success is probably not going to pass muster. Now, of course, this is all the
more important, as the narrative of AI layoffs starts to scream to the forefront. This fear is
based on AI being able to do the work of lots of people, and so whether it actually can is a pretty
important question. Now, recently there have been some new efforts explicitly designed to move
performance measurement to the real world. You might remember back at the end of September,
OpenAI introduced GDPval. The idea was to create a metric that measured model performance on
economically valuable real-world tasks. The way that GDPval
works is they took 44 occupations selected from the top nine industries that contribute to U.S. GDP,
cut that up into 1,320 specialized tasks, or rather worked with experienced professionals in each
of those areas to cut it up into those specialized tasks, and then put the output of the AI
through up to five rounds of expert review, including, as they put it, checks from other
task writers, additional occupational reviewers, and model-based validation. When this was launched,
I was incredibly excited about the move from evals into the real world, and so I sat up and took
notice when Dan Hendrycks, the director of the Center for AI Safety, tweeted out a new project
called the Remote Labor Index that was explicitly once again about testing AI in the real world.
So what is the Remote Labor Index? The creators say that the goal is to create a standardized
empirical measurement of AI's capability to automate practical remote computer-based work.
Now, importantly, the tasks that they tested on are not theoretical or academic. They were
real-world projects sourced directly from online freelance platforms primarily focused on
Upwork. Like OpenAI with GDPval, the group behind Remote Labor Index, or RLI, started by engaging
directly with a set of experienced Upwork freelancers, 358 to be exact. The average freelancer in this
pool had over 2,300 hours worked and over $23,000 in platform earnings. That may seem small,
but a lot of these folks are international and a lot of these jobs are 10 or 15 bucks at a time.
The researchers paid the professionals to provide samples of their past completed work,
with the goal being that the benchmark is grounded in actual economic transactions.
The thing that they were measuring against was projects that real clients defined, commissioned, and paid for.
Starting from a pool of 550 potential projects, they ran them through an extensive filtering and cleaning process,
removing anything that required physical labor, anything that couldn't be evaluated, or anything that had client interaction, like tutoring.
The final RLI dataset consisted of 240 unique,
high-quality projects, with each project containing a brief, which is the original text document
describing the work to be done, the input files, all the assets, data, or documents needed to
start, think a voiceover file, a PDF design, a spreadsheet, et cetera, and then the human
deliverable, the final gold standard product that the human professional delivered and got
paid for. That, of course, is what the AI's work would be judged against. To get a little bit more
of a sense of the projects that were involved, it took human professionals a mean of 28.9
hours to complete each project, with a median of 11.5 hours, meaning that there was a fairly
big variety in the difficulty. The average project cost $632, with a median of $200, and the
projects span 23 different categories with the top being video and animation at 13%, 3D modeling
and CAD at 12%, graphic design at 11%, game development at 10%, architecture at 7%, and audio at 10%.
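The gap between the mean and the median in those figures is what a right-skewed distribution looks like: a few very large projects pull the average up while the typical project stays much smaller. A minimal sketch of that effect follows. The eight values in `hours` are invented for illustration and are not the RLI data; they were simply chosen so that the mean and median land near the reported 28.9 and 11.5 hours.

```python
from statistics import mean, median

# Hypothetical completion times in hours for eight projects.
# These values are invented for illustration; they are NOT the RLI data.
hours = [2, 5, 8, 11, 12, 20, 40, 133]

avg = mean(hours)    # pulled upward by the single 133-hour project
mid = median(hours)  # midpoint of the sorted list, unaffected by the outlier

print(f"mean = {avg:.1f} h, median = {mid:.1f} h")
```

Dropping the 133-hour project from the list would move the mean far more than the median, which is why the two numbers are worth reporting together.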
And now for the results. In short, the current state-of-the-art AI agents perform
near the floor. The top performer was Manus, achieving an automation rate of just 2.5%.
What that means is that in a head-to-head comparison, a human evaluator decided the AI's deliverable
was at least as good as the human deliverable and would be accepted by a reasonable client,
only 2.5% of the time. Grok 4 and Sonnet 4.5 both had an automation rate of 2.1%.
GPT-5 had 1.7%, ChatGPT agent had 1.3%, and Gemini 2.5 Pro had just 0.8%.
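To get a feel for how small these numbers are, here is a rough sketch of the automation rate as described above: the share of projects where a human evaluator accepted the AI deliverable. The function name `automation_rate` is hypothetical, and the implied project counts assume every agent was scored on all 240 projects, which the episode does not state explicitly.

```python
def automation_rate(accepted: list[bool]) -> float:
    """Percentage of projects where the AI deliverable was judged acceptable."""
    if not accepted:
        raise ValueError("need at least one evaluated project")
    return 100 * sum(accepted) / len(accepted)


TOTAL_PROJECTS = 240  # size of the RLI dataset as stated in the episode

# Automation rates quoted in the episode, in percent.
reported = {
    "Manus": 2.5,
    "Grok 4": 2.1,
    "Sonnet 4.5": 2.1,
    "GPT-5": 1.7,
    "ChatGPT agent": 1.3,
    "Gemini 2.5 Pro": 0.8,
}

# Rough implied number of accepted projects, ASSUMING every agent was
# scored on all 240 projects (the episode does not say this explicitly).
implied = {name: round(rate / 100 * TOTAL_PROJECTS) for name, rate in reported.items()}

for name, count in implied.items():
    print(f"{name}: about {count} of {TOTAL_PROJECTS} projects")
```

On that assumption, the gap between the best and worst agent is only a handful of projects, so the ranking should be read loosely.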
So why did the projects tend to fail? When an AI's deliverable was rejected,
it was typically for one of the following reasons. 45.6% were rejected for
poor quality. In other words, the work was technically done, but it wasn't professionally acceptable.
They said that could be anything from childlike drawings to robotic and unnatural voiceovers.
For another 35.7% of rejections, it was about incompleteness, where the agent simply failed to
finish the job. 17.6% failed for technical and file
issues, the AI producing corrupt or empty files, or delivering the work in an unusable format,
and 14.8% had inconsistencies, where, for example, the deliverable lacked internal logic.
An example of this would be in an architecture project, a house's appearance changing
completely across different 3D views. Now, if you're doing the math and quickly realizing
that this adds up to more than 100%, certain projects failed for more than one issue. So this is
pretty rough, right? Maybe lower than some people would have thought. But I think that there are a bunch
of caveats and the paper makes some of them as well. First of all, there were areas where agents did
much better, specifically audio and image-related work, along with writing and data retrieval or
web scraping. This is unsurprising given how ubiquitous those types of use cases are with the
assistants that we have. The next important caveat is that they saw progress even though the overall
rate of automation was low. They used a second metric, an Elo or relative performance score,
to have an additional layer of analysis. They write, "For each project, a deliverable from two different
AIs is presented to human evaluators who judge which deliverable is closer to completing the project
successfully. If both agents successfully complete the project, then their deliverables are compared
on overall quality." And looking at that, they found that across all projects, AI agents are
steadily improving, even if still far below the human baseline. Another thing that I think is
fairly positive is the degree of completeness. Frankly, I think I would have expected a higher
incomplete rate than 35.7%. It feels like a fairly significant Rubicon to be
able to get the job completely done, even if it's not good enough. And the other really important
thing to note is that this test is not about judging task completion. It is instead about judging
an agent's ability to do an entire work stream that a client has defined as a project that
might have multiple steps, and which is actually worth paying for to them. The paper's authors
are very explicit about the choice that they're making here. In other words, it's different than
GDPval. They want a metric of full automation.
It's really important for us to remember, as we're discussing the spectrum of autonomy and automation,
that it is not always a priori going to be the case that the goal that we have for AI is full automation.
There is a whole conversation happening about this right now in the agentic coding world,
where the folks who are building those systems are trying to figure out how much to prioritize totally autonomous background work versus speeding up code assistance.
I think it's a great attempt, though, frankly, to have something that is exclusively concerning itself
with full automation. I think that the authors of the paper are right to identify that we need
metrics that actually give us a sense of how fast that level of automation is evolving outside
of what is going to be increasingly loud political noise. To some, this means that we should be
a little bit calmer when it comes to our doomsday prognostications. Rio Longacre writes,
"Reminder, AI in its current incarnation is great at automating specific tasks, not entire jobs.
Anyone who tells you otherwise is either deluded or lying. Obviously, this could
change over time, but the current hyperventilating about mass layoffs and a permanent underclass
are hyperbolic." And what's interesting is that now that this is out, people are using it to ask if there are
other important ways to measure automation as well. Amit on Twitter writes, "2.5% automation rate is actually
really high for general purpose AI. Consider this. These are random economically valuable projects spanning
multiple domains, no fine-tuning, no human in the loop, just raw agent capability." Burkhov implies that if you got
into very specific areas like software engineering, the automation rate would likely be much,
much higher. And so perhaps in addition to this base, we're going to need sector-specific measures and
tests. Look, overall, I am here for it. I think the more that we can measure and understand what's actually
going on and how we should be benchmarking the performance of our AIs, the better. Requisite shout out
to our AI ROI benchmarking study that's live at roisurvey.ai. It's exactly why we have this
project live. We want to help provide some of that information. And I really appreciate the group at the
Center for AI Safety and their partners for doing that with the Remote Labor Index.
For now, that's going to do it for today's AI Daily Brief.
Appreciate you listening or watching as always.
Until next time, peace.
