The AI Daily Brief: Artificial Intelligence News and Analysis - So Far, AI Can Only Automate 2.5% of Jobs

Starting point is 00:00:00 Today on the AI Daily Brief, a new measure for how much work AI can actually automate. Before that, in the headlines, Amazon's cloud group outperforms. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Robots and Pencils, Blitzy, KPMG, and Airia. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief or sign up on Apple

Starting point is 00:00:29 podcasts. To learn about sponsoring the show, shoot us a note at sponsors at aidailybrief.ai. To learn about speaking in jobs, go to AIDailyBrief.ai, and to participate in our ROI benchmarking survey and get access to a report around how much value everyone is getting out of the now hundreds and hopefully thousands of different use cases that people have contributed, go to ROISurvey.ai.

Starting point is 00:00:50 And now with all those dot AIs out of the way, let's get to the show. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. We kick off today with an upside surprise from Amazon. Now, I think at this point, it's fair to say that 2025 has been a bit of a rough one for Amazon. Alexa finally made it to market, but it was met with pretty underwhelming reviews.

Starting point is 00:01:13 Google appeared to swoop in on their big Anthropic partnership in a 10-figure deal that highlighted Google's custom silicon. While there was initial excitement around Amazon Nova, that buzz has petered out pretty significantly, and of course for the past week, they have become the poster boy for AI layoffs. Turns out that is very much not the full story. AWS posted quarterly revenues of $33 billion, up 20% from last year and recording their highest growth rate since 2022. Analysts had only forecast 18% growth, and there's nothing analysts love more than underestimating a fast-growing tech stock.

Starting point is 00:01:45 E-commerce sales were also a slight outperformance, coming in at $180 billion for 13% growth. As we saw with Google's earnings on Wednesday, ROI justifies CapEx, and Amazon is the current leader in AI spending. They bumped up the forecast for this year's CapEx to $125 billion, exceeding the analyst forecast of $119 billion. That's up 55% year-over-year, with the expectation it will continue increasing next year as well. CEO Andy Jassy said, We continue to see strong demand in AI and core infrastructure,

Starting point is 00:02:14 and we've been focused on accelerating capacity, adding more than 3.8 gigawatts in the past 12 months. He also had some fighting words for the other hyperscalers, commenting, percentage growth is a relative term. It's very different having 20% year-over-year growth on a $132 billion annualized run rate than to have a higher percentage growth rate on a meaningfully smaller annual revenue, which is the case with our competitors. The markets liked the spunk and they liked the news,

Starting point is 00:02:37 and the stock was up 13% in after-hours trading, making Amazon surprisingly the best performer during this big week of tech earnings. Now, one important note that came from the earnings call was Jassy disavowing the notion that Amazon's 14,000 white-collar layoffs were anything to do with AI. He told investors the layoffs were, quote, not really financially driven, and it's not even really AI-driven, not right now. It's culture. He explained that Amazon had added a ton of headcount, a lot of locations,

Starting point is 00:03:04 and a lot of business lines over recent years. So he said, you end up with a lot more people than you had before, and you end up with a lot more layers. Sometimes without realizing it, you can weaken the ownership of the people that you have who are doing the actual work. He continued, it can lead to slowing you down as a leadership team. We are committed to operating like the world's largest startup, and that means removing layers. Now, AI certainly is a part of this story, but it's a much more nuanced part than has been reported. I think it's clear that they're positioning for a future where you don't need as many people to do the same amount of work.

Starting point is 00:03:35 At the same time, it's very clearly not AI replacing people right now. In fact, if you just have to pin it down to one thing, Jassy is basically saying this is a reversal of overhiring after the pandemic. Pretty much period, full stop. Now, one of the substories of the way that analysts talk about this that's so funny to me is that they're just having such a hard time understanding that there might be a change in how companies want to spend their money. Jack Farley, who hosts the great Monetary Matters podcast if you're interested in macro,

Starting point is 00:04:05 writes, very strong quarter from Amazon, no doubt. But at the same time, Amazon's free cash flow is collapsing. AI CapEx is consuming so much capital. But also, they're spending free cash flows rather than raising debt. It is such a foreign concept right now that there might be a better way to invest profits to get a better return than buybacks. I don't know, man, for my money, companies investing in a future that they're very convinced of seems a lot better than just artificially juicing their stock price.

Starting point is 00:04:34 But what do I know? Now, speaking of funding structures, while Amazon might be using free cash flow, data center funding markets are on fire as Meta closes a record corporate bond deal. Meta closed a $30 billion bond sale this week to fund their data center buildout. The deal was the largest high-grade corporate debt issuance this year. Sources said the deal attracted orders for $125 billion, showing massively outsized demand from investors to fund AI CapEx. According to Bloomberg data, this is the largest demand ever for a corporate bond sale.

Starting point is 00:05:03 Now, these bonds aren't paying wildly excessive returns. They range between 5 and 40 years in maturity and are expected to pay 1.4 percentage points over treasury yields for the longer dated instruments. Interestingly, the sale was reported shortly after Meta's earnings report, which saw the stock tumble on fears of ramping CAPEX with no clear connection to ROI. Turns out that the bond market has a much longer term horizon. And again, if you were trying to be a nuanced watcher of AI bubbles, keep an eye on whether we start to see a shift in how much demand for corporate debt around this stuff there is.

Starting point is 00:05:32 If all of a sudden we see these sort of bond sales starting to scrape the bottom of the barrel, that's when things might be looking a little tricky. Next up, the latest in the layoff theme, YouTube has offered staff voluntary buyouts as they restructure their organization around, you guessed it, AI. Tech reporter Alex Heath broke the news on his Sources blog, obtaining a copy of an internal memo from YouTube CEO Neal Mohan. Mohan wrote, Looking to the future, the next frontier for YouTube is AI, which has the potential to transform every part of the platform. We need to set ourselves up to make the most of this opportunity.

Starting point is 00:06:03 He referred to this as the first update to YouTube's core leadership structure in a decade. The structure will shift three product groups to report directly to Mohan, along with additional reshuffling of divisions. YouTube said that no layoffs would take place as part of the shakeup. Instead, U.S.-based staff who are, quote, ready for a new challenge, will be offered severance as part of a voluntary exit program. While it would be very easy to be cynical about the corporate speak here, I think it's actually part and parcel of the same story we saw with Jassy's comments about Amazon layoffs, that when we say AI layoffs, we might be talking about a couple very different things.

Starting point is 00:06:37 On the one hand are just straight-up, we don't need you anymore because AI can do your job sort of layoffs. On the other hand are, we anticipate that 10 years down the line, because of a new digital hybridized workforce that includes both agents and humans, we're going to need fewer people than we do now for the current set of activities. That doesn't mean that we won't need more people for new things that haven't come around yet, but for the stuff that we offer now, we need to start moving towards a different organizational structure.

Starting point is 00:07:04 I think we're going to see a lot of these types of preemptive and proactive restructurings over the course of the coming year. A little bit of investment news. Nvidia plans to invest a billion dollars into AI coding company Poolside. Poolside is primarily focused on building foundation models that are used for programming, rather than IDEs and applications to support the use case. They released their first product last year after being founded in 2023 by former GitHub CTO Jason Warner. Bloomberg reports that they're currently seeking $2 billion in fundraising that would quadruple the company's valuation to $12 billion.

Starting point is 00:07:36 Nvidia already invested $500 million into Poolside during their last fundraising round in October of 2024. Poolside plans to use a portion of their fresh capital to fund the purchase of Nvidia GB300 chips according to sources, and for Poolside, this fundraising lines up with plans to build out their own compute ambitions. Earlier this month, Bloomberg reported on a deal with CoreWeave to build a two-gigawatt capacity data center in West Texas to support their model training and inference needs. On the feature side of the house, Canva has added new AI tools in a push to keep up with the rapid evolution of design. The new tools leverage Gen AI to create and edit posters, short videos, and presentations using natural language prompts. The release comes just a few months

Starting point is 00:08:12 after a major product overhaul in April, and notably within days of Adobe unveiling their big AI update. The product update seeks to reframe Canva as an AI-powered creative operating system, which integrates AI into every layer of content creation. Essentially, they're looking to move away from template-based creation into a more freeform workflow driven by Gen AI. One example of how this might work is their approach to semi-automated brand advertising. Co-founder Cameron Adams said, Canva automatically scans your website, figures out who your audience is, what assets you use to promote your products, the message it needs to send out, the formats you want to send it out in, makes it creative for you,

Starting point is 00:08:47 and you can deploy it directly to the platform without having to leave Canva. This, I think, is part and parcel of the productization era of AI that I think we are quickly moving into. Canva's updates are following the theme closely, with a real emphasis on integrating existing AI capabilities into full product suites that allow people to get bigger chunks of work done. That's a theme that we are going to very much keep an eye on, but for now, that's going to do it for today's headlines. Next up, the main episode. Small, nimble teams beat bloated consulting every time. Robots and Pencils partners with organizations on intelligent, cloud-native systems powered by AI. They uncover human needs,

Starting point is 00:09:29 design AI solutions, and cut through complexity to deliver meaningful impact without the layers of bureaucracy. As an AWS-certified partner, Robots and Pencils combines the reach of a large firm with the focus of a trusted partner. With teams across the U.S., Canada, Europe, and Latin America, clients gain local expertise and global scale. As AI evolves, they ensure you keep pace with change. And that means faster results, measurable outcomes, and a partnership built to last. The right partner makes progress inevitable. Partner with Robots and Pencils at robotsandpencils.com slash AI Daily Brief. This episode is brought to you by Blitzy, the Enterprise Autonomous Software Development Platform with Infinite Code Context. Blitzy uses thousands of specialized AI agents that think for hours to

Starting point is 00:10:11 understand enterprise-scale code bases with millions of lines of code. Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80% plus of the development work autonomously, while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice to bring an AI-native SDLC into their org.

Starting point is 00:10:43 Visit blitzy.com and press get a demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. What if AI wasn't just a buzzword, but a business imperative? On You Can with AI, we take you inside the boardrooms and strategy sessions of the world's most forward-thinking enterprises. Hosted by me, Nathaniel Whittemore, and powered by KPMG, this seven-part series delivers real-world insights from leaders who are scaling AI with purpose, from aligning culture and leadership to building trust, data readiness, and deploying AI agents. Whether you're a C-suite executive, strategist, or innovator, this podcast is your front row seat to the future of Enterprise AI. So go check it out at www.kpmg.comg.com slash AI podcasts or

Starting point is 00:11:27 search You Can with AI on Spotify, Apple Podcasts, or wherever you get your podcasts. There's a reason most Enterprise AI initiatives never make it to production. You can't find a platform that's both powerful and secure enough. The result, AI budgets burn with zero business impact, but not anymore. Airia is the Enterprise AI platform that delivers speed without compromise. Unlike other platforms that force you to choose between fast deployment or secure operations, Airia brings speed and security together. Launch AI quickly without cutting corners on compliance. Scale rapidly without sacrificing governance. Move at the speed of business without moving past your security requirements. Fortune 500 companies across finance, healthcare, retail,

Starting point is 00:12:06 legal, and more choose Airia because they deliver what seemed impossible. Enterprise AI that's fast enough to beat the competition and secure enough to protect your most sensitive data. Ready for AI at full speed with zero compromise? Visit airia.com to see the platform in action. That's AIRIA.com. Simplify Enterprise AI. Welcome back to the AI Daily Brief. Today we are talking about an interesting new way of measuring agent performance. A big theme that we've talked about throughout this year is the need for better benchmarks and evaluations. Right now, the benchmarks and evaluation suites that we get alongside new models are highly academic. They're not really connected to the real world. And that I think is part of why people are so hungry for alternatives, like the METR scale

Starting point is 00:12:57 that tests the ability of AI to complete tasks at different durations. And yet, even metrics like this one are still pretty theoretical and academic. For example, the default METR scale tests the duration with which a model can complete a task at 50% success rate or 80% success rate, but in the real world, 80 and certainly not 50% success is probably not going to pass muster. Now, of course, this is all the more important as the narrative of AI layoffs starts to scream to the forefront. This fear is based on AI being able to do the work of lots of people, and so whether it actually can is a pretty important question. Now, recently there have been some new efforts explicitly designed to move performance measurement to the real world. You might remember back at the end of September,

Starting point is 00:13:40 OpenAI introduced GDPval. The idea was to create a metric that measured model performance on economically valuable real-world tasks. The way that GDPval works is they took 44 occupations selected from the top nine industries that contribute to U.S. GDP, cut that up into 1,320 specialized tasks, or rather worked with experienced professionals in each of those areas to cut it up into those specialized tasks, and then put the output of the AI through up to five rounds of expert review, including, as they put it, checks from other task writers, additional occupational reviewers, and model-based validation. When this was launched, I was incredibly excited about taking the move from evals into the real world, and so I sat up and took

Starting point is 00:14:19 notice when Dan Hendrycks, the director of the Center for AI Safety, tweeted out a new project called the Remote Labor Index that was explicitly once again about testing AI in the real world. So what is the Remote Labor Index? The creators say that the goal is to create a standardized empirical measurement of AI's capability to automate practical remote computer-based work. Now, importantly, the tasks that they tested on are not theoretical or academic. They were real-world projects sourced directly from online freelance platforms, primarily focused on Upwork. Like OpenAI with GDPval, the group behind Remote Labor Index or RLI started by engaging directly with a set of experienced Upwork freelancers, 358 to be exact. The average freelancer in this

Starting point is 00:15:03 pool had over 2,300 hours worked and over $23,000 in platform earnings. That may seem small, but a lot of these folks are international and a lot of these jobs are 10 or 15 bucks at a time. The researchers paid the professionals to provide samples of their past completed work, with the goal being that the benchmark is grounded in actual economic transactions. The thing that they were measuring against was projects that real clients defined, commissioned, and paid for. Starting from a pool of 550 potential projects, they ran them through an extensive filtering and cleaning process, removing anything that required physical labor, anything that couldn't be evaluated, or anything that had client interaction like tutoring. The final RLI dataset consisted of 240 unique,

Starting point is 00:15:43 high-quality projects, with each project containing a brief, which is the original text document describing the work to be done, the input files, all the assets, data, or documents needed to start, think a voiceover file, a PDF design, a spreadsheet, et cetera, and then the human deliverable, the final gold standard product that the human professional delivered and got paid for. That, of course, is what the AI's work would be judged against. To get a little bit more of a sense of the projects that were involved, it took human professionals a mean of 28.9 hours to complete each project, with a median of 11.5 hours, meaning that there was a fairly big variety in the difficulty. The average project cost $632, with a median of $200, and the

Starting point is 00:16:19 projects span 23 different categories with the top being video and animation at 13%, 3D modeling and CAD at 12%, graphic design at 11%, game development at 10%, architecture at 7%, and audio at 10%. And now for the results. In short, the current state-of-the-art AI agents perform near the floor. The top performer was Manus, achieving an automation rate of just 2.5%. What that means is that in a head-to-head comparison, a human evaluator decided the AI's deliverable was at least as good as the human deliverable and would be accepted by a reasonable client only 2.5% of the time. Grok 4 and Sonnet 4.5 both had an automation rate of 2.1%. GPT-5 had 1.7, ChatGPT agent had 1.3, and Gemini 2.5 Pro had just a 0.8%

Starting point is 00:17:08 automation rate. So why did the projects tend to fail? When an AI's deliverable was rejected, it was typically for one of the following reasons. 45.6% were rejected for poor quality. In other words, the work was technically done, but it wasn't professionally acceptable. They said that could be anything from childlike drawings to robotic and unnatural voiceovers. For another 35.7% of rejections, it was about incompleteness, where the agent simply failed to finish the job. 17.6% failed for technical and file issues, the AI producing corrupt or empty files, or delivering the work in an unusable format, and 14.8% had inconsistencies, where, for example, the deliverable lacked internal logic.
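The four rejection shares quoted here add up to more than 100%, which can only happen if a single rejected deliverable is allowed to carry more than one failure label. The following is a minimal Python sketch of that arithmetic, not the paper's own analysis code. The four quoted percentages come from the episode; the five example deliverables, the label names, and the `label_shares` helper are hypothetical and exist only to illustrate the overlap.

```python
from collections import Counter

# Rejection shares as quoted in the episode (percent of rejected deliverables).
quoted_shares = {
    "poor_quality": 45.6,
    "incomplete": 35.7,
    "technical_or_file_issues": 17.6,
    "inconsistencies": 14.8,
}
total_share = round(sum(quoted_shares.values()), 1)

# Hypothetical data: five rejected deliverables, each tagged with one or
# more failure labels. These are made up for illustration only.
rejections = [
    {"poor_quality"},
    {"poor_quality", "incomplete"},
    {"incomplete"},
    {"technical_or_file_issues"},
    {"poor_quality", "inconsistencies"},
]


def label_shares(rejections):
    """Return the percent of rejected deliverables that carry each label."""
    counts = Counter(label for labels in rejections for label in labels)
    return {label: 100 * n / len(rejections) for label, n in counts.items()}


shares = label_shares(rejections)
print(f"quoted shares sum to {total_share}%")
print(f"toy shares sum to {sum(shares.values())}%")
```

Because each deliverable is counted once per label it carries, the per-label shares are not mutually exclusive and their sum can exceed 100%.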

Starting point is 00:17:49 An example of this would be in an architecture project, a house's appearance changing completely across different 3D views. Now, if you're doing the math and quickly realizing that this adds up to more than 100%, that's because certain projects failed for more than one reason. So this is pretty rough, right? Maybe lower than some people would have thought. But I think that there are a bunch of caveats and the paper makes some of them as well. First of all, there were areas where agents did much better, specifically audio and image-related work, along with writing and data retrieval or web scraping. This is unsurprising given how ubiquitous those types of use cases are with the assistants that we have. The next important caveat is that they saw progress even though the overall

Starting point is 00:18:24 rate of automation was low. They used a second metric, an Elo or relative performance score, to have an additional layer of analysis. They write, for each project, a deliverable from two different AIs is presented to human evaluators who judge which deliverable is closer to completing the project successfully. If both agents successfully complete the project, then their deliverables are compared on overall quality. And looking at that, they found that across all projects, AI agents are steadily improving, even if still far below the human baseline. Another thing that I think is fairly positive is the degree of completeness. Frankly, I think I would have expected a higher incomplete rate than 35.7. It feels like a fairly significant Rubicon to be

Starting point is 00:19:04 able to get the job completely done, even if it's not good enough. And the other really important thing to note is that this test is not about judging task completion. It is instead about judging an agent's ability to do an entire work stream that a client has defined as a project that might have multiple steps, and which is actually worth paying for to them. The paper's authors are very explicit about the choice that they're making here. In other words, it's different than GDPval. They want a metric of full automation. It's really important for us to remember, as we're discussing the spectrum of autonomy and automation, that it is not always a priori going to be the case that the goal that we have for AI is full automation.
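To make the "full automation" framing concrete, here is a minimal Python sketch of an all-or-nothing automation rate: each project counts only if the evaluator judged the AI deliverable at least as good as the human one, with no partial credit. This is an illustration under stated assumptions, not the RLI authors' scoring code. The `Evaluation` class and `automation_rate` function are hypothetical names, and the count of 6 accepted projects is simply the number that yields the reported 2.5% on a 240-project dataset, not a figure stated in the episode.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """One human verdict on one project (hypothetical structure)."""
    project_id: str
    ai_at_least_as_good: bool  # True only if the whole deliverable is acceptable


def automation_rate(evaluations):
    """Percent of projects where the AI deliverable was accepted outright."""
    if not evaluations:
        raise ValueError("need at least one evaluation")
    accepted = sum(e.ai_at_least_as_good for e in evaluations)
    return 100 * accepted / len(evaluations)


# Illustrative data: 240 projects, of which the first 6 are accepted.
evals = [Evaluation(f"project-{i}", i < 6) for i in range(240)]
rate = automation_rate(evals)
print(f"automation rate: {rate}%")
```

Under this kind of scoring, a deliverable that is 90% finished scores exactly the same as one that was never started, which is why the metric sits so close to the floor even while pairwise comparisons show agents improving.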

Starting point is 00:19:46 There is a whole conversation happening about this right now in the agentic coding world, where the folks who are building those systems are trying to figure out how much to prioritize totally autonomous background work versus speeding up code assistance. I think it's a great attempt, though, frankly, to have something that is exclusively concerning itself with full automation. I think that the authors of the paper are right to identify that we need metrics that actually give us a sense of how fast that level of automation is evolving outside of what is going to be increasingly loud political noise. To some, this means that we should be a little bit calmer when it comes to our doomsday prognostications. Rio Longacre writes, reminder, AI in its current incarnation is great at automating specific tasks, not entire jobs.

Starting point is 00:20:30 Anyone who tells you otherwise is either deluded or lying. Obviously, this could change over time, but the current hyperventilating about mass layoffs and a permanent underclass are hyperbolic. And what's interesting is that now that this is out, people are using it to ask if there are other important ways to measure automation as well. Amit on Twitter writes, 2.5% automation rate is actually really high for general purpose AI. Consider this. These are random economically valuable projects spanning multiple domains, no fine-tuning, no human in the loop, just raw agent capability. Burkhov implies that if you got into very specific areas like software engineering, the automation rate would likely be much, much higher. And so perhaps in addition to this base, we're going to need sector-specific measures and

Starting point is 00:21:11 tests. Look, overall, I am here for it. I think the more that we can measure and understand what's actually going on and how we should be benchmarking the performance of our AIs, the better. Requisite shout out to our AI ROI benchmarking study that's live at ROISurvey.ai. It's exactly why we have this project live. We want to help provide some of that information. And I really appreciate the group at the Center for AI Safety and their partners for doing that with the Remote Labor Index. For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always. Until next time, peace.
