The AI Daily Brief: Artificial Intelligence News and Analysis - Does AI Secretly Slow Developers Down?
Episode Date: July 16, 2025

Are AI coding tools actually making developers less productive? A new study from the AI research nonprofit METR claims that developers using AI were 19% slower, even though they believed they were 20% faster. In this episode of the AI Daily Brief, we break down the study's design, explore criticisms from Emmett Shear and other AI leaders, and explain why this experiment may not reflect real-world AI coding performance. From model limitations to the steep learning curve of tools like Cursor, we explore how AI-assisted development isn't just about faster typing, but rather a new way of working that requires time, practice, and workflow adjustments.

Get Ad Free AI Daily Brief: https://patreon.com/AIDailyBrief

Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, are AI coding tools actually making developers slower?
Before that in the headlines, Apple is considering another big acquisition.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Hello, friends, quick announcements today.
First of all, thank you to today's sponsors, KPMG, Blitzy and Superintelligent.
To get an ad-free version of the show, go to patreon.com/AIDailyBrief.
That's going to start at just $3 a month.
And if you are interested in sponsoring the show, shoot me a note at nlw@breakdown.network,
and I can send you all the relevant information.
We're starting to get tightened up for the fall,
so if you are planning some big announcement or campaign,
or just interested in getting to this audience of awesome AI builders, executives, etc.,
shoot me a note.
With that, let's get into the latest rumors out of Cupertino.
Welcome back to the AI Daily Brief Headlines edition,
all the daily AI news you need in around five minutes.
We kick off today with the latest in the Apple AI saga,
where Bloomberg's Mark Gurman mentioned, almost in passing
in a larger piece about Apple's changing strategy,
that they are seriously considering an acquisition of Mistral.
Now, as always, it's not clear that Mistral is interested in said acquisition.
It's not clear that European regulators would allow said acquisition.
But it's interesting that these types of reports are now getting more commonplace,
especially given how resistant to acquiring its way out of problems Apple has been in the past.
Not much more to go on for this,
but I think taken alongside the Perplexity interest from a couple of weeks ago,
it feels like Apple is preparing to make some big move,
and goodness gracious, it couldn't come soon enough.
Now, speaking of acquisitions, we have a follow-up story on the Windsurf deal.
The whirlwind continues for the team left at Windsurf, as that leftover crew and the company
itself have now been formally acquired by Cognition, the creators of Devin.
Yesterday, we covered Google's acqui-hire deal that saw Windsurf's leaders and around 30 developers
join Google. That deal left behind a couple of hundred staff members and was controversial
as part of the continued breakup of the social contract around startup
exits. Google paid $2.4 billion for the licensing agreement; however, reports suggested that it
mostly went to pay out early investors and the 30 developers joining Google rather than the rest of the staff.
The consensus opinion was that the remaining staff, who now owned Windsurf, would have to do something
like split the $100 million in the treasury and wind down the company. And yet, later on Monday,
news broke that rival AI startup Cognition had acquired the remains of Windsurf for an undisclosed sum.
Jeff Wang, Windsurf's former head of business, saw himself thrust into the interim CEO
role on Friday after the rest of the executives left. In a LinkedIn post, he wrote,
The last 72 hours have been the wildest roller coaster ride of my career. I'm beyond thrilled to share
that Windsurf is joining forces with Cognition, the legendary team behind Devin, to reinvent
the future of software development. Windsurf has built the leading in-IDE agentic experience,
and Cognition has pioneered the leading autonomous software agent. Together, we're going to
redefine the future of software development. Cognition noted that Windsurf will now have full
access to Anthropic's Claude models again, which goes a long way toward making the product
viable. And more importantly for Windsurf staff, Cognition CEO Scott Wu stressed that, quote,
one of my top priorities in structuring this deal was to honor their talent, hard work, and
accomplishments in making Windsurf the great business that it is today. To that end, Jeff and I
worked together to ensure that every single employee is treated with respect and well taken care of in this
transaction. So is this the new mode of acquisitions? One company acquires the founders and leading
engineers, and then someone else acquires the rest of the company? I don't know, but it's certainly
a good ending for the story, at least for Windsurf.
Next up, an interesting one out of Meta: the new superintelligence lab is considering switching
to closed-source models.
The New York Times reports that top members of the new lab have discussed abandoning Meta's
large Llama 4 Behemoth model in favor of developing a closed model from scratch.
Behemoth was not included in the April release of Llama 4, with rumors that the training run had
produced unimpressive performance.
A switch to closed models would be a huge philosophical shift for Meta. Around the release
of the first LLaMA model in February 2023, Meta said that they were making it open as part of their
commitment to open science. They wrote, even with all the recent advancements in large language models,
full research access to them remains limited because of the resources that are required to train
and run such large models. This restricted access has limited researchers' ability to understand
how and why these LLMs work, hindering progress on efforts to improve their robustness and mitigate
known issues. Now, on the one hand, many saw this as a ruthless commercial decision rather than an
altruistic move. ChatGPT had been released just a few months prior, and the commercial
logic was presumed to be that a free Meta chatbot would quickly overtake the competition.
Now, keep in mind, this was a very different era in AI. The first public release of Anthropic's
Claude was still a month away, and Gemini was way, way in the future. Meta's chief AI scientist
Yann LeCun said the platform that will win will be the open one, and to be fair to Zuckerberg,
he has really hated being under Apple's thumb, so he does have a philosophical disposition towards
this, even if there were probably commercial reasons as well. Still, when it comes to whether they're
actually shifting strategy, Meta has so far denied it. They commented, we plan to continue
releasing leading open-source models. We haven't released everything we've developed historically,
and we expect to continue training a mix of open and closed models going forward. Honestly, I think we're
just going to have to wait and see what will happen now that we've got this new superintelligence lab
in play, and we don't even know how much their mandate is about commercial products in the short-term
versus some bigger, longer-term goal. One more on Meta: alongside
their massive spend on AI talent, they're also planning to build out a whole lot of compute.
In a post on Threads, Zuckerberg wrote,
For our superintelligence effort, I'm focused on building the most elite and talent-dense team in the industry.
We're also going to invest hundreds of billions of dollars into compute to build superintelligence.
We have the capital from our business to do this.
SemiAnalysis just reported that Meta is on track to be the first lab to bring a 1GW supercluster online.
We're actually building several multi-GW clusters.
We're calling the first one Prometheus and it's coming online in '26.
We're also building Hyperion, which will be able to scale up to 5 gigawatts over several years.
We're building multiple more Titan clusters as well.
Just one of these covers a significant part of the footprint of Manhattan.
Meta Superintelligence Labs will have industry-leading levels of compute and by far the greatest
compute per researcher.
xAI's Colossus supercluster currently operates at around 250 megawatts, although they have plans
to increase its capacity five-fold to 1.2 gigawatts.
The first Project Stargate data center in Abilene, Texas, aims to have one gigawatt of
compute online by the beginning of next year, but that timeline is starting to look a little stretched.
Adding some details, a Meta spokesperson said the target is to have two gigawatts operational
at the Hyperion facility by 2030, and then expand to five gigawatts within a few years.
In an interview with The Information, Zuckerberg discussed how megaclusters aren't just necessary
for Meta's superintelligence plans. They're also a key recruiting tool. He said,
a lot has been written about money and a lot of the numbers have been inaccurate,
but I think it discounts the other key reason why people are super excited to come work at Meta
Superintelligence Labs. One of the biggest is just that you have more leverage as a researcher. You have
more compute. Historically, when I was recruiting people to different parts of the company,
people asked, what's my scope going to be? Here, people say, I want the fewest people reporting
to me and the most GPUs. Having basically the most compute per researcher is a strategic
advantage, not just for doing the work, but for attracting the best people. In other words,
it all comes back to talent. That, however, is going to do it for today's AI Daily Brief Headlines
edition. Next up, the main episode.
Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's
potential could help give you a competitive edge, foster growth, and drive new value.
But here's the key. You don't need an AI strategy. You need to embed AI into your overall business
strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your
business strategy in a way that truly works and is built on trusted AI principles and platforms.
Check out real stories from KPMG to hear how AI is driving success with its clients
at www.kpmg.com/ai. Again, that's www.kpmg.com/ai.
This episode is brought to you by Blitzy. Now, I talk to a lot of technical and business leaders who are eager to
implement cutting-edge AI, but instead of building competitive moats, their best engineers are stuck
modernizing ancient codebases or updating frameworks just to keep the lights on. These projects,
like migrating Java 17 to Java 21, often mean staffing a team for a year or more. And sure,
copilots help, but we all know they hit context limits fast, especially on large legacy systems.
Blitzy flips the script. Instead of engineers doing 80% of the work, Blitzy's autonomous platform
handles the heavy lifting, processing millions of lines of code and making 80% of the required
changes automatically. One major financial firm used Blitzy to modernize a 20 million line
Java code base in just three and a half months, cutting 30,000 engineering hours and accelerating their
entire roadmap. Email Jack at Blitzy.com with Modernize in the subject line for prioritized
onboarding. Visit blitzy.com today before your competitors do.
Today's episode is brought to you by Superintelligent, specifically agent readiness audits.
Everyone is trying to figure out what agent use cases are going to be most impactful for
their business and the agent readiness audit is the fastest and best way to do that.
We use voice agents to interview your leadership and team and process all of that information
to provide an agent readiness score, a set of insights around that score, and a set of
highly actionable recommendations on both organizational gaps and high-value agent use cases that
you should pursue. Once you've figured out the right use cases, you can use our marketplace
to find the right vendors and partners. And what it all adds up to is a faster, better
agent strategy. Check it out at besuper.ai or email agents@besuper.ai to learn more.
Welcome back to the AI Daily Brief. Today we are talking about a study that is getting an
absolute ton of buzz. A group of developers were tested to see how much more productive AI coding tools
would make them. They assumed going into the study that they would be about 24, 25% more productive,
and even after the study concluded, thought that AI had made them 20% more productive,
but the study actually found that they were 19% less productive, 19% slower on this set of coding
tasks. Now, as you might imagine, this has been widely reported, with outlets like CNBC suggesting that it
marks a crack in the AI productivity bull case. The implications of something like this are big.
Billions and billions, if not trillions of dollars are being spent, assuming that AI is going to
make us more productive. Does this all throw it into question? Somehow, my guess is at this point,
if you are a regular listener to this show, you can virtually hear the cracking of my knuckles
and neck as I prepare to critique this particular study. Now, I do want to caveat things. In general,
I am always interested to see what this group comes out with.
They were the team who developed the methodology that suggested that agent capabilities
are doubling every seven months.
And so I don't think that this is from some shoddy organization or anything like that.
I just happen to disagree pretty fundamentally with at least one particular assumption
that I think is fairly important to the study and even more than that, the way that it's
being reported.
But let's get into what it actually said before I get into my critique.
Researchers from METR, which I pronounce "Meter" (I don't even know if that's what they call it),
which is a non-profit AI research firm,
recently tested 16 developers with what they identified as moderate AI skills,
something that we will come back to,
across hundreds of tasks on repositories where they had roughly five years of experience.
Each task was randomly assigned to either allow or disallow AI usage.
Before the test began, the programmers said that they believed that AI would reduce
completion time by 24%.
And after they finished, they believed that the AI had helped them get a 20% speed boost.
But the actual results, as I mentioned, found that
AI had actually slowed them down by 19%. The study showed a wide range of results across different
complexities of tasks. For tasks that take up to one hour, developers were basically the same
speed whether they used AI or not. The same was true for extremely long tasks that took
seven or eight hours. The only range where there was a big difference was in moderately complex
tasks that take between one hour and six hours. The results were extremely consistent, with AI-assisted
programmers slowing down as the tasks stretched to the two-hour mark. AI and non-AI programming
again converged as the task got even longer, with very little gap once they reached eight hours.
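To make the headline percentages concrete, here is a minimal arithmetic sketch. It assumes a hypothetical 120-minute task (not a figure from the study) and treats each percentage as a change in completion time: a 24% predicted reduction, a 20% perceived reduction, and a 19% measured increase. The study's numbers are averages across many tasks, so this is only an illustration, not a reproduction of METR's analysis.

```python
# Hypothetical baseline: minutes a task takes without AI (an assumption, not study data).
BASELINE_MINUTES = 120.0


def with_ai(baseline: float, change: float) -> float:
    """Completion time when AI changes it by `change` (e.g. -0.24 means 24% less time)."""
    return baseline * (1 + change)


forecast = with_ai(BASELINE_MINUTES, -0.24)   # what developers predicted beforehand
perceived = with_ai(BASELINE_MINUTES, -0.20)  # what they believed afterwards
measured = with_ai(BASELINE_MINUTES, 0.19)    # what the study measured

# Gap between how long the task felt and how long it actually took.
gap = measured - perceived

print(f"forecast {forecast:.1f} min, perceived {perceived:.1f} min, measured {measured:.1f} min")
print(f"perception gap: {gap:.1f} min ({gap / BASELINE_MINUTES:.0%} of baseline)")
```

On that hypothetical task, developers would have felt they saved about 24 minutes while actually spending about 23 minutes more, a gap of roughly 47 minutes between perception and measurement.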
The study also used screen recordings to break down how the programmers used their time across
AI and non-AI coding. When using AI, the researchers found that the time spent actively coding,
reading or researching, testing and debugging, and dealing with Git and the environment all went
down. Active coding and reading or researching saw the sharpest drops in time spent.
Idle or overhead time was the only category that went up when using AI, and the rest of the difference was
in time spent prompting, waiting for AI outputs, and reviewing the generated code.
Now, the researchers bundled the potential causes of the slowdown into five major categories.
The first was simply that programmers were overly optimistic about AI usefulness.
Second, they noted that some developers were so familiar with the codebase they were working on that AI didn't have much to offer.
Third, others were working on larger complex repositories where AI ran into context window limits.
Fourth, and a big one, was low AI reliability.
The programmers only accepted 44% of AI generations and spent 9% of their time cleaning up generated code.
And fifth, and finally, the developers reported more generalized context issues where the AI
didn't recognize the repository properly. Now, to their credit, the researchers here are not at all
suggesting that we should throw the baby out with the bathwater. There are caveats up and down
this thing, qualifications that try to not overstate the case. They said, for example,
the slowdown we observe does not imply that current AI tools do not often improve developers'
productivity. We find evidence that the high developer familiarity with repositories and the size and
maturity of the repositories, both contribute to the observed slowdown, and these factors do not
apply in many software development settings. So let's actually talk about what some of the challenges
here could have been. First of all, I think it's completely correct to acknowledge that this is a
different type of work. Coding with AI tools involves entirely new processes,
an entirely new emphasis and different types of work categories. One of the devs in the study,
Quentin Anthony, actually talked about this. Regarding the idea of distractions,
he writes, it's super easy to get distracted in the downtime while LLMs are generating.
The social media attention economy is brutal, and I think people spend 30 minutes scrolling
while quote-unquote waiting for their 30-second generation. All I can say on this one is that we should
know our own pitfalls and try to fill this LLM generation time productively. If the task requires
high focus, spend this time either working on a subtask or thinking about follow-up questions,
even if the model one-shots your question, what else don't I understand? If the task requires
low focus, do another small task in the meantime. As always, small digital hygiene steps help
with this. And holding aside any sort of focus on social media intrusions and distractions while
you're waiting for a prompt to resolve, there also is inevitably just going to be a shift in the
type of work that you have to do. Maybe you are writing less actual code, but you might spend
more time debugging. That was part of what the researchers actually directly found. And so I think
the summary of these two parts, and something we'll come back to at the end of this, is that
coding with AI tools is not just the same as coding but faster. It is a new process that requires
new thinking. Next, let's talk about the models that were used. Now, bad models might be an
overstatement for the sake of space in an AI-generated image here, but this study was conducted at the
beginning of this year. And while that seems like a short time ago, the models that people
now use to code are much more advanced than the ones they were using in this study. Ruben Bloom, who
works on LessWrong, also participated in the study and said, as a developer in the study, it's
striking to me how much more capable the models have gotten since February when I was participating.
I'm trying to recall if I was even using agents at the start.
Certainly the later models, Opus 4, Gemini 2.5 Pro, o3, could do just vastly more with less
guidance than 3.6, o1, etc.
For me, not going over my own data in the study, I could buy that maybe I was being slowed
down a few months ago, but it's much harder to believe now.
Now, Ruben, or Ruby, as he goes by, also validated the other piece that we were
just discussing, saying, I feel like historically a lot of my AI speed-up
gains were eaten by the fact that while a prompt was running, I'd look at something else,
Facebook, X, etc., and continue to do so for much longer than it took the prompt to run.
I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when
the prompt is done. I expect to reclaim a lot of AI gains this way.
Point being that while 3.5 and 3.7 Sonnet aren't bad models, contrary to my image here,
they are certainly less performant than all the tools we have now, right? This is before Claude Code.
This is before o3. This is before Gemini 2.5.
A fourth category that is incredibly important and is the one acknowledged most by the authors is the codebase context.
Remember, the authors wrote, we find evidence that the high developer familiarity with repositories
and the size and maturity of the repositories both contributed to the observed slowdown
and these factors do not apply in many software development settings.
Fellow AI podcaster Nathan Labenz put it more simply,
expert developers working in large codebases is known to be the setting where AI can help least.
Both of these factors matter.
The fact that they are working in large codebases
and that they are experts in those codebases
is, as Nathan points out,
something of a mismatched use case for some of these AI coding tools.
Which does not mean at all, by the way,
that it's not valuable to study them.
What it means, and this is something that's going to run
throughout this analysis,
is that it's very difficult to draw general conclusions
across the entire field of software developers
based on these 16 that were studied.
If you want to be generous to the researchers,
that's not exactly what they're trying to do,
but when you put out a study like this,
you know it's going to get amplified.
Now, Nathan, though, points out that there is another piece here,
that the fact that it's known that this isn't the best use case for AI coding
meant that the participants didn't have as much AI coding experience coming in,
and as he points out, not wrongly, given the work they do.
And this gets us to the biggest debate,
which is about learning curves and how to designate this set of developers
when it comes to their AI experience.
This is where some of the loudest disagreement comes in and where I have some of my biggest issues.
Now, I am not alone in this. In fact, perhaps the loudest critique of this paper has come from Emmett Shear.
Emmett was a co-founder of Twitch and spent a very hectic weekend as the interim CEO of OpenAI when Sam Altman was deposed.
He tweeted,
METR's analysis of this experiment is wildly misleading.
The results indicate that people who have approximately never used AI tools before
are less productive while learning to use the tools and says nothing about experienced
AI tool users.
Emmett continues,
I immediately found the claim suspect because it didn't jibe with my own experience
working with people who were using coding assistants.
But sometimes there are surprising results, so I dug in.
The first question, who were these developers in the study getting such poor results?
He then quoted from the methodology.
We recruited 16 experienced open source developers to work on 246 real tasks in their own
repositories.
So, Emmett writes, they sound like reasonably experienced software devs.
Back to the study: developers
have a range of experience using AI tools. 93% have prior experience with tools like ChatGPT,
but only 44% have experience using Cursor. Uh-oh, writes Emmett, so they haven't actually used
AI coding tools. They've like tried prompting an LLM to write code for them, but that's an
entirely different kind of experience as anyone who has used these tools can tell you.
They claim a range of experience using AI tools, yet only a single developer of their 16 had
more than a single week of experience using Cursor. They make it look like a range by breaking less than a week
into under one hour, one to 10 hours, 10 to 30 hours, and 30 to 50 hours of experience.
Given the steep learning curve for effectively using these AI tools, well, this division betrays
what I hope is just grossly negligent ignorance about the reality rather than intentional deception.
Of course, the one developer who did have more than one week of experience was 20% faster
instead of 20% slower. The authors note this fact, but then say, we are underpowered to draw
strong conclusions from this analysis and bury it in a figure's description in an appendix.
If the authors of the paper had made the claim, we tested experienced developers using AI tools for the first time
and found that at least during the first week they were slower rather than faster,
that would have been a moderately interesting finding and true.
Alas, that is not the claim they made.
Now, David Rein, one of the researchers, stood behind the methodology, responding,
devs had roughly the following prior LLM experience.
Seven out of the 16 had hundreds of hours,
seven of the 16 had 10 to 100 hours,
and two of the 16 had one to 10 hours.
We think describing this as moderate AI experience is fair.
Now, in the thread, David said, my guess is we'll have to agree to disagree.
And respectfully, I firmly, firmly disagree here.
First of all, using ChatGPT, even to code, is not the same as using a dedicated
agentic IDE.
Second, this is not a significant period of time when it comes to tool use.
40 hours, one workweek, is not a moderate amount of time to use a new tool,
especially when we were just discussing the fact that it
involves totally new patterns of working.
Emmett again wrote,
it's clear that the source of disagreement
is that I think using Cursor effectively
is a distinct skill from talking to GPT while you program
and expect fairly low transfer,
and the authors think it's a similar skill
and expect much higher.
When Megan Kinniment from METR pointed out
that devs whose primary IDE was Cursor
before the experiment were also slowed down on average,
although by less than the average in the study,
developer Tyler John pointed out,
this is useful, but there's only three of them.
And it sounds like the most experienced one
was dramatically sped up.
I think a study with experienced Cursor users is warranted to test the hypothesis.
Now, it's not just me and Emmett and a handful of Twitter commenters who are having the same
response. Programmer Simon Willison shared his thoughts, writing,
My personal theory is that getting a significant productivity boost from LLM assistants
and AI tools has a much deeper learning curve than most people expect. We see positive speed
up for the one developer who has more than 50 hours of Cursor experience, so it's plausible
that there is a high skill ceiling for using Cursor, such that developers with significant experience
see positive speed up. My intuition here is that the study mainly demonstrates that the learning curve
on AI-assisted development is high enough that asking developers to bake it into their existing
workflows reduces their performance while they climb that learning curve. And part of why
Emmett is so frustrated here is something which, while outside of METR's control, he believes
effectively that they should have anticipated, which is how the mainstream media is going to
amplify these results. As I mentioned, the headline is, study finds AI tools made open source
software developers 19% slower.
All over Twitter slash X, there are graphics like this one from Tech Juice.
Shocking studies suggest AI coding tools are slowing veteran developers by 19%.
And then there's the mainstream media.
Tech giants like Microsoft and Google are outsourcing more and more coding to AI in a productivity push.
But some new research shows the tools might not be as helpful as some expect.
These are stories that have the ability to impact markets in significant ways,
despite the fact that there are all these questions.
Now, it is an entirely different episode on what researchers find their responsibility to be when it comes to the potential for amplification by mainstream media.
Given how unbelievably politicized AI is and will continue to be, perhaps there is a higher burden there.
But like I said, that's sort of the subject for a different show.
The TLDR for me is not that I think that the study isn't useful.
It's that I don't think ultimately that it's saying what the researchers think it's saying.
I think it's much closer to what Emmett Shear argued that a specific type of developer
working on a specific type of codebase with a specific limited experience set with this
particular set of tools encountered all sorts of issues that made them temporarily slower
than had they not been using the tools.
So where does that leave us?
Well, from a research perspective, there are obvious needs for follow-ups here.
I think having developers working on different types of codebases with different levels of experience,
and specifically those who have actually worked more deeply with Cursor or any other
agentic IDE would be a really valuable follow-up.
At the same time, as Simon Willison points out, measuring developer productivity is notoriously
difficult, so even with that, we're still going to have to take everything with a grain of salt.
And by the way, to their credit, it appears that METR is actually thinking about expanding
this study, and I hope that they do.
We'll certainly report the updated results on the show when they come out.
But, holding aside the specifics, and trying to give credit where credit is due for what
the study uncovered, I do believe that it shows that we need to think about this as a different
type of work. As we shift the balance of quote-unquote coding work away from actually typing and
writing out code, new types of work are going to emerge, things like debugging and checking
results, and new types of challenges, such as social media time management, are going to become
even more significant. So if you are a company trying to understand these results, the worst possible
takeaway is to say, ah, see, it was all just overhyped. I guess we're just going to ignore those
tools. The best takeaway is to understand that these productivity gains are not free. They come with
a learning curve. They come with real work to reorganize the work. The faster you start,
and the more quickly you get to those serious hours of reps that seem to make a big difference,
the more likely you are to actually get this value. I don't know why people think it would be
any different. If you've ever tried to use any type of complex software in the past,
whether it's Salesforce or Adobe Photoshop or anything else, you know
you don't get mastery quickly.
You don't even get competence quickly.
Powerful tools, even agentic tools, require practice.
And if anything, this study shows that we can't shortcut that step.
But hey, man, look, if the goal is to generate a conversation, well done.
Because this has been a huge point of discussion for the entire AI engineering community and beyond.
And that in general is almost always a good thing.
For now, that's going to do it for today's AI Daily Brief.
Thanks, as always for listening or watching.
And until next time, peace.
