The AI Daily Brief: Artificial Intelligence News and Analysis - Does AI Secretly Slow Developers Down?

Starting point is 00:00:00 Today on the AI Daily Brief, are AI coding tools actually making developers slower? Before that in the headlines, Apple is considering another big acquisition. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Hello, friends, quick announcements today. First of all, thank you to today's sponsors, KPMG, Blitzy and Superintelligent. To get an ad-free version of the show, go to patreon.com. That's going to start at just $3 a month. And if you are interested in sponsoring the show, shoot me a note at nLW at breakdown.

Starting point is 00:00:33 and I can send you all the relevant information. We're starting to get tightened up for the fall, so if you are planning some big announcement or campaign, or just interested in getting to this audience of awesome AI builders, executives, etc., shoot me a note. With that, let's get into the latest rumors out of Cupertino. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.

Starting point is 00:00:55 We kick off today with the latest in the Apple AI saga, where Bloomberg's Mark Gurman has mentioned almost haphazardly, in a larger piece about Apple's changing strategy, that they are seriously considering an acquisition of Mistral. Now, as always, it's not clear that Mistral is interested in said acquisition. It's not clear that European regulators would allow said acquisition. But it's interesting that these types of reports are now getting more commonplace, especially given how resistant to acquiring its way out of problems Apple has been in the past.

Starting point is 00:01:23 Not much more to go on for this, but I think taken alongside the Perplexity interest from a couple of weeks ago, it feels like Apple is preparing to make some big move, and goodness gracious, it couldn't come soon enough. Now, speaking of acquisitions, we have a follow-up story on the Windsurf deal. The whirlwind continues for the team left at Windsurf, as that leftover crew and the company itself have now been formally acquired by Cognition, who are the creators of Devin. Yesterday, we covered Google's acqui-hire deal that saw Windsurf's leaders and around 30 developers

Starting point is 00:01:55 join Google. That deal left behind a couple of hundred staff members and was controversial for being part of the continued breakup of the social contract of startup exits. Google paid $2.4 billion for the licensing agreement; however, reports suggested that it mostly went to pay out early investors and the 30 developers joining them rather than the rest of the staff. The consensus opinion was that the remaining staff, who now owned Windsurf, would have to do something like split the $100 million in the treasury and wind down the company. And yet, later on Monday, news broke that rival AI startup Cognition had acquired the remains of Windsurf for an undisclosed sum. Jeff Wang, Windsurf's former head of business, saw himself thrust into the interim CEO

Starting point is 00:02:33 role on Friday after the rest of the executives left. In a LinkedIn post, he wrote, The last 72 hours have been the wildest roller coaster ride of my career. I'm beyond thrilled to share that Windsurf is joining forces with Cognition, the legendary team behind Devin, to reinvent the future of software development. Windsurf has built the leading in IDE agentic experience, and Cognition has pioneered the leading autonomous software agent. Together, we're going to redefine the future of software development. Cognition noted that Windsurf will now have full access to Anthropic's Claude models again, which goes a long way to making the product viable. And more importantly for Windsurf staff, Cognition CEO Scott Wu reinforced that, quote,

Starting point is 00:03:08 one of my top priorities in structuring this deal was to honor their talent, hard work, and accomplishments in making Windsurf the great business that it is today. To that end, Jeff and I worked together to ensure that every single employee is treated with respect and well taken care of in this transaction. So is this the new mode of acquisitions? One company acquires the founders and leading engineers, and then someone else acquires the rest of the company? I don't know, but it's certainly a good ending for the story, at least for Windsurf. Next up, an interesting one out of Meta: the new superintelligence lab is considering switching to closed-source models.

Starting point is 00:03:41 The New York Times reports that top members of the new lab have discussed abandoning Meta's large Llama 4 Behemoth model in favor of developing a closed model from scratch. Behemoth was not included in the April release of Llama 4, with rumors that the training run had produced unimpressive performance. A switch to closed models would be a huge philosophical shift for Meta. Surrounding the release of the first Llama model in February 2023, Meta said that they were making it open as part of their commitment to open science. They wrote, even with all the recent advancements in large language models, full research access to them remains limited because of the resources that are required to train

Starting point is 00:04:14 and run such large models. This restricted access has limited researchers' ability to understand how and why these LLMs work, hindering progress on efforts to improve their robustness and mitigate known issues. Now, on the one hand, many saw this as a ruthless commercial decision rather than an altruistic move. ChatGPT had been released just a few months prior, and the commercial logic was presumed to be that a free Meta chatbot would quickly overtake the competition. Now, keep in mind, this was a very different era in AI. The first public release of Anthropic's Claude was still a month away, and Gemini was way, way in the future. Meta's chief AI scientist Yann LeCun said the platform that will win will be the open one, and to be fair to Zuckerberg,

Starting point is 00:04:51 he has really hated being under Apple's thumb, so he does have a philosophical disposition towards this, even if there were probably commercial reasons as well. Still, when it comes to whether they're actually shifting the strategy, Meta has so far denied it. They commented, we plan to continue releasing leading open-source models. We haven't released everything we've developed historically, and we expect to continue training a mix of open and closed models going forward. Honestly, I think we're just going to have to wait and see what will happen now that we've got this new superintelligence lab in play, and we don't even know how much their mandate is about commercial products in the short term versus some bigger, longer-term goal. One more on Meta: alongside

Starting point is 00:05:25 their massive spend on AI talent, they're also planning to build out a whole lot of compute. In a post on Threads, Zuckerberg wrote, For our superintelligence effort, I'm focused on building the most elite and talent-dense team in the industry. We're also going to invest hundreds of billions of dollars into compute to build superintelligence. We have the capital from our business to do this. SemiAnalysis just reported that Meta is on track to be the first lab to bring a one-gigawatt supercluster online. We're actually building several multi-gigawatt clusters.

Starting point is 00:05:52 We're calling the first one Prometheus and it's coming online in '26. We're also building Hyperion, which will be able to scale up to 5 gigawatts over several years. We're building multiple more Titan clusters as well. Just one of these covers a significant part of the footprint of Manhattan. Meta Superintelligence Labs will have industry-leading levels of compute and by far the greatest compute per researcher. xAI's Colossus supercluster currently operates at around 250 megawatts, although they have plans to increase its capacity five-fold to 1.2 gigawatts.

Starting point is 00:06:19 The first Project Stargate data center in Abilene, Texas, aims to have one gigawatt of compute online by the beginning of next year, but that timeline is starting to look a little stretched. Adding some details, a Meta spokesperson said the target is to have two gigawatts operational at the Hyperion facility by 2030, and then expand to five gigawatts within a few years. In an interview with The Information, Zuckerberg discussed how megaclusters aren't just necessary for Meta's superintelligence plan. They're also a key recruiting tool. He said, a lot has been written about money and a lot of the numbers have been inaccurate, but I think it discounts the other key reason why people are super excited to come work at Meta

Starting point is 00:06:52 Superintelligence Labs. One of the biggest is just that you have more leverage as a researcher. You have more compute. Historically, when I was recruiting people to different parts of the company, people asked, what's my scope going to be? Here, people say, I want the fewest people reporting to me and the most GPUs. Having basically the most compute per researcher is a strategic advantage, not just for doing the work, but for attracting the best people. In other words, it all comes back to talent. That, however, is going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by KPMG. In today's

Starting point is 00:07:24 fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate

Starting point is 00:07:41 AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients, at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI. This episode is brought to you by Blitzy. Now, I talk to a lot of technical and business leaders who are eager to

Starting point is 00:08:07 implement cutting-edge AI, but instead of building competitive moats, their best engineers are stuck modernizing ancient codebases or updating frameworks just to keep the lights on. These projects, like migrating Java 17 to Java 21, often mean staffing a team for a year or more. And sure, copilots help, but we all know they hit context limits fast, especially on large legacy systems. Blitzy flips the script. Instead of engineers doing 80% of the work, Blitzy's autonomous platform handles the heavy lifting, processing millions of lines of code and making 80% of the required changes automatically. One major financial firm used Blitzy to modernize a 20 million line Java codebase in just three and a half months, cutting 30,000 engineering hours and accelerating their

Starting point is 00:08:46 entire roadmap. Email jack@blitzy.com with Modernize in the subject line for prioritized onboarding. Visit blitzy.com today before your competitors do. Today's episode is brought to you by Superintelligent, specifically agent readiness audits. Everyone is trying to figure out what agent use cases are going to be most impactful for their business, and the agent readiness audit is the fastest and best way to do that. We use voice agents to interview your leadership and team and process all of that information to provide an agent readiness score, a set of insights around that score, and a set of highly actionable recommendations on both organizational gaps and high-value agent use cases that

Starting point is 00:09:25 you should pursue. Once you've figured out the right use cases, you can use our marketplace to find the right vendors and partners. And what it all adds up to is a faster, better agent strategy. Check it out at besuper.ai or email agents@besuper.ai to learn more. Welcome back to the AI Daily Brief. Today we are talking about a study that is getting an absolute ton of buzz. A group of developers were tested to see how much more productive AI coding tools would make them. They assumed going into the study that they would be about 24, 25% more productive, and even after the study concluded, thought that AI had made them 20% more productive, but the study actually found that they were 19% less productive, 19% slower on this set of coding

Starting point is 00:10:09 tasks. Now, as you might imagine, this has been widely reported on outlets like CNBC, suggesting that it makes for a crack in the AI productivity bull case. The implications of something like this are big. Billions and billions, if not trillions, of dollars are being spent, assuming that AI is going to make us more productive. Does this all throw it into question? Somehow, my guess is at this point, if you are a regular listener to this show, you will virtually hear the cracking of my knuckles and neck as I prepare to critique this particular study. Now, I do want to caveat things. In general, I am always interested to see what this group comes out with. They were the team who developed the methodology that suggested that agent capabilities

Starting point is 00:10:48 are doubling every seven months. And so I don't think that this is from some shoddy organization or anything like that. I just happen to disagree pretty fundamentally with at least one particular assumption that I think is fairly important to the study and, even more than that, the way that it's being reported. But let's get into what it actually said before I get into my critique. Researchers from METR, which by the way, I don't even know if they call it "meter." That's what I call it, METR,

Starting point is 00:11:12 which is a non-profit AI research firm, recently tested 16 developers with what they identified as moderate AI skills, something that we will come back to, across hundreds of tasks in which they had roughly five years of experience. Each task was randomly assigned to either allow or disallow AI usage. Before the test began, the programmers said that they believed that AI would reduce completion time by 24%. And after they finished, they believed that the AI had helped them get a 20% speed boost.

Starting point is 00:11:39 But the actual results, as I mentioned, found that AI had actually slowed them down by 19%. The study showed a wide range of results across different complexities of tasks. For tasks that take up to one hour, developers were basically the same speed whether they used AI or not. The same was true for extremely long tasks that took seven or eight hours. The only range where there was a big difference was in moderately complex tasks that take between one hour and six hours. The results were extremely consistent, with AI-assisted programmers slowing down as the tasks stretched to the two-hour mark. AI and non-AI programming again converged as the task got even longer, with very little gap once they reached eight hours.
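
To make those three percentages concrete, here is a minimal sketch of the arithmetic. It assumes each figure describes a change in task completion time relative to working without AI. The 60-minute baseline task and the helper name `time_with_ai` are hypothetical, chosen only for illustration, and none of this comes from the study's own code or data.

```python
def time_with_ai(baseline_minutes: float, pct_change: float) -> float:
    """Return completion time after applying a percentage change to the baseline.

    A negative pct_change means the task got faster; a positive one means slower.
    """
    return baseline_minutes * (1 + pct_change / 100)


# Hypothetical task that takes 60 minutes without AI (an illustrative assumption).
baseline = 60.0

forecast = time_with_ai(baseline, -24)   # what developers expected beforehand
perceived = time_with_ai(baseline, -20)  # what developers believed afterwards
observed = time_with_ai(baseline, 19)    # what the study reportedly measured

print(f"forecast:  {forecast:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"observed:  {observed:.1f} min")
print(f"gap between perceived and observed: {observed - perceived:.1f} min")
```

Under these assumptions, the point of the sketch is the size of the gap between what the developers believed happened and what was measured on the same hypothetical task.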

Starting point is 00:12:15 The study also used screen recordings to break down how the programmers used their time across AI and non-AI coding. When using AI, the researchers found the time spent actively coding, reading, researching, testing, and debugging, and dealing with Git and the environment all went down. Active coding and reading or researching saw the sharpest drops in time spent. Idle or overhead time was the only factor that went up when using AI, and the difference was in time spent prompting, waiting for AI outputs, and reviewing the generated code. Now, the researchers bundled the potential causes of the slowdown into five major categories. The first was simply that programmers were overly optimistic about AI usefulness.

Starting point is 00:12:48 Second, they noted that some developers were too familiar with the codebase they were working on, so AI didn't have much to offer. Third, others were working on larger complex repositories where AI ran into context window limits. Fourth, and a big one, was low AI reliability. The programmers only accepted 44% of AI generations and spent 9% of their time cleaning up generated code. And fifth, and finally, the developers reported more generalized context issues where the AI didn't recognize the repository properly. Now, to their credit, the researchers here are not at all suggesting that we should throw the baby out with the bathwater. There are caveats up and down this thing, qualifications that try to not overstate the case. They said, for example,

Starting point is 00:13:27 the slowdown we observe does not imply that current AI tools do not often improve developers' productivity. We find evidence that the high developer familiarity with repositories and the size and maturity of the repositories both contribute to the observed slowdown, and these factors do not apply in many software development settings. So let's actually talk about what some of the challenges here could have been. First of all, I think it's completely correct to acknowledge that this is a different type of working. Coding with coding tools involves entirely new processes, an entirely new emphasis, and different types of work categories. One of the devs in the study, Quentin Anthony, actually talked about this. Regarding the idea of distractions,

Starting point is 00:14:06 he writes, it's super easy to get distracted in the downtime while LLMs are generating. The social media attention economy is brutal, and I think people spend 30 minutes scrolling while quote-unquote waiting for their 30-second generation. All I can say on this one is that we should know our own pitfalls and try to fill this LLM generation time productively. If the task requires high focus, spend this time either working on a subtask or thinking about follow-up questions. Even if the model one-shots your question, what else don't I understand? If the task requires low focus, do another small task in the meantime. As always, small digital hygiene steps help with this. And holding aside any sort of focus on social media intrusions and distractions while

Starting point is 00:14:41 you're waiting for a prompt to resolve, there also is inevitably just going to be a shift in the type of work that you have to do. Maybe you are writing less actual code, but you might spend more time debugging. That was part of what the researchers actually directly found. And so I think the summary of these two parts, and something we'll come back to at the end of this, is that coding with AI tools is not just the same as coding but faster. It is a new process that requires new thinking. Next, let's talk about the models that were used. Now, bad models might be an overstatement for the sake of space in an AI-generated image here, but this study was conducted at the beginning of this year. And while that seems like a short time ago, all of the models that people

Starting point is 00:15:19 use to code are much more advanced than the ones that they were using in this study. Ruben Bloom, who works on LessWrong, also participated in the study and said, as a developer in the study, it's striking to me how much more capable the models have gotten since February when I was participating. I'm trying to recall if I was even using agents at the start. Certainly the later models, Opus 4, Gemini 2.5 Pro, o3, could do just vastly more with less guidance than 3.6, o1, etc. For me, not going over my own data in the study, I could buy that maybe I was being slowed down a few months ago, but it's much harder to believe now.

Starting point is 00:15:52 Now, Ruben, or Ruby, as he goes by, also did validate the other piece that we were just discussing, saying, I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I'd look at something else, Facebook, X, etc., and continue to do so for much longer than it took the prompt to run. I discovered two days ago that Cursor has or now has a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of AI gains this way. Point being that while 3.5 and 3.7 Sonnet aren't bad models, contrary to my image here, they are certainly less performant than all the tools we have now, right? This is before Claude Code.

Starting point is 00:16:26 This is before o3. This is before 2.5. A fourth category that is incredibly important and is the one acknowledged most by the authors is the codebase context. Remember, the authors wrote, we find evidence that the high developer familiarity with repositories and the size and maturity of the repositories both contributed to the observed slowdown and these factors do not apply in many software development settings. Fellow AI podcaster Nathan Labenz put it more simply, expert developers working in large codebases is known to be the setting where AI can help least. Both of these factors matter.

Starting point is 00:16:59 The fact that they are working in large codebases and that they are experts in those codebases is, as Nathan points out, something of a mismatched use case for some of these AI coding tools. Which does not mean at all, by the way, that it's not valuable to study them. What it means, and this is something that's going to run throughout this analysis,

Starting point is 00:17:16 is that it's very difficult to draw general conclusions across the entire field of software developers based on these 16 that were studied. If you want to be generous to the researchers, that's not exactly what they're trying to do, but when you put out a study like this, you know it's going to get amplified. Now, Nathan, though, points out that there is another piece here,

Starting point is 00:17:35 that the fact that it's known that this isn't the best use case for AI coding meant that the participants didn't have as much AI coding experience coming in, and as he points out, not wrongly, given the work they do. And this gets us to the biggest debate, which is about learning curves and how to designate this set of developers when it comes to their AI experience. This is where some of the loudest disagreement comes in and where I have some of my biggest issues. Now, I am not alone in this. In fact, perhaps the loudest critique of this paper has come from Emmett Shear.

Starting point is 00:18:07 Emmett was a co-founder at Twitch and spent a very hectic weekend as the CEO of OpenAI when Sam Altman was deposed. He tweeted, METR's analysis of this experiment is wildly misleading. The results indicate that people who have approximately never used AI tools before are less productive while learning to use the tools and says nothing about experienced AI tool users. Emmett continues, I immediately found the claim suspect because it didn't jibe with my own experience

Starting point is 00:18:33 working with people who were using coding assistants. But sometimes there are surprising results, so I dug in. The first question, who were these developers in the study getting such poor results? He then quoted from the methodology. We recruited 16 experienced open source developers to work on 246 real tasks in their own repositories. So, Emmett writes, they sound like reasonably experienced software devs. Back to the study: developers

Starting point is 00:18:56 have a range of experience using AI tools. 93% have prior experience with tools like ChatGPT, but only 44% have experience using Cursor. Uh-oh, writes Emmett, so they haven't actually used AI coding tools. They've like tried prompting an LLM to write code for them, but that's an entirely different kind of experience, as anyone who has used these tools can tell you. They claim a range of experience using AI tools, yet only a single developer of their 16 had more than a single week of experience using Cursor. They make it look like a range by breaking less than a week into under one hour, one to 10 hours, 10 to 30 hours, and 30 to 50 hours of experience. Given the steep learning curve for effectively using these AI tools, well, this division betrays

Starting point is 00:19:35 what I hope is just grossly negligent ignorance about the reality rather than intentional deception. Of course, the one developer who did have more than one week of experience was 20% faster instead of 20% slower. The authors note this fact, but then say, we are underpowered to draw strong conclusions from this analysis, and bury it in a figure's description in an appendix. If the authors of the paper had made the claim, we tested experienced developers using AI tools for the first time and found that at least during the first week they were slower rather than faster, that would have been a moderately interesting finding and true. Alas, that is not the claim they made.

Starting point is 00:20:07 Now, David Rein, one of the researchers, stood behind the methodology, responding, devs had roughly the following prior LLM experience. Seven out of the 16 had hundreds of hours, seven of the 16 had 10 to 100 hours, and two of the 16 had one to 10 hours. We think describing this as moderate AI experience is fair. Now, in the thread, David said, my guess is we'll have to agree to disagree. And respectfully, I firmly, firmly disagree here.

Starting point is 00:20:32 First of all, using ChatGPT, even to code, is not the same as using a dedicated agentic IDE. Second, this is not a significant period of time when it comes to tool use. 40 hours, one workweek, is not a moderate amount of time to use a new tool, especially when we were just discussing the fact that it involves totally new patterns of working. Emmett again wrote, it's clear that the source of disagreement

Starting point is 00:20:56 is that I think using Cursor effectively is a distinct skill from talking to ChatGPT while you program and expect fairly low transfer, and the authors think it's a similar skill and expect much higher. When Megan Kinniment from METR pointed out that devs whose primary IDE was Cursor before the experiment were also slowed down on average,

Starting point is 00:21:12 although by less than the average in the study, developer Tyler John pointed out, this is useful, but there's only three of them. And it sounds like the most experienced one was dramatically sped up. I think a study with experienced Cursor users is warranted to test the hypothesis. Now, it's not just me and Emmett and a handful of Twitter commenters who are having the same response. AI programmer Simon Willison shared his thoughts, writing,

Starting point is 00:21:34 My personal theory is that getting a significant productivity boost from LLM assistants and AI tools has a much steeper learning curve than most people expect. We see positive speed up for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speed up. My intuition here is that the study mainly demonstrates that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve. And part of why Emmett is so frustrated here is something which, while outside of METR's control, he believes

Starting point is 00:22:11 effectively that they should have anticipated, which is how the mainstream media is going to amplify these results. Again, I mentioned the headline is, study finds AI tools made open source software developers 19% slower. All over Twitter slash X, there are graphics like this one from Tech Juice. Shocking study suggests AI coding tools are slowing veteran developers by 19%. And then there's the mainstream media. Tech giants like Microsoft and Google are outsourcing more and more coding to AI in a productivity push.

Starting point is 00:22:39 But some new research shows the tools might not be as helpful as some expect. These are stories that have the ability to impact markets in significant ways, despite the fact that there are all these questions. Now, it is an entirely different episode on what researchers find their responsibility to be when it comes to the potential for amplification by mainstream media. Given how unbelievably politicized AI is and will continue to be, perhaps there is a higher burden there. But like I said, that's sort of the subject for a different show. The TLDR for me is not that I think that the study isn't useful. It's that I don't think ultimately that it's saying what the researchers think it's saying.

Starting point is 00:23:18 I think it's much closer to what Emmett Shear argued: that a specific type of developer working on a specific type of codebase with a specific limited experience set with this particular set of tools encountered all sorts of issues that made them temporarily slower than had they not been using the tools. So where does that leave us? Well, from a research perspective, there are obvious needs for follow-ups here. I think having developers working on different types of codebases with different levels of experience, and specifically those who have actually worked more deeply with Cursor or any other

Starting point is 00:23:50 agentic IDE would be a really valuable follow-up. At the same time, as Simon Willison points out, measuring developer productivity is notoriously difficult, so even with that, we're still going to have to take everything with a grain of salt. And by the way, to their credit, it appears that METR is actually thinking about expanding this study, and I hope that they do. We'll certainly report the updated results on the show when they come out. But, holding aside the specifics, and trying to give credit where credit is due for what the study uncovered, I do believe that it does show that we need to think about this as a different

Starting point is 00:24:22 type of work. As we shift the balance of quote-unquote coding work away from actually typing and writing out code, new types of work are going to emerge, things like debugging and checking results, and new types of challenges, such as social media time management, are going to become even more significant. So if you are a company trying to understand these results, the worst possible takeaway is to say, ah, see, it was all just overhyped. I guess we're just going to ignore those tools. The best takeaway is to understand that these productivity gains are not free. They come with a learning curve. They come with real work to reorganize the work. The faster you start, and the more quickly you get to those serious hours of reps that seem to make a big difference,

Starting point is 00:25:07 the more likely to actually get this value you are. I don't know why people think it would be any different. If you've ever tried to use any type of complex software in the past, whether it's Salesforce or Adobe Photoshop or anything. You don't get mastery quickly. You don't even get competence quickly. Powerful tools, even agentic tools, require practice. And if anything, this study shows that we can't shortcut that step. But hey, man, look, if the goal is to generate a conversation, well done.

Starting point is 00:25:34 Because this has been a huge point of discussion for the entire AI engineering community and beyond. And that in general is almost always a good thing. For now, that's going to do it for today's AI Daily Brief. Thanks, as always for listening or watching. And until next time, peace.
