The AI Daily Brief: Artificial Intelligence News and Analysis - What AI Coding Agents Can Do Right Now

Episode Date: February 20, 2025

AI coding tools are advancing rapidly, but how effective are they for freelance jobs? OpenAI's new SWE-Lancer benchmark evaluated top AI models on 1,400 software engineering tasks from Upwork. The outcome? Claude 3.5 Sonnet surpassed OpenAI's models, completing more tasks and earning the highest simulated payout. Additionally, "vibe coding" is transforming software development into a more interactive, less technical process. Brought to you by: KPMG - Go to www.kpmg.us/ai to learn more about how KPMG can help you drive value with our AI solutions. Vanta - Simplify compliance - https://vanta.com/nlw The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score. The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, OpenAI released a paper, effectively seeking to test how competent their leading models are in real-world coding applications. Before that in the headlines, former OpenAI CTO Mira Murati has officially announced her new company, Thinking Machines. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. OpenAI has had a lot of talent departures over the last year and a half or so. In some cases, it's felt like a protest over the direction the company was going and indeed has explicitly
Starting point is 00:00:39 been shared as such. In others, it's about people making a boatload of money and just wanting to do something different for a while. And then still in others, it's about building something new outside of the constraints of that company. And among that set, one of the most closely watched people has been former CTO Mira Murati. For months now, there have been rumors around what she's building, mostly fueled by departures and recruitment from OpenAI and Anthropic to join Murati at some as yet unrevealed company. Now, however, that company has been officially announced. Yesterday, Mira tweeted, I started Thinking Machines Lab alongside a remarkable team of scientists, engineers, and builders. We're building three things, helping people adapt AI systems to work
Starting point is 00:01:21 for their specific needs, developing strong foundations to build more capable AI systems, fostering a culture of open science that helps the whole field understand and improve these systems. Our goal is simple, advance AI by making it broadly useful and understandable through solid foundations, open science, and practical applications. Alongside it, they published a website, thinkingmachines.ai, where they write, we're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. While AI capabilities have advanced dramatically, key gaps remain. The scientific community's understanding of frontier AI systems lags behind rapidly advancing capabilities. Knowledge of how these systems are trained
Starting point is 00:01:56 is concentrated within the top research labs, limiting both the public discourse on AI and people's abilities to use AI effectively. And despite their potential, these systems remain difficult for people to customize to their specific needs and values. To bridge the gaps, we're building Thinking Machines Lab to make AI systems more widely understood, customizable, and generally capable. Now, if you're sitting there thinking, boy, I have absolutely no idea what these folks are actually building, you, my friend, are not alone. Cosmic Chaos writes, good luck. But I'm still not sure what exactly you are building. Is it one product that does all three or separately? Is it a service or a product? And what's your roadmap? William Wolf writes,
Starting point is 00:02:31 I'm rooting for Thinking Machines, but I wish projects like this had product, both engineering and design, in their founding philosophies. Otherwise, it kind of just feels like yet another group of world-class researchers vaguely gesticulating at the future. Where is the vision? Swyx pointed out what he called two notable omissions from the Thinking Machines manifesto. The website does not use the word reasoning or agent at all. So what are these folks building? I have absolutely no idea. It does feel a little bit like the type of text that may be in retrospect when we learn
Starting point is 00:03:01 what they're building like it'll make sense. Right now, I think vaguely gesticulating at the future is a pretty accurate way to describe it. At the end of the day, though, when it comes to things like potential for fundraising, the clarity of the description probably doesn't matter even a little bit. Currently, the 29 or so employees come from places like OpenAI, Meta, Character AI, and Google DeepMind. Barret Zoph, OpenAI's former VP of Post-Training Research, is taking on the CTO role, with OpenAI co-founder John Schulman serving as chief scientist.
Starting point is 00:03:28 And indeed, when it comes to people's interest in the company, it's best summed up by Andrej Karpathy, who writes, very strong team, a large fraction of whom were directly involved with and built the ChatGPT miracle. In other words, while this may be a situation where we don't have any idea what they're actually building, they're probably still worth paying attention to. Next up, on the other end of the startup journey, less than a year after launch,
Starting point is 00:03:49 the Humane Pin is officially dead and gone. Humane announced on Tuesday that their AI wearable startup has been acquired by HP. Customers have been given just 10 days' notice that servers would be shut down, rendering the expensive device useless. In the FAQ, Humane noted the device could still be used for offline features like checking the battery level. So there's something there, I guess.
Starting point is 00:04:10 Now, of course, the Humane Pin was a bold early attempt at creating a wearable AI assistant, but fell flat for a number of reasons, all of which have been endlessly discussed in retrospect. It was originally priced at $699, making it very inaccessible, really only for very high-end gizmo enthusiasts. Initial reviews were universally terrible, the absolute apex of which was Marques Brownlee calling it the worst product I've ever reviewed, a review which has been seen 8.5 million times. Updates also couldn't save the device. At one point last summer, Humane was processing more returns than they had sales. Humane even told customers to stop using the charging case due to battery fire concerns.
Starting point is 00:04:49 As for the buyout, HP said they were acquiring the team and the company's AI operating system to help them create, quote, intelligent ecosystem across all HP devices from AI PCs to smart printers and connected conference rooms. Gonzalo Nunez writes, the Humane founders having to go work on AI for OfficeJet printers at HP is the ultimate Sisyphean punishment for the prototypical Steve Jobs LARPer founder. I cannot imagine anything more cruel. So is there anything to learn from the failure of Humane? Investor Justin Duke doesn't think so, writing,
Starting point is 00:05:18 I don't think we draw many interesting lessons from Humane. They feel like a relic from a younger, more Juicero-drenched era. Even when they were in stealth mode, there was an obvious perfume of vaporware about them. Basically, Duke is arguing that Humane was very much a creature of the 2019-2020 era of VC, when massive checks were flying around Silicon Valley at the very end of ZIRP. Entrepreneur Chris Bakke writes, Humane is the perfect cautionary tale of how talented people get completely distorted from reality by staying at large, successful companies for too long.
Starting point is 00:05:45 Are you really a great product designer, or do you just work at Apple? Are you actually great at sales, or do you just work at Google? Are you really an incredible growth marketer, or do you just work at Instagram? After a certain size, the brands sell themselves. The only way to test your abilities is to leave the shelter of these megabrands and go out and build something yourself from scratch. And usually, throwing lots of money at the problem pre-launch isn't going to help you. Maybe a more pertinent question is what it means about the state of AI wearables in general.
Starting point is 00:06:10 One thing that makes it complicated to determine is the disconnect between when it was launched and how capabilities have changed. The Humane Pin was released in April 2024, a few months before Google released the first version of AI search that suggested eating rocks and using glue as a pizza topping. Now, however, we're at a stage where leading AI models, even small ones, designed for on-device use, are as good at coding as most junior programmers. Although exactly how good they are we'll get into in the main episode. Still, at this point, it's not clear that people actually want an AI assistant in a standalone device.
Starting point is 00:06:38 Newsletter writer Jack Appleby thinks that there's a form factor problem. He writes, the future of AI isn't new hardware, it's upgrading existing software. Control Alt Dwayne writes, the first AI hardware flop. I don't know a single person who bought a Humane AI Pin, but this is brutal. This is exactly why AI hardware will only succeed when it's 100% local with no cloud or API dependencies. I don't know, man. I'm not so sure that the lessons are as clear as people think. People have loved to rip on Humane from the very beginning, and a lot of it is absolutely self-inflicted.
Starting point is 00:07:06 The overwrought marketing videos that felt like they were trying too hard to live in Steve Jobs' shadow, the price point, the amount of money raised: there were plenty of red flags even for someone who is trying to go in unbiased. It is going to be an extraordinary process of trial and error to figure out if and what sort of AI wearable experiences consumers are actually going to want. No one has a perfect crystal ball into that future, otherwise they'd be making a ton of money. I'm glad that there are experiments still happening. I would say that Humane is a great reminder that extraordinarily well-funded startups tend not to be the ones to invent these sorts of new experiences. But at the same time, there are some indicators of AI wearables actually getting some traction.
Starting point is 00:07:46 The best example of that may be the Ray-Ban Meta AI glasses, which are an extremely popular product. So who knows? All we know for sure is that Humane's part of the story is done for now. But I would be very surprised ultimately if that means the category of AI wearables is actually cooked. Anyways, guys, that's going to do it for today's AI Daily Brief. One new beginning, one ending. And next up, the main episode. Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded.
Starting point is 00:08:13 Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001, centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly.
Starting point is 00:08:48 Plus, with automation and AI throughout the platform, Vanta gives you time back, so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time. For a limited time, this audience gets $1,000 off Vanta at vanta.com slash NLW. That's V-A-N-T-A dot com slash N-L-W for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode. That's why Superintelligent
Starting point is 00:09:36 is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business. If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at besuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Hey listeners, are you tasked with the safe deployment and use of trustworthy AI?
Starting point is 00:10:16 KPMG has a first of its kind AI Risk and Controls Guide, which provides a structured approach for organizations to begin identifying AI risks and design controls to mitigate threats. What makes KPMG's AI Risk and Controls Guide different is that it actually outlines practical control considerations to help businesses manage risks and accelerate value. To learn more, go to www.kpmg.us slash AI guide. That's www.kpmg.us slash AI guide. Welcome back to the AI Daily Brief. If you've been anywhere near AI Twitter slash X over the last few weeks, you've probably heard this term vibe coding. It was coined by OpenAI co-founder Andrej Karpathy, who said,
Starting point is 00:11:00 there's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs, e.g. Cursor Composer with Sonnet, are getting too good. Also, I just talk to Composer with SuperWhisper, so I barely even touch the keyboard. I ask for the dumbest things like decrease the padding on the sidebar by half because I'm too lazy to find it. I accept all always. I don't read the diffs anymore. When I get error messages, I just copy paste them in with no comment. Usually that fixes it. The code grows beyond my usual comprehension. I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug, so I just work around it or ask for random changes until it goes away.
Starting point is 00:11:38 It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or a web app, but it's not really coding. I just see stuff, say stuff, run stuff, and copy-paste stuff, and it mostly works. Now, this, as we will discuss, has begotten an entire movement of vibe coders who are thinking about new categories of tools, and it's predicated, as Karpathy points out, on the availability of a particular set of new coding tools that hit that line right between LLMs and agents in terms of how much they're being controlled by humans and how much they're actually doing for themselves. Indeed, I think part of what makes this area so interesting is that it is really at the forefront of agents in practice. It demonstrates on the one hand
Starting point is 00:12:18 how mushy some of this terminology is, but at the same time, how powerful these tools are likely to be in practice. All right, so part of the context for today's show is vibe coding, but then another little bit of background is the conversation we were having yesterday about Grok 3. When Grok 3 launched, it showed off how it had done on a bunch of benchmarks, and I, like many people, found myself basically just having my eyes glaze over when it came to those benchmarks because they're so saturated at this point that it's really hard to actually get signal from them. As Ethan Mollick pointed out, public benchmarks are both meh and saturated, leaving a lot of AI testing to be like food reviews based on taste. If AI is critical to work, we need more.
Starting point is 00:12:55 He also pointed out that a lot of these benchmarks, quote, look nothing like actual work. And given that we spend all of our time over at Superintelligent on the actual deployment and practice of AI and agents at work, this is a particularly poignant problem. It's also not an easy one. Another reminder from just this morning from Ethan, AI is so challenging to figure out because it's genuinely capable of doing PhD-level work in some areas while messing up basic tasks in closely related areas. And the abilities of AI are growing but unevenly. All right, so all of this is background to our main topic today, which is a new benchmark from OpenAI called the SWE-Lancer benchmark. The gist and the question that provoked the whole conversation
Starting point is 00:13:34 was can frontier LLMs earn $1 million from real-world freelance software engineering? Earlier this week, OpenAI released a paper, effectively seeking to test how competent their leading models are in real-world coding applications. This new SWE-Lancer benchmark consists of, quote, over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks ranging from $50 bug fixes to $32,000 feature implementations and managerial tasks where models choose between technical implementation proposals. So why is this important? Well, this gets at exactly what we were just discussing. Until now, coding benchmarks have largely involved competitive
Starting point is 00:14:17 coding problems. These are tests that assess models on tricky programming puzzles, but don't translate directly into practical real-world use cases. On top of their inapplicability to the real world, they're also, as we just mentioned, becoming increasingly saturated, making it difficult to know whether a new model represents a significant improvement or was simply trained to perform well on a known set of questions. This benchmark, then, is much more focused on the real world. And it actually harkens back to an idea that some, like Microsoft's Mustafa Suleyman, had proposed for a new type of Turing test based on how AI interacts with the real world. Back in the middle of 2023, Mustafa Suleyman proposed a Turing test of whether AI could make a million dollars.
Starting point is 00:14:58 Mustafa wrote, I think we're in a moment of genuine confusion or perhaps more charitably debate about what's really happening. Even as the Turing test fails, it doesn't leave us much clearer on where we are with AI or what it can actually achieve. It doesn't tell us what impact these systems will have on society or help us understand how that will play out. His proposal then for a modern Turing test would be to give AI the instruction, go make a million dollars on a retail web platform in a few months with just a $100,000 investment. So this is a little bit different, obviously, than what OpenAI has done, in that OpenAI is specifically giving the model these 1,400 freelance tasks,
Starting point is 00:15:31 rather than asking it to go be creative and figure out how to make that money. But the principle of getting benchmarks into the real world, plus this baselining to a million dollars, obviously are reminiscent. Getting back to SWE-Lancer, for the purposes of this paper, the researchers set three LLMs to the task. They tested OpenAI's GPT-4o and o1, alongside Anthropic's Claude 3.5 Sonnet. Each LLM was driving a basic coding agent capable of directly interacting with a codebase. The models were given one shot to complete each task. Overall, researchers found that,
Starting point is 00:16:02 quote, the results indicate that the real world freelance work in our benchmarks remains challenging for frontier language models. Going even further in the abstract, they write, we find that frontier models are still unable to solve the majority of tasks. Providing a little more clarity on the tasks themselves, they were scraped directly from Upwork and Expensify with no word changes or clarification, giving the models a taste of real-world freelancing work. The models were also denied internet access, including GitHub, ensuring that they were working based solely on their pre-trained dataset. However, they did have access to a snapshot of the codebases they were working on. The results found that none of the models had earned a million dollars as an automated freelancer.
Starting point is 00:16:39 Interestingly, though, despite the fact that this research was from OpenAI, Claude 3.5 Sonnet performed the best, resolving 26% of individual contributor issues and earning $89,000 out of a possible $415,000. For individual contributor tasks, o1 came in second place, earning $78,000, while GPT-4o performed less well, earning $29,000. As interesting as the results, though, was the analysis. The report explained, agents excel at localizing but fail to root cause, resulting in partial or flawed solutions. Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions often
Starting point is 00:17:19 far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit. For the managerial tasks, each model displayed better performance. Claude 3.5 Sonnet was again the best performing model, earning $314,000 of a possible
Starting point is 00:17:45 $585,000, completing 54% of tasks. o1 was hot on its heels, correctly completing 52% of tasks for a total of $302,000. And even GPT-4o, bringing up the rear, still managed 47% of tasks to earn $275,000. This showed that the models were all decent at choosing the right solution when presented with several options, but still have a long way to go until they can fully replace a technical lead. Overall, Claude 3.5 Sonnet won the day, earning $403,000 overall with a 40% completion rate. o1 earned $380,000 while completing 38% of the full set of tasks, and GPT-4o finished 30% of tasks, earning $304,000. Now, to be clear, no money was actually earned, these tasks were all simulated, but that's how much they would have earned had the AI actually
Starting point is 00:18:29 been in charge of that job from Upwork or Expensify. Part of what's so interesting about this, and we'll get to this in a moment in the commentary, is that this absolutely reflects the broad consensus that people have had for some time, which is that Claude 3.5 Sonnet is just far and away the best coding model. We've even talked about how its ubiquity as the coding model created some challenges for Anthropic's economic report, given what a high percentage of Claude's use comes from those coding use cases. Now, in terms of commentary and the response to this so far, a lot of it is focusing on exactly this weird contrast that we've identified. Mihir Patel writes, there's increasingly a difference between academic benchmarks and real-world use cases, how
Starting point is 00:19:07 are o1 and o3 top competitive programmers yet still worse than Sonnet 3.5 on SWE-Lancer and Cursor AI? As always, evals remain hard and messy, and still somehow Sonnet is the best code model. Benjamin De Kraker, who was previously on the team at xAI but fired for saying that Grok 3 wasn't the second coming, noted that it was bold of OpenAI to show that Claude 3.5 Sonnet outperformed o1 on their own benchmark. Synthetica Lab responded, I'm not benchmarking, but in a real project that I'm working on in C++, o1 was basically unusable.
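As a quick sanity check on the simulated payouts quoted above, here is a minimal Python sketch that tallies each model's earnings across the two task types. The dollar figures are the rounded numbers read out in the episode; the dictionary layout and the `summarize` helper are illustrative assumptions, not anything taken from the SWE-Lancer paper or its code.

```python
# Simulated payouts in USD as quoted in the episode (rounded to the nearest $1,000).
# The structure and names below are hypothetical, chosen only for illustration.
IC_POOL = 415_000        # total available for individual contributor tasks
MANAGER_POOL = 585_000   # total available for managerial tasks

EARNINGS = {
    "Claude 3.5 Sonnet": {"ic": 89_000, "manager": 314_000},
    "o1": {"ic": 78_000, "manager": 302_000},
    "GPT-4o": {"ic": 29_000, "manager": 275_000},
}


def summarize(earnings, ic_pool=IC_POOL, manager_pool=MANAGER_POOL):
    """Return per-model totals and share of the full pool, highest earner first."""
    total_pool = ic_pool + manager_pool
    rows = []
    for model, earned in earnings.items():
        total = earned["ic"] + earned["manager"]
        rows.append({"model": model, "total": total, "share": total / total_pool})
    return sorted(rows, key=lambda row: row["total"], reverse=True)


if __name__ == "__main__":
    for row in summarize(EARNINGS):
        print(f"{row['model']:<18} ${row['total']:>9,}  {row['share']:.1%} of pool")
```

Run as a script, this prints each model's combined total and its share of the roughly $1 million pool. Share of dollars earned is not the same thing as share of tasks completed, since individual tasks carry very different payouts.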
Starting point is 00:19:37 Synthetica Lab then went on to share their experience with o1, Claude 3.5, and Grok 3, again pointing out that these benchmarks are really not necessarily useful for understanding how things are going to work in the real world. Another interesting comment came from Henry Shi, the founder of Super.com. He pointed out that in a previous experiment that he had run that was very similar, while they had reached the same conclusion that, quote, frontier models are still unable to solve the majority of tasks, he also wrote, what's interesting and underappreciated in the paper is that o1 is able to solve almost 50% of all IC SWE tasks on the Upwork benchmark.
Starting point is 00:20:08 This makes sense as human freelancers rarely get the solution right on the first try. There's a lot of back and forth and clarification required with the client. If AI agents are able to effectively iterate on a problem, it should be able to drastically improve performance, just like humans with feedback in the workplace as well. In other words, for the sake of this benchmark, these model-powered agents were given a single chance to do it. That's not actually how it would work in the real world.
Starting point is 00:20:31 And so as the user experience and interactive capabilities of agents go up, it's likely that in real-world settings, they'd be able to outperform even where they got during this test. Another thing that some pointed out was the likelihood that this means that OpenAI is actually building an in-production coding agent. Developer Nick Dobos writes, if they took the time to build a benchmark, it means they are building a product to test an agent against it. We haven't talked about this all that much on this show, but I'm fairly certain that in a
Starting point is 00:20:57 world where it's increasingly clear that the underlying models are going to be commoditized and that there's not going to be much moat when it comes to technology, I think OpenAI has a much stronger incentive to own the customer experience end to end, and my guess is that they are looking at agents in just about every key domain of work. Now, going back to this broader idea of vibe coding, I wanted to flag just how big a theme this has gotten to be. Like I said, I think that coding is one of the areas where agents are coming to production and actually being deployed for businesses most quickly.
Starting point is 00:21:25 And I think that this whole idea of vibe coding is really fleshing out the spectrum of code creation from no code all the way to coding agents all the way to traditional coding experiences. a16z recently did a new market map of these types of tools. People like Riley Brown, who's the number one AI creator on TikTok, have gone all in on vibe coding, even working on some tools to improve how people do their vibe coding now. He also shared some interesting thoughts recently about how this might change the structure of the economy. Specifically, he points out that as creators can monetize their audiences with software, rather than things like courses and ads, it creates a very different type of economic opportunity,
Starting point is 00:22:02 one that's starting to be reflected in a new generation of VC creator funds. And speaking of VCs, it's very clear that there is lots of interest in this area. a16z's Andrew Chen tweets, who is building the product that's 100% focused on vibe coding? It needs to have built-in highly primitive Google Slides-level drawing tools, Spotify integration for background music, library of pre-existing app UIs,
Starting point is 00:22:23 so you can, for example, make the sign-up flow the same as XYZ app, explainers on highlighted code diffs, etc., library of graphic assets, integrated logo creator. And Andrew points out all the PMs and ex-PMs like me will have a field day with this. The point being that when we look at coding right now, not only are we talking about disruption to the way that coding happens among traditional software engineers, we're also talking about totally different modalities and an expansion of who gets to actually push code. At the same time, even as all of these people get excited about what they can do that they couldn't do before because they weren't coders, that's not the same as these tools being able to be inserted willy-nilly into enterprise code processes.
Starting point is 00:23:00 And so a lot of the work over the next couple years is going to be to figure out how these experiences diverge and what type of coding agents are good for different settings. Still, it is an absolutely fascinating time, and I am very excited to see what comes next. For now, that is going to do it for today's AI Daily Brief. Appreciate you listening, as always. Until next time, peace.
