The AI Daily Brief: Artificial Intelligence News and Analysis - What AI Coding Agents Can Do Right Now
Episode Date: February 20, 2025
AI coding tools are advancing rapidly, but how effective are they for freelance jobs? OpenAI's new SWE-Lancer benchmark evaluated top AI models on 1,400 software engineering tasks from Upwork. The outcome? Claude 3.5 Sonnet surpassed OpenAI's models, completing more tasks and earning the highest simulated payout. Additionally, "vibe coding" is transforming software development into a more interactive, less technical process.
Brought to you by:
KPMG – Go to www.kpmg.us/ai to learn more about how KPMG can help you drive value with our AI solutions.
Vanta - Simplify compliance - https://vanta.com/nlw
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, OpenAI released a paper, effectively seeking to test how competent
their leading models are in real-world coding applications. Before that in the headlines,
former OpenAI CTO Mira Murati has officially announced her new company, Thinking Machines.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in
AI. To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five
minutes. OpenAI has had a lot of talent departures over the last year and a half or so. In some cases,
it's felt like a protest against the direction the company was going and indeed has explicitly
been shared as such. In others, it's about people making a boatload of money and just wanting
to do something different for a while. And then still in others, it's about building something new
outside of the constraints of that company. And among that set, one of the most closely watched
people has been former CTO Mira Murati. For months now, there have been rumors around what she's
building, mostly fueled by departures and recruitment from OpenAI and Anthropic to join Murati
on some as-yet-unrevealed company. Now, however, that company has been officially announced.
Yesterday, Mira tweeted, I started Thinking Machines Lab alongside a remarkable team of scientists,
engineers, and builders. We're building three things, helping people adapt AI systems to work
for their specific needs, developing strong foundations to build more capable AI systems,
fostering a culture of open science that helps the whole field understand and improve these systems.
Our goal is simple, advance AI by making it broadly useful and understandable through solid
foundations, open science, and practical applications. Alongside it, they published a website
thinkingmachines.ai, where they write, we're building a future where everyone has access to the
knowledge and tools to make AI work for their unique needs and goals. While AI capabilities
have advanced dramatically, key gaps remain. The scientific community's understanding of frontier
AI systems lags behind rapidly advancing capabilities. Knowledge of how these systems are trained
is concentrated within the top research labs, limiting both the public discourse on AI and people's
abilities to use AI effectively. And despite their potential, these systems remain difficult for
people to customize to their specific needs and values. To bridge the gaps, we're building
Thinking Machines Lab to make AI systems more widely understood, customizable, and generally capable.
Now, if you're sitting there thinking, boy, I have absolutely no
idea what these folks are actually building, you, my friend, are not alone. Cosmic Chaos writes,
good luck. But I'm still not sure what exactly you are building. Is it one product that does all
three or separately? Is it a service or a product? And what's your roadmap? William Wolf writes,
I'm rooting for Thinking Machines, but I wish projects like this had product, both engineering
and design, in their founding philosophies. Otherwise, it kind of just feels like yet another group
of world-class researchers vaguely gesticulating at the future. Where is the vision? Swyx pointed out what he
called two notable omissions from the Thinking Machines manifesto.
The website does not use the word reasoning or agent at all.
So what are these folks building?
I have absolutely no idea.
It does feel a little bit like the type of text that, in retrospect, once we learn
what they're building, will make sense.
Right now, I think vaguely gesticulating at the future is a pretty accurate way to describe
it.
At the end of the day, though, when it comes to things like potential for fundraising, the clarity
of the description probably doesn't matter even a little bit.
Currently, the 29 or so employees come from places like OpenAI, Meta, Character AI, and Google DeepMind.
Barret Zoph, OpenAI's former VP of post-training research, is taking on the CTO role,
with OpenAI co-founder John Schulman serving as chief scientist.
And indeed, when it comes to people's interest in the company,
it's best summed up by Andrej Karpathy, who writes,
very strong team, a large fraction of whom were directly involved with and built the
ChatGPT miracle.
In other words, while this may be a situation where we don't have any idea what they're
actually building, they're probably still worth paying attention to.
Next up, on the other end of the startup journey,
less than a year after launch,
the Humane Pin is officially dead and gone.
Humane announced on Tuesday that their AI wearable startup
has been acquired by HP.
Customers have been given just 10 days notice
that servers would be shut down rendering the expensive device useless.
In the FAQ, Humane noted the device could still be used
for offline features like checking the battery level.
So there's something there, I guess.
Now, of course, the Humane Pin was a bold early attempt
at creating a wearable AI assistant,
but fell flat for a number of reasons, all of which have been endlessly discussed in retrospect.
It was originally priced at $699, making it very inaccessible, really only for very high-end gizmo enthusiasts.
Initial reviews were universally terrible, the absolute apex of which was Marques Brownlee
calling it the worst product I've ever reviewed, a review which has been seen 8.5 million times.
Updates also couldn't save the device. At one point last summer, Humane was processing more returns than they had sales.
Humane even told customers to stop using the charging case due to battery fire concerns.
As for the buyout, HP said they were acquiring the team and the company's AI operating system
to help them create, quote, an intelligent ecosystem across all HP devices from AI PCs to smart printers
and connected conference rooms. Gonzalo Nunez writes,
The Humane founders having to go work on AI for OfficeJet printers at HP is the ultimate
Sisyphean punishment for the prototypical Steve Jobs LARPer founder.
I cannot imagine anything more cruel.
So is there anything to learn from the failure of Humane?
Investor Justin Duke doesn't think so, writing,
I don't think we draw many interesting lessons from Humane.
They feel like a relic from a younger, more Juicero-drenched era.
Even when they were in stealth mode, there was an obvious perfume of vaporware about them.
Basically, Duke is arguing that Humane was very much a creature of the 2019-2020
era of VC, when massive checks were flying around Silicon Valley at the very end of ZIRP.
Entrepreneur Chris Bakke writes,
Humane is the perfect cautionary tale of how talented people get completely distorted
from reality by staying at large, successful companies for too long.
Are you really a great product designer, or do you just work at Apple?
Are you actually great at sales, or do you just work at Google?
Are you really an incredible growth marketer, or do you just work at Instagram?
After a certain size, the brands sell themselves.
The only way to test your abilities is to leave the shelter of these megabrands and go out
and build something yourself from scratch.
And usually, throwing lots of money at the problem pre-launch isn't going to help you.
Maybe a more pertinent question is what it means about the state of AI wearables in general.
One thing that makes it complicated to determine is the disconnect between when it was launched
and how capabilities have changed.
The Humane Pin was released in April 2024, a few months before Google released the first version
of AI search that suggested eating rocks and using glue as a pizza topping.
Now, however, we're at a stage where leading AI models, even small ones, designed for on-device use,
are as good at coding as most junior programmers.
Although exactly how good they are we'll get into in the main episode.
Still, at this point, it's not clear that people actually want an AI assistant in a standalone device.
Newsletter writer Jack Appleby thinks that there's a form factor problem.
He writes, the future of AI isn't new hardware, it's upgrading existing software.
Control Alt Dwayne writes, the first AI hardware flop.
I don't know a single person who bought a Humane AI pin, but this is brutal.
This is exactly why AI hardware will only succeed when it's 100% local with no cloud or API dependencies.
I don't know, man.
I'm not so sure that the lessons are as clear as people think.
People have loved to rip on Humane from the very beginning, and a lot of it is absolutely self-inflicted:
the overwrought marketing videos that felt like they were trying too hard to live in Steve Jobs' shadow,
the price point, the amount of money raised. There were plenty of red flags for even someone who
is trying to go in unbiased. It is going to be an extraordinary process of trial and error to
figure out if and what sort of AI wearable experiences consumers are actually going to want.
No one has a perfect crystal ball into that future, otherwise they'd be making a ton of money.
I'm glad that there are experiments still happening. I would say that Humane is a great reminder
that extraordinarily well-funded startups tend not to be the ones to invent these sort of new experiences.
But at the same time, there are some indicators of AI wearables actually getting some traction.
Best example of that may be the Ray-Ban Meta AI glasses, which are an extremely popular product.
So who knows? All we know for sure is that Humane's part of the story is done for now.
But I would be very surprised ultimately if that means the category of AI wearables is actually cooked.
Anyways, guys, that's going to do it for today's AI Daily Brief.
One new beginning, one ending.
And next up, the main episode.
Today's episode is brought to you by Vanta.
Trust isn't just earned, it's demanded.
Whether you're a startup founder navigating your first audit
or a seasoned security professional scaling your GRC program,
proving your commitment to security has never been more critical or more complex.
That's where Vanta comes in.
Businesses use Vanta to establish trust by automating compliance needs
across over 35 frameworks like SOC 2 and ISO 27001,
centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk.
Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly.
Plus, with automation and AI throughout the platform, Vanta gives you time back, so you can focus on building your company.
Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time.
For a limited time, this audience gets $1,000 off Vanta at vanta.com slash NLW.
That's V-A-N-T-A dot com slash N-L-W for $1,000 off.
If there is one thing that's clear about AI in 2025, it's that the agents are coming.
Vertical agents by industry, horizontal agent platforms, agents per function.
If you are running a large enterprise, you will be experimenting with agents next
year. And given how new this is, all of us are going to be back in pilot mode. That's why Superintelligent
is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit.
Over the course of a couple quick weeks, we dig in with your team to understand what type of agents
make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately
come away with a set of actionable recommendations that get you prepared to figure out how agents
can transform your business. If you are interested in the agent readiness and opportunity audit,
reach out directly to me, nlw at besuper.ai. Put the word agent in the subject line so I know
what you're talking about. And let's have you be a leader in the most dynamic part of the
AI market. Hey listeners, are you tasked with the safe deployment and use of trustworthy AI?
KPMG has a first of its kind AI Risk and Controls Guide, which provides a structured approach
for organizations to begin identifying AI risks and design controls to mitigate threats.
What makes KPMG's AI Risk and Controls Guide different is that it actually
outlines practical control considerations to help businesses manage risks and accelerate value.
To learn more, go to www.kpmg.us slash AI guide. That's www.kpmg.us
slash AI guide. Welcome back to the AI Daily Brief. If you've been anywhere near AI Twitter
slash X over the last few weeks, you've probably heard this term vibe coding. It was coined by OpenAI
co-founder Andrej Karpathy, who
said, there's a new kind of coding I call vibe coding, where you fully give in to the vibes,
embrace exponentials, and forget that the code even exists. It's possible because the LLMs,
e.g., Cursor Composer with Sonnet, are getting too good. Also, I just talk to Composer with Super
Whisper, so I barely even touch the keyboard. I ask for the dumbest things like decrease the padding
on the sidebar by half because I'm too lazy to find it. I accept all always. I don't read the
diffs anymore. When I get error messages, I just copy paste them in with no comment. Usually that fixes
it. The code grows beyond my usual comprehension. I'd have to really read through it for a while.
Sometimes the LLMs can't fix a bug, so I just work around it or ask for random changes until it goes away.
It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project
or a web app, but it's not really coding. I just see stuff, say stuff, run stuff, and copy-paste stuff,
and it mostly works. Now, this, as we will discuss, has begotten an entire movement of vibe coders
who are thinking about new categories of tools, and it's predicated, as Karpathy
points out, on the availability of a particular set of new coding tools that hit that line right
between LLMs and agents in terms of how much they're being controlled by humans and how much
they're actually doing for themselves. Indeed, I think part of what makes this area so interesting
is that it is really at the forefront of agents in practice. It demonstrates on the one hand
how mushy some of this terminology is, but at the same time, how powerful these tools are likely to be
in practice. All right, so part of the context for today's show is vibe coding, but then another
little bit of background is the conversation we were having yesterday about Grok 3.
When Grok 3 launched, it showed off how it had done on a bunch of benchmarks, and I, like
many people, found myself basically just having my eyes glaze over when it came to those benchmarks
because they're so saturated at this point that it's really hard to actually get signal from
them. As Ethan Mollick pointed out, public benchmarks are both meh and saturated, leaving a lot
of AI testing to be like food reviews based on taste. If AI is critical to work, we need more.
He also pointed out that a lot of these benchmarks, quote, look nothing
like actual work. And given that we spend all of our time over at Superintelligent on the actual
deployment and practice of AI and agents at work, this is a particularly poignant problem.
It's also not an easy one. Another reminder from just this morning from Ethan, AI is so challenging
to figure out because it's genuinely capable of doing PhD-level work in some areas while messing
up basic tasks in closely related areas. And the abilities of AI are growing but unevenly.
All right, so all of this is background to our main topic today, which is a new benchmark from
OpenAI called the SWE-Lancer benchmark. The gist and the question that provoked the whole conversation
was can frontier LLMs earn $1 million from real-world freelance software engineering?
Earlier this week, OpenAI released a paper, effectively seeking to test how competent
their leading models are in real-world coding applications. This new SWE-Lancer benchmark
consists of, quote, over 1,400 freelance software engineering tasks from Upwork, valued at $1 million
in USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks
ranging from $50 bug fixes to $32,000 feature implementations and managerial tasks where models
choose between technical implementation proposals. So why is this important? Well, this gets at exactly
what we were just discussing. Until now, coding benchmarks have largely involved competitive
coding problems. These are tests that assess models on tricky programming puzzles, but don't translate
directly into practical real-world use cases. On top of their inapplicability to the real world,
they're also, as we just mentioned, becoming increasingly saturated, making it difficult to know
whether a new model represents a significant improvement or was simply trained to perform well on a
known set of questions. This benchmark, then, is much more focused on the real world.
And it actually harkens back to an idea that some, like Microsoft's Mustafa Suleyman, had proposed
for a new type of Turing test based on how AI interacts with the real world. Back in the middle of
2023, Mustafa Suleyman proposed a Turing test of whether AI could make a million dollars.
Mustafa wrote, I think we're in a moment of genuine confusion or perhaps more charitably debate about
what's really happening. Even as the Turing test fails, it doesn't leave us much clearer on where we
are with AI or what it can actually achieve. It doesn't tell us what impact these systems will
have on society or help us understand how that will play out. His proposal then for a modern Turing
test would be to give AI the instruction, go make a million dollars on a retail web platform in a few months
with just a $100,000 investment.
So this is a little bit different, obviously, than what OpenAI had done,
in that OpenAI is specifically giving the model these 1,400 freelance tasks,
rather than asking it to go be creative and figure out how to make that money.
But the principle of getting benchmarks into the real world,
plus this baselining to a million dollars, obviously are reminiscent.
Getting back to SWE-Lancer, for the purposes of this paper,
the researchers set three LLMs to the task.
They tested OpenAI's GPT-4o and o1, alongside Anthropic's
Claude 3.5 Sonnet. Each LLM was driving a basic coding agent capable of directly interacting with a
codebase. The models were given one shot to complete each task. Overall, researchers found that,
quote, the results indicate that the real world freelance work in our benchmarks remains challenging
for frontier language models. Going even farther in the abstract, they write, we find that frontier
models are still unable to solve the majority of tasks. Providing a little more clarity on the tasks
themselves, they were scraped directly from Upwork and Expensify with no word changes or clarification,
giving the models a taste of real-world freelancing work. The models were also denied internet access
including GitHub, ensuring that they were working based solely on their pre-trained dataset.
However, they did have access to a snapshot of the codebases they were working on.
The results found that none of the models had earned a million dollars as an automated freelancer.
Interestingly, though, despite the fact that this research was from OpenAI,
Claude 3.5 Sonnet performed the best, resolving
26% of individual contributor issues and earning $89,000 out of a possible $415,000.
For individual contributor tasks, o1 came in second place, earning $78,000, while GPT-4o performed
less well, earning $29,000. As interesting as the results, though, was the analysis.
The report explained, agents excel at localizing but fail to root cause, resulting in partial
or flawed solutions. Agents pinpoint the source of an issue remarkably quickly, using keyword
searches across the whole repository to quickly locate the relevant file and functions often
far faster than a human would.
However, they often exhibit a limited understanding of how the issue spans multiple components
or files and fail to address the root cause, leading to solutions that are incorrect or
insufficiently comprehensive.
We rarely find cases where the agent aims to reproduce the issue or fails due to not
finding the right file or location to edit.
For the managerial tasks, each model displayed better performance.
Claude 3.5 Sonnet was again the best-performing model, earning $314,000 of a possible
$585,000, completing 54% of tasks. o1 was hot on its heels, correctly completing 52% of
tasks for a total of $302,000. And even GPT-4o, bringing up the rear, still managed 47% of tasks
to earn $275,000. This showed that the models were all decent at choosing the right solution
when presented with several options, but still have a long way to go until they can
fully replace a technical lead. Overall, Claude 3.5 Sonnet won the day, earning $403,000 overall
with a 40% completion rate. o1 earned $380,000 while completing 38% of the full set of tasks,
and GPT-4o finished 30% of tasks, earning $304,000. Now, to be clear, no money was actually earned,
these tasks were all simulated, but that's how much they would have earned had the AI actually
been in charge of that job from Upwork or Expensify. Part of what's so interesting about this,
and we'll get to this in a moment in the commentary, is that this absolutely reflects the broad
consensus that people have had for some time, which is that Claude 3.5 Sonnet is just by far and away
the best coding model. We've even talked about how its ubiquity as the coding model created some
challenges for Anthropic's economic report, given what a high percentage of Claude's use comes from
those coding use cases. Now, in terms of commentary and the response to this so far, a lot of it
is focusing on exactly this weird contrast that we've identified. Mihir Patel writes,
there's increasingly a difference between academic benchmarks and real-world use cases. How
are o1 and o3 top competitive programmers yet still worse than Sonnet 3.5 on SWE-Lancer and Cursor AI?
As always, evals remain hard and messy, and still somehow Sonnet is the best code model.
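For anyone who wants to sanity-check the payout numbers quoted in this episode, they can be tallied in a few lines of Python. This is only a rough sketch: the dollar amounts are the rounded figures read out above, in thousands of USD, and the names POSSIBLE, EARNED, and summarize are made up for illustration and do not come from the SWE-Lancer paper.

```python
# Simulated SWE-Lancer payouts in thousands of USD, as quoted in this episode.
# These are rounded spoken figures, not exact values from the paper.
POSSIBLE = {"ic": 415, "manager": 585}
EARNED = {
    "Claude 3.5 Sonnet": {"ic": 89, "manager": 314},
    "o1": {"ic": 78, "manager": 302},
    "GPT-4o": {"ic": 29, "manager": 275},
}


def summarize(earned, possible):
    """Return {model: (total_earned, share_of_total_possible)}."""
    total_possible = sum(possible.values())
    summary = {}
    for model, by_task_type in earned.items():
        total = sum(by_task_type.values())
        summary[model] = (total, total / total_possible)
    return summary


SUMMARY = summarize(EARNED, POSSIBLE)
TOTAL_POSSIBLE = sum(POSSIBLE.values())

for model, (total, share) in SUMMARY.items():
    print(f"{model}: ${total}k of ${TOTAL_POSSIBLE}k ({share:.1%})")
```

The per-category numbers add up to the overall totals mentioned in the episode, and the two pools together come to the $1 million headline figure. Note that the share computed here is the fraction of dollars earned, which is not necessarily the same thing as the fraction of tasks completed.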
Benjamin De Kraker, who was previously on the team at xAI but fired for saying that Grok 3
wasn't the second coming, noted that it was bold of OpenAI to show that Claude 3.5 Sonnet
outperformed o1 on their own benchmark.
Synthetica Lab responded, I'm not benchmarking, but in a real project that I'm working on
in C++.
o1 was basically unusable.
They then went on to share their experience with o1, Claude 3.5, and Grok 3, again pointing out
that these benchmarks are really not necessarily useful for understanding how things are going to
work in the real world.
Another interesting comment came from Henry Shi, the founder of Super.com.
He pointed out that in a previous experiment that he had run that was very similar, while
they had reached the same conclusion that, quote, frontier models are still unable to solve
the majority of tasks, he also wrote, what's interesting and underappreciated in the paper
is that o1 is able to solve almost 50% of all IC SWE tasks on the Upwork benchmark.
This makes sense as human freelancers rarely get the solution right on the first trial.
There's a lot of back and forth and clarification required with the client.
If AI agents are able to effectively iterate on a problem,
it should be able to drastically improve performance,
just like humans and feedback in the workplace as well.
In other words, for the sake of this benchmark,
these model-powered agents were given a single chance to do it.
That's not actually how it would work in the real world.
And so as the user experience and interactive capabilities of agents go up,
it's likely that in real-world settings,
they'd be able to even outperform where they got during this test.
Another thing that some pointed out was the likelihood that this means that OpenAI is actually
building an in-production coding agent.
Developer Nick Dobos writes, if they took the time to build a benchmark, it means they are
building a product to test an agent against it.
We haven't talked about this all that much on this show, but I'm fairly certain that in a
world where it's increasingly clear that the underlying models are going to be commoditized
and that there's not going to be much moat when it comes to technology, I think OpenAI
has a much stronger incentive to own the customer experience end to end,
and my guess is that they are looking at agents in just about every key domain of work.
Now, going back to this broader idea of vibe coding,
I wanted to flag just how big a theme this has gotten to be.
Like I said, I think that coding is one of the areas where agents are coming to production
and actually being deployed for businesses most quickly.
And I think that this whole idea of vibe coding is really fleshing out the spectrum of code creation
from no code all the way to coding agents all the way to traditional coding experiences.
a16z recently did a new market map of these types of tools.
People like Riley Brown, who's the number one AI creator on TikTok, have gone all in on vibe coding,
even working on some tools to improve how people do their vibe coding now.
He also shared some interesting thoughts recently about how this might change the structure of the economy.
Specifically, he points out that as creators can monetize their audiences with software,
rather than things like courses and ads, it creates a very different type of economic opportunity,
one that's starting to be reflected in a new generation of VC creator funds.
And speaking of VCs, it's very clear
that there is lots of interest in this area.
a16z's Andrew Chen tweets,
Who is building the product that's 100% focused on vibe coding?
It needs to have built-in highly primitive G-slides-level drawing tools,
Spotify integration for background music,
library of pre-existing app UIs,
so you can, for example, make the sign-up flow the same as XYZ app,
explainers on highlighted code diffs, etc, library of graphic assets,
integrated logo creator.
And Andrew points out all the PMs and ex-PMs like me will have a field day with this.
The point is that when we look at coding right now, not only are we talking about disruption to the way that coding happens among traditional software engineers.
We're also talking about totally different modalities and an expansion of who gets to actually push code.
At the same time, even as all of these people get excited about what they can do that they couldn't do before because they weren't coders,
that's not the same as these tools being able to be inserted willy-nilly into enterprise code processes.
And so a lot of the work over the next couple years is going to be to figure out how these experiences diverge and what type of coding agents are good for different settings.
Still, it is an absolutely fascinating time, and I am very excited to see what comes next.
For now, that is going to do it for today's AI Daily Brief.
Appreciate you listening, as always.
Until next time, peace.
