The AI Daily Brief: Artificial Intelligence News and Analysis - Agent Pilots Nearly Doubled Last Quarter

Starting point is 00:00:00 Today on the AI Daily Brief, enterprises are screaming towards agents, and before that in the headlines, do reasoning models hallucinate more? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Reasoning models have obviously taken over a huge number of use cases with AI. However, one downside is that hallucinations appear to be getting worse as reasoning models scale up. In their technical report on their new O3 and O4 Mini models, OpenAI disclosed that they both underperformed O1 on the PersonQA evaluation. That's an in-house test

Starting point is 00:00:50 that queries the models against publicly available facts and is designed explicitly to elicit hallucinations. O1 hallucinated 16% of the time. O3 roughly doubled that, hallucinating 33% of the time, while O4 Mini hallucinated on nearly half of its answers at 48%. OpenAI wrote that O4 Mini's results were expected, as, quote, smaller models have less world knowledge and tend to hallucinate more. However, when it came to O3's performance, they said, quote, O3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate and hallucinated claims. In essence, they're saying the longer a model thinks, the more opportunity it has to trip up and hallucinate in its reasoning. OpenAI's conclusion was that more research is needed to

Starting point is 00:01:30 understand the cause of this. And as we'll discuss in today's main episode, the more that these models move into high production, high-value use cases inside the enterprise, the more that hallucinations become a big concern. One promising counter signal, however, is that access to web search seems to mitigate hallucinations, so it's not as though there aren't ways to mitigate this. In the meantime, this is a real issue. Developer Patrick Bade wrote, this might sound harsh, but O3 is unusable for low-level coding at the moment. It spits out ridiculous code snippets full of hallucinations and wrong assumptions. There is no doubt O3 is excellent in making plans and analyzing high-level stuff, but it's downright terrible at implementing logic. Speaking of O3, independent benchmarkers

Starting point is 00:02:12 have been unable to match OpenAI's scores for O3. After getting a hold of the new model, Epoch AI tried to match OpenAI's results. Specifically, they attempted to put the model through its paces in the ultra-hard FrontierMath benchmark. Until O3, no model had achieved more than a 2% result. OpenAI claimed that O3 could achieve 25% correct, but Epoch AI only managed a 10% result, which is, of course, still much better than any other model on the market, but a far cry from that 25%. Epoch wrote, the difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because these results were run on a different subset of FrontierMath.

Starting point is 00:02:52 One thing that OpenAI did say is that the O3 that is now available in production is, quote, more optimized for real-world use cases and speed, as opposed to the version of O3 that was demoed in December. Technical staff member Wenda Zhou said, we've done optimizations to make the model more cost-efficient and more useful in general. To me, it's another great reminder that we should only view benchmarks as a very small part of how we view a model, and that ultimately the proof is in the pudding with the actual tasks and use cases that you give it. On that front, I am still finding O3 a leaps-and-bounds improvement over previous models. Over in the world of AI coders, Figma appears to be latching onto the vibe coding trend with an AI app maker. The feature accepts

Starting point is 00:03:30 text prompts, Figma files, and images as input to return fully functional apps. This new no-code tool is powered by Anthropic's Claude 3.7 Sonnet, and while some suggested that Figma is scrambling for feature parity with Canva, who introduced their own vibe coding tool two weeks ago, this kind of seems like a sign of the times. Simple AI prototyping is such a powerful tool that it is rapidly becoming a must-have feature for every single design platform, which is not to say that vibe coding platforms don't have their challenges.

Starting point is 00:03:57 Cursor has apologized after an AI support agent went rogue and invented a new policy. Last week, users began reporting that they were being logged out of sessions when switching between devices. Many emailed Cursor's customer support to ask whether this was the intention, leading to an employee identified as Sam saying that this was expected behavior under a new policy, adding, quote, Cursor is designed to work with one device per subscription as a core security feature. The issue blew up on Reddit, with one user commenting that multi-device workflows are table stakes for devs. There were dozens of angry responses from programmers stating that they were canceling their subscriptions.

Starting point is 00:04:30 As it turned out, however, Sam was actually an AI support agent and there was no such policy in place. Cursor co-founder Michael Truell responded in the thread: We have no such policy. You're of course free to use Cursor on multiple machines. Unfortunately, this is an incorrect response from a frontline AI support bot. We did roll out a change to improve the security of sessions, and we're investigating to see if it caused any problems with session invalidation. And so again, you have here an example of how hallucinating can actually be problematic in practice, which I think is ultimately a perfect segue to our main episode, where we discuss some updates in how enterprises are using AI.

Starting point is 00:05:07 Today's episode is brought to you by Superintelligent and more specifically Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the Agent Readiness Audit is that this is a system that we've created to help you benchmark and map opportunities in your organization where agents could specifically help you solve your problems and create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based agent interview where we work with some number of your leadership and employees to map what's going on

Starting point is 00:05:43 inside the organization and to figure out where you are in your agent journey. That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent@besuper.ai, and let's get you plugged into the agentic era. Today's episode is brought to you by Vanta. Vanta is a trust management platform that helps businesses automate security and compliance, enabling them to demonstrate strong security practices and scale. In today's business

Starting point is 00:06:22 landscape, businesses can't just claim security, they have to prove it. Achieving compliance with a framework like SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate strong security practices. And we see how much this matters every time we connect enterprises with agent services providers at Superintelligent. Many of these compliance frameworks are simply not negotiable for enterprises. The problem is that navigating security and compliance is time-consuming and complicated. It can take months of work and use up valuable time and resources. Vanta makes it easy and faster by automating compliance across 35 plus frameworks. It gets you audit ready in weeks instead of months and saves you up to 85% of associated costs. In fact, a recent IDC white paper found that Vanta customers

Starting point is 00:07:04 achieve $535,000 per year in benefits, and the platform pays for itself in just three months. The proof is in the numbers. More than 10,000 global companies trust Vanta, including Atlassian, Quora, and more. For a limited time, listeners get $1,000 off at vanta.com slash NLW. That's v-a-n-t-a.com for $1,000 off. Hey listeners, are you tasked with the safe deployment and use of trustworthy AI? KPMG has a first-of-its-kind AI Risk and Controls Guide, which provides a structured approach for organizations to begin identifying AI risks and design controls to mitigate threats. What makes KPMG's AI Risk and Controls Guide different is that it outlines practical control considerations to help businesses manage risks and accelerate value.

Starting point is 00:07:51 To learn more, go to www.kpmg.us slash AI guide. That's www.kpmg.us slash AI guide. Welcome back to the AI Daily Brief. Today we are once again looking at some research around AI usage in the enterprise. And unlike some of the surveys that I've been complaining about recently, this data is actually pretty contemporary. So this, of course, is KPMG's quarterly AI Pulse survey. It's a longitudinal survey of about 130 executives at companies with a billion dollars or more in revenue. So this represents sort of the upper echelon and upper size range of

Starting point is 00:08:29 big enterprises. And again, this is current data. They do this every quarter. So we're dealing with actual recency of information. And the story here is very clear. The results here show organizations that are moving from theory to practice in an increasingly urgent way, with the concerns and priorities following along. So to kick us off, let's talk about the first clear theme here, which is growth in the investment around and an increase in the usage of AI. In Q4, executives thought that they would spend $89 million on Gen AI over the next year, while by Q1, that number had gone up to $114 million. Even more dramatic are some of the increases in actual usage. Weekly usage of knowledge assistants jumped from 48% in Q4 to 61

Starting point is 00:09:11 percent in Q1. Gen AI usage embedded into existing workflows went from 24% to 35%. But the big one is daily usage of AI productivity tools, which significantly more than doubled, from 22% to 58%. And to me, this does not read like some statistical outlier, even though that increase is fairly dramatic. I think that those of us, probably most of you who are listening, who have used these tools, find them very quickly making their way into your daily habits. I actually think that that 58% number just reflects the data catching up to how AI adoption is actually happening in practice. And I would expect to see it do nothing but increase in the coming quarters. And so pursuant to all of this increase in usage,

Starting point is 00:09:51 I think that the concerns that people have around AI are changing as well. When it came to challenges with AI, the number one concern shifted from misuse of AI by bad actors to accuracy and fairness of AI outputs. The concern around accuracy of AI outputs jumped from 20% to 32%, and the concern around misuse of AI by bad actors went down from 50% to just 30%. A new category of concern, over-regulation stifling innovation, jumped massively from just 2% in Q4 of last year to 17% in this study.

Starting point is 00:10:22 So what are these numbers telling us? Well, frankly, these things all seem to me like a better calibration if we assume that people are actually using AI more. Why does a concern around accuracy increase the more that you use AI? Because the use cases that we're getting into are increasingly important. The more that we're trusting AI to do things that are mission critical, the more that we need it to be accurate. Meanwhile, why would the concern around misuse go down? Well, misuse of AI by bad actors is a theoretical, outside-the-organization concern rather than something that would be an issue day to day.

Starting point is 00:10:52 And so it doesn't surprise me once again, the more that we're actually using AI, to see that comparatively decrease and make way for concerns about things that are actually relevant for your business in the here and now. Likewise, with this concern around over-regulation stifling innovation, I think that increasingly organizations are making plans based on the rate of change, which has now been fairly consistent. Especially as agentic capabilities increase, people are starting to build plans around assuming updated capabilities in the months and years to come, and so exogenous factors that could slow those developments down, once again, become more important.

Starting point is 00:11:26 Expectations of challenges once again reflect a very practical, we're-actually-doing-this-thing sort of point of view. 82% of the leaders surveyed said that they expected risk management, such as data privacy and cybersecurity, to be their biggest challenge for Gen AI strategy this year. Organizational data quality was second with 64%. Now, one of the areas with the most dramatic increase was the percentage of organizations piloting AI agents. Now, to KPMG, the pilot phase comes after the experimentation phase, so pilot is perhaps a little bit more high intent than the word might suggest in other contexts. The percentage of organizations piloting AI agents

Starting point is 00:12:02 jumped from 37% in the last quarter of last year to 65% in the first quarter of this year, nearly doubling. Now, when it comes to full deployments, the number remained flat at just 11% in both of these quarters. But still, overall, that represents a dramatic move down the funnel towards full deployment. And again, that shows up in the intent. Ninety-nine percent of organizations surveyed said that they plan to deploy agents, suggesting to me that 1% of organizations misread the question. Interestingly, the buy-build spread kind of got back to something that we might expect, with two-thirds planning to buy a pre-built agent and a little less than a third, at 27%, planning on a combination of buying and building.

Starting point is 00:12:41 We had seen some research towards the end of last year from Menlo's enterprise study that found a dramatic shift in behavior between 2023 and 2024. In 2023, the divide between buy and build behavior was something like 80-20, whereas in 2024, it was very nearly 50-50. I think it was 53% buy versus 47% build, and I had speculated at the time that that probably reflected that organizations were discovering verticalized use cases that didn't really have a great startup yet, and that probably there would be a bit of a boomerang back as more startups focused on those specific niche areas came online, although I will also note that when it comes to agents, the build-buy hierarchy is a little bit misleading given how much customization

Starting point is 00:13:20 there is and given how blurry those lines actually are. Still, overall, the story is that AI agents are coming online, they're coming online fast, and that while full deployments are not here in general, we are cascading in that direction. Alongside that shift, the priorities of organizations are changing consequently as well. KPMG writes, when asked what risk mitigation measures are being considered when it comes to AI agents in the next 6 to 12 months, 63% of leaders plan to deploy AI agents developed by trusted tech partners,

Starting point is 00:13:50 up from 23% in Q4, while another 52% are not allowing AI agents to access sensitive data without human oversight, up from 31% in Q4. Point being that in a world where more and more highly sensitive, highly important functions are being trusted to agents, and frankly in a world where software and capabilities are being increasingly commoditized, trust is something that is distinctly not commoditized and begins to matter even more than before. Finally, it's pretty clear that organizations are still in the mindset of augmentation of human labor

Starting point is 00:14:20 rather than replacement of human labor. Fifty-seven percent surveyed said that they thought that AI will help low performers become stronger performers. Sixty-nine percent said that they thought it would help strong performers focus on more strategic work, i.e. take away tasks that are important but rote. And 76% relatedly said that they thought that AI would automate specific tasks but would not replace roles entirely. I tend to think that there's going to be more and more pressure to actually think in terms of job displacement, particularly if economic instability continues to increase. And so I'm very glad to see that organizations have such a strong starting point of this vision of AI augmenting what they do rather than replacing their existing teams.

Starting point is 00:14:59 So, summing up, like I said, I really think that this tells the story of enterprise AI usage that is maturing, moving farther from experimentation and towards deployment, where agents and use cases are increasing in significance, and both the concerns and priorities are changing because of that as well. I will include a link to this study so you can dig in more. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening, as always, and until next time, peace.
