The AI Daily Brief: Artificial Intelligence News and Analysis - No, Apple's New AI Paper Doesn't Undermine Reasoning Models

Starting point is 00:00:00 Today on the AI Daily Brief, a look at Apple's non-existent AI strategy at WWDC, plus a deep dive on a very controversial paper from the Cupertino company that I think you can, in most cases, safely ignore, but which is probably still worth talking about anyways. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements as always. First of all, thank you to today's sponsors, KPMG, Blitzy.com, Vanta, and Superintelligent. As always, if you are looking for an ad-free version of the show, you can get it for just $3 over at patreon.com slash AI Daily Brief. Also, I am traveling this week, which always means that

Starting point is 00:00:43 there might be a little bit of variability in the show format. Obviously, you got an interview yesterday. And today, we have a full episode dedicated to the main. There are a bunch of important headlines, though, so we will most certainly be coming back to our normal format tomorrow. For now, though, let's talk about WWDC and The Illusion of Thinking. Welcome back to the AI Daily Brief. And today, of course, we are talking about Apple. First, we're going to talk about their non-existent AI at WWDC, but then we're going to spend more time on this paper that everyone is talking about, The Illusion of Thinking. You can probably tell from my title how I feel about it, but that is for just a minute from now. First of all, however,

Starting point is 00:01:23 let's talk about WWDC yesterday. Now, you might remember that last year, Apple finally came out of the and shared an AI strategy for the first time since the launch of ChatGPT. It was, of course, Apple Intelligence, because Apple had to brand its own thing. And the idea of it was, in short, to provide regular everyday users with the use cases that actually would matter to them. AI that wasn't big and techy and burdensome, but was just useful. The principle of it was good. It felt just like Apple. The problem has been in execution. None of the solutions they were talking about were really ready. Siri was an absolute disgrace. And basically, Apple has pushed nothing of note on Apple Intelligence, which just falls farther and farther behind. Now, expectations were already

Starting point is 00:02:09 on the floor heading into this event when it came to AI specifically, because it basically seemed like they were going to forgo the topic entirely. And indeed, that's exactly what we got. There were no big announcements like we've seen in previous years. AI Siri was completely absent from the conference. There were some minor feature updates and a new image model, but nothing really compelling was unveiled. We did, I guess, get a new numbering system for iOS models, and we got a graphical redesign of iOS that has just been universally maligned for being confusing and weird and not really clearly having any particular purpose. Reports were pretty grim out on the conference floor. Linus Ekenstam tweeted, Apple has clearly missed the mark for far too many times now. I felt

Starting point is 00:02:50 today was yet another one of these occurrences. Sadly, Apple is trying hard to do too much. There's too much fat. They need to trim it and back to basics. Apple desperately needs to reinvent itself or become the new Nokia. During the first 40 minutes, there was nothing that made me feel wow. Actually, there was one thing after another leaving me with way more questions than answers. Genmoji, backgrounds and group messages, visual intelligence, Apple games? And what is up with the new unified design language? The glass UI is a UX nightmare. Visual after visual in the presentation is worse than the previous. Apple needs to go back to its roots. Make a really good operating system, make really good scaffolding for others to make the and stuff that lives on the device.

Starting point is 00:03:28 I'm completely underwhelmed. Apple needs a step change to their entire existence if things are going to turn around. Sure, I'm typing this on an Apple device because there are not a lot of options out there, but clearly this WWDC might go down as the most boring one ever. Now, Apple watcher, Bloomberg's Mark Gurman,

Starting point is 00:03:44 was a little more charitable. He said, excellent WWDC, cohesive story, deep integration, and continuity across the devices, zero false promises, impressive new UI, and significant new productivity features on the Mac and iPad. But the lack of any real new AI features, despite that being my expectation, is startling. Azeem Azhar said, can it really be excellent without an AI feature? And also, as I mentioned, clearly Gurman is in the minority when it comes to his thoughts,

Starting point is 00:04:10 for example, on the new UI. Even investors who aren't as plugged into the tech scene are starting to see Apple's AI strategy as what it is, a crisis. Andrew Choi, a portfolio manager at Parnassus Investments, commented, it's hard to argue that Apple's lack of standing with AI isn't an existential risk. If it can paint a future where it's integrating and commoditizing AI, that would be compelling because otherwise, what is going to get people to buy their next phone for a lot more money? Still, rather than a breathtaking conference rollout, Apple is trending on AI Twitter for a very

Starting point is 00:04:40 different reason. They've just released a controversial new paper entitled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. AI threader Ruben Hassid writes, Apple just proved AI reasoning models like Claude, DeepSeek-R1 and o3-mini don't actually reason at all. They just memorize patterns really well. Now, Ruben actually went on to provide a lengthy explanation of the paper, but judging by the way the likes fell off after those 13.4 million views, very few people made it past the first post. Now, for many who follow AI development, the notion that Apple would release an authoritative paper on the topic was perhaps somewhat ironic. Henry Arithmaquine wrote,

Starting point is 00:05:20 be Apple, richest company in the world, every advantage imaginable, go all in on AI, make countless promises, get immediately lapped by anyone two years into the race, nothing to show for it, give up, write a paper about how it's all fake and doesn't matter anyway. Pliny the Liberator wrote, I'm not reading a single AI research paper coming out of that giant stale donut in Cupertino until Siri can do a little bit more than create calendar events on the fourth try. If I were CEO of Apple and someone from my team put out a paper focused solely on documenting the limitations of current models, I'd fire everyone involved on the spot. Andrew White of Future House SF noted that this isn't even the first paper from Apple on

Starting point is 00:05:55 the limitations to AI. He writes, Apple's AI researchers have embraced a kind of anti-LLM cynic ethos, publishing multiple papers trying to argue that reasoning LLMs are somehow limited and cannot generalize. Apple also has the worst AI products. No idea what their quote-unquote strategy is here. Now, on the flip side, the paper was absolutely jumped on by AI skeptics who believe the technology won't get better than it currently is. Gary Marcus, who when it comes to AI is basically a real-life version of the well-actually meme, published his own lengthy screed on the paper, calling it a knockout blow for LLMs. He wrote, anyone who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves. This does not mean that the field

Starting point is 00:06:40 of neural networks is dead or that deep learning is dead. LLMs are just one form of deep learning, and maybe others, especially those that play nicer with symbols, will eventually thrive. Time will tell, but this particular approach has limits that are clearer by the day. Now, Marcus has been declaring that AI development has hit a wall every few months since at least March of 2022, back when it was still referred to as deep learning. So that is important context that you can do what you will with. Remarking on the state of the discourse, AI safety discusser extraordinaire Kat Woods wrote, I hate it when people just read the titles of papers and think they understand the results.

Starting point is 00:07:13 The illusion of thinking paper does not say LLMs don't reason. It says, currently, large reasoning models do reason just not with 100% accuracy and not on very hard problems. This would be like saying, human reasoning falls apart when placed in tribal situations, therefore humans don't reason. It even says so in the abstract. People are just getting distracted by the clever title. So with that in mind, let's talk about what the research actually set out to demonstrate.

Starting point is 00:07:38 The study was designed to test the limits of a reasoning model by asking it to solve a number of puzzles, specifically the Tower of Hanoi puzzle. This puzzle features a number of differently sized disks stacked on a game board consisting of three poles. The goal is to transfer all of the disks from one pole to another, moving one disk at a time, without ever stacking a larger disk on a smaller disk. The game has an algorithmic solution for any number of disks, but the number of steps increases exponentially as you add disks to the puzzle. The paper measured the point at which the reasoning models fail to reason through the steps and observed how the models fail. The core finding was that Claude 3.7, with thinking enabled, could easily complete a six-disk game, struggled a little more with a seven-disk game

Starting point is 00:08:15 and had little ability to reason through the solution to a game with eight or more disks. Similar results were found for o3-mini-high, and the results were consistent across other logic puzzles where complexity can be modulated. The abstract for the paper stated, We found that reasoning models have limitations in exact computation. They failed to use explicit algorithms

Starting point is 00:08:34 and reason inconsistently across puzzles. Essentially, the big takeaway was that reasoning doesn't scale beyond a certain point, even if there are resources left, with the notion being that simply getting the model to think longer won't yield better performance. There were a lot of issues with the methodology that the internet quickly went to work unpacking. Lisan al Gaib, scaling01, repeated the exact prompts used in the paper and found that the models were running up against token limits. The structured output required 10 tokens for each move, and the number of moves is known for this puzzle.
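To make that arithmetic concrete, here is a rough sketch of the calculation. The 10-tokens-per-move figure is the one quoted above, and 2^n - 1 is the standard minimum move count for a Tower of Hanoi puzzle with n disks. The function names are illustrative, and the 64,000-token output budget in the example is a purely hypothetical number chosen to show the shape of the calculation, not a limit taken from the paper or from any particular model.

```python
TOKENS_PER_MOVE = 10  # figure quoted in the analysis above


def min_moves(disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def tokens_to_print_solution(disks: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Tokens needed just to write out every move of the optimal solution."""
    return min_moves(disks) * tokens_per_move


def largest_printable_puzzle(output_budget: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Largest disk count whose full move list fits inside `output_budget` tokens."""
    disks = 0
    while tokens_to_print_solution(disks + 1, tokens_per_move) <= output_budget:
        disks += 1
    return disks


if __name__ == "__main__":
    # Each extra disk roughly doubles the output the model has to print.
    for n in range(6, 16):
        print(f"{n} disks: {min_moves(n)} moves, {tokens_to_print_solution(n)} tokens")

    hypothetical_budget = 64_000  # assumption for illustration only
    print("Largest printable puzzle:", largest_printable_puzzle(hypothetical_budget))
```

Whatever the real budget is, the point of the sketch is that the cutoff is predictable in advance: it depends on the output limit and the doubling move count, and has nothing to do with how hard the puzzle is to reason about.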

Starting point is 00:09:03 Therefore, the models were running into their limits at predictable levels of complexity. They weren't hitting the limits of reasoning. They couldn't physically print out all of the moves while staying inside the output limits. Now, the most interesting part of this failure was that the models actually recognized that they couldn't reason through the solution with their current limits. Instead of starting off the reasoning process and failing when the number of disks was too large, they recognized this fact and provided instructions for how to use the solution algorithm instead.
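For reference, the solution algorithm in question is short. Below is a minimal sketch of the classic recursive approach, assuming pegs labeled A, B, and C; the function names and the small validity checker are illustrative and are not taken from the paper or from any model's output.

```python
from typing import Iterable, Iterator, Tuple

Move = Tuple[str, str]  # (from_peg, to_peg)


def hanoi(disks: int, source: str = "A", target: str = "C", spare: str = "B") -> Iterator[Move]:
    """Yield the optimal sequence of moves for `disks` disks from source to target."""
    if disks == 0:
        return
    # Move the top (disks - 1) disks out of the way onto the spare peg.
    yield from hanoi(disks - 1, source, spare, target)
    # Move the largest remaining disk straight to the target peg.
    yield (source, target)
    # Move the (disks - 1) smaller disks from the spare peg onto the target.
    yield from hanoi(disks - 1, spare, target, source)


def is_valid_solution(disks: int, moves: Iterable[Move]) -> bool:
    """Replay `moves` and check the rules: never place a larger disk on a smaller one."""
    pegs = {"A": list(range(disks, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(disks, 0, -1))


if __name__ == "__main__":
    moves = list(hanoi(10))
    print(len(moves), "moves; valid:", is_valid_solution(10, moves))
```

The procedure itself is a few lines no matter how many disks there are; only the printed list of moves grows exponentially. That is the difference between describing the algorithm and writing out every step.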

Starting point is 00:09:29 For Claude, this behavior started at eight disks, hence the sharp drop-off in performance. Lisan commented, all of this is just nonsense. But no, they didn't even bother looking at the outputs. The models literally recite the algorithm in their chains of thought in plain text and in code. Basically, the takeaway from this analysis was that the Apple researchers weren't measuring the limits of reasoning models. They were kind of just using a ton of extra steps to measure the engineering limits that AI labs have imposed on the models. That's a fairly big problem when the research is being used to suggest that reasoning has hit a fundamental wall rather than a technical

Starting point is 00:10:01 limitation. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI. This episode is brought to you by Blitzy.

Starting point is 00:10:46 Now, I talk to a lot of technical and business leaders who are eager to implement cutting-edge AI, but instead of building competitive moats, their best engineers are stuck modernizing ancient codebases or updating frameworks just to keep the lights on. These projects, like migrating Java 17 to Java 21, often mean staffing a team for a year or more. And sure, copilots help, but we all know they hit context limits fast, especially on large legacy systems. Blitzy flips the script. Instead of engineers doing 80% of the work, Blitzy's autonomous platform handles the heavy lifting, processing millions of lines of code

Starting point is 00:11:17 and making 80% of the required changes automatically. One major financial firm used Blitzy to modernize a 20 million line Java code base in just three and a half months, cutting 30,000 engineering hours and accelerating their entire roadmap. Email jack@blitzy.com with Modernize in the subject line for prioritized onboarding. Visit blitzy.com today before your competitors do. Today's episode is brought to you by Vanta. In today's business landscape, businesses can't just claim security, they have to prove it. Achieving compliance with a framework like SOC 2, ISO 27001, HIPAA, GDPR, and more, is how businesses can demonstrate strong security practices.

Starting point is 00:11:55 The problem is that navigating security and compliance is time-consuming and complicated. It can take months of work and use up valuable time and resources. Vanta makes it easy and faster by automating compliance across 35-plus frameworks. It gets you audit-ready in weeks instead of months and saves you up to 85% of associated costs. In fact, a recent IDC white paper found that Vanta customers achieve $535,000 per year in benefits, and the platform pays for itself in just three months. The proof is in the numbers.

Starting point is 00:12:22 More than 10,000 global companies trust Vanta. For a limited time, listeners get $1,000 off at vanta.com slash NLW. That's VANTA.com slash NLW for $1,000 off. Today's episode is brought to you by Superintelligent, specifically agent readiness audits. Everyone is trying to figure out what agent use cases are going to be most impactful for their business, and the agent readiness audit is the fastest and best way to do that. We use voice agents to interview your leadership and team, and process all of that information to provide an agent readiness score,

Starting point is 00:12:56 a set of insights around that score, and a set of highly actionable recommendations on both organizational gaps and high-value agent use cases that you should pursue. Once you've figured out the right use cases, you can use our marketplace to find the right vendors and partners. And what it all adds up to is a faster, better agent strategy. Check it out at besuper.ai or email agents@besuper.ai to learn more. Now, one of the big criticisms from Gary Marcus was that the models didn't choose to access readily available solution algorithms on the internet and write Python code to solve

Starting point is 00:13:27 the problem. Careful reading of the paper, however, uncovers that the researchers had actually prevented the models from coding, which is fine if we're strictly talking about the limitations of scaling up reasoning, but if we're talking about model capabilities in general, and specifically model capabilities in practice, then access to coding tools, which is something they have access to, should be a part of the discussion.

Starting point is 00:13:49 Matthew Berman commented that access to tools really changes the math, writing, The biggest weakness of Apple's paper showing large reasoning models might not actually be reasoning all that well, is that they do not include the ability for models to write code to solve problems. State-of-the-art models failed the Tower of Hanoi puzzle at a complexity threshold of greater than eight disks when using natural language alone to solve it. However, ask it to write code to solve it, and it flawlessly does up to seemingly unlimited

Starting point is 00:14:12 complexity. Kevin Bryan, a professor of strategic management at the University of Toronto, remarked that this paper is really measuring self-imposed limits to reasoning rather than reasoning itself. He wrote, we can of course program an LLM to spit out millions of tokens in response to Good Evening and use reinforcement learning to iterate creatively on all sorts of possible interpretations, then collate, then brainstorm more, etc. When the models don't do that, it's not because they can't. It's because we use post-training to stop them from doing something so crazy. This does mean that in some cases it should think longer.

Starting point is 00:14:43 We know from things like Code with Claude and internal benchmarks that performance strictly increases as we increase tokens used for inference, on circa every problem domain tried. But LLM companies can do this. You can't, because the model you have access to tries not to overthink. Now, as one case in point, you might remember when OpenAI tested o3 with essentially limitless compute and found a model that effectively beats the ARC-AGI test. However, these runs cost millions of dollars, so the model that was finally released was constrained

Starting point is 00:15:13 to a more reasonable amount of reasoning. TLDR on all of this is that the paper is measuring engineering and cost constraints rather than detecting a scaling wall. Models predictably fail when they know they can't turn out enough tokens to present a full solution. This is actually the desired behavior. You don't want a reasoning model to spend hundreds of dollars failing to reach a full solution. The failure case is also very telling. Rather than spinning their wheels on pointless reasoning that won't reach a conclusion, the models instead describe an algorithmic solution. That is categorically different to just giving up on a more complex problem as some of the commentary suggested was happening. TLDR, the paper ultimately says absolutely nothing about the fundamental

Starting point is 00:15:49 limits of reasoning models. It just runs up against resource constraints in currently deployed AI systems. And yet, this is not even my biggest beef. My biggest beef is who cares. If you tell me right now that o3 isn't actually reasoning, I'm going to look over at the copious amount of work that I have done with this tool over the last month, shrug my shoulders, and then I'm going to keep on prompting o3 to go do business in ways that weren't possible before. This gets to a bigger divide right now, where some people are looking at AI in the context of research and the long-term pursuit of AGI, and others are just focused on capabilities in the here and now. Broadly speaking, it's the research community on the one hand and the business community on the

Starting point is 00:16:33 other. Now, of course, these things do relate to one another. The research community needs its place because it's going to drive the advancements that ultimately manifest as better performance. But in the same way that I've said before that AGI is the least relevant term in all of AI for business people, this is sort of the same idea. I don't care if my agent is an automated workflow, as long as it significantly increases my human leverage, and upscales my valuable AI output. I don't care if my reasoning model is actually reasoning in air quotes, as long as it can do things my non-reasoning models can't. Josh Gans, who's a professor of management at the University of Toronto,

Starting point is 00:17:10 published a long piece, basically articulating a version of what I'm saying. After explaining that reasoning models are actually doing a ton of incredible work in enterprise and academia, he commented, they work exactly as people explain they would work and did not work in some miraculous way that the hype or concern around them generated. And if you worked with them, you would know all this. Now, to the extent that you are looking for a steelman argument for why these issues actually do matter, and that we on the business side of this and the applied side of this should care about some of these questions,

Starting point is 00:17:39 machine learning scientist François Chollet commented, beyond the perhaps superficial semantic distinction between reasoning and pattern matching, there is a fundamental gap in the practical capabilities and behavior of these systems. You don't create an invention machine by iterating on an automation machine. The reason we care about reasoning is because of what it enables. It's not about definitions, it's about capabilities. You can use pattern matching to emulate specific well-known skills, but you cannot use pattern matching to produce autonomous skill acquisition in new domains.

Starting point is 00:18:07 All of that is well taken. I just don't care, man. And for the vast majority of you who are listening now, it also doesn't matter to you. At least not in the here and now. Maybe it does in terms of what we get to in the future. As Gans summed it up, I don't care whether my tool is thinking or reasoning. I care how much it's helping, which is a very different thing. Sure, there is an intellectual question regarding cognition, but that's far removed from the

Starting point is 00:18:28 transformational impact AI can have right now. Nathan Snell wrote, I'm surprised Apple's research paper on LRM is getting so much attention. LRM has limited reasoning capacity. Shocker? It's clear if you use it. Doesn't make it less valuable. He also said what we're all thinking when he added. Also, is anyone else inherently skeptical about research put out by Apple related to AI? They don't exactly have a great track record there. And this is one of the, I think, sort of sad things for these researchers. This all feels to me like it might have been a case of very bad timing. WWDC was gearing up to announce literally squat zero about AI, and Apple researchers dropped this paper that seemed to sort of self-servingly say that

Starting point is 00:19:09 AI matters less than we all think it does. Essentially, the paper was a Rorschach test on AI. For some reason, there is an entire sector of AI discourse in the economy that seems to be dedicated to turning Paul Krugman's internet is no more significant than a fax machine quote from 1998 into an entire career positioning. Author Ewan Morrison posted, AI has hit a wall. AI companies will try to hide this. Hundreds of billions have been spent on the wrong path. Kevin Roose really sums it up when he writes, there is a strain of AI skepticism that's rooted in pretending like it's still 2021 and nobody can actually use this stuff for themselves. It's survived for longer than I would have guessed. Look, when it comes down to it, I think it's important that researchers have great debates about all of these things.

Starting point is 00:19:58 And I think it's great from the standpoint of what I want as a person in business who uses these tools for business, which is constantly improving models. The academic discussion and discourse that is so important is way upstream from business value, yes, but it is still part of the same stream. And academic research should be a place where different ideas can compete and people can disagree strenuously. I just think that when it comes to the practical day-to-day for most people using these tools, it doesn't matter a fig. And for those who are trying to turn this into some sort of gotcha, I just don't know what they're trying to accomplish. Signal hilariously tweets,

Starting point is 00:20:35 Apple proves that this feathered aquatic robot that looks, walks, flies, and quacks like a duck may not actually be a duck. We're no closer to having robot ducks after all. What are we even doing here anymore? The answer, at least for people who are listening to this, probably, is building really cool stuff, doing really cool things, being really excited about what capabilities AI has, and ultimately, not caring all that much over whether you call it a duck or a feathered aquatic robot. That's going to do it for today's AI Daily Brief. Until next time, peace.
