Everyday AI Podcast – An AI and ChatGPT Podcast - EP 543: Apple’s Weaponized Research: Inside its illusion of thinking paper
Episode Date: June 10, 2025

Apple’s new AI paper says advanced AI thinking is an "illusion." Is this a groundbreaking scientific discovery? Or is it a cynical, weaponized piece of marketing dropped the weekend before W...WDC to hide the fact that Apple is catastrophically behind in the AI race? We read the paper so you don't have to.

Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com

Topics Covered in This Episode:
Apple's Viral Illusion of Thinking Paper
Critique of Apple's AI Research Methodology
Apple's AI Deception and Flawed Logic
Strategic Corporate Propaganda in AI Research
Apple's $2 Trillion AI Market Loss
AI Reasoning Models Tool Use Restrictions
Tower of Hanoi and Token Limitations
Apple Research's Industry Skepticism Strategy

Timestamps:
00:00 Daily AI Insights & Growth
04:32 "Evaluating AI: Illusion of Thinking"
08:30 "Apple's AI Papers: $2 Trillion Dilemma"
11:23 Apple's Missed $2 Trillion Opportunity
15:06 Apple's AI Oversight: Massive Blunder
18:55 PhD AI Research: Industry Influence
21:43 Apple Challenges AI Test Validity
26:15 AI Model Testing Complexity
29:16 "The Challenge of Complex Puzzles"
33:10 AI Testing Limits: A Designed Failure
36:45 Questioning Study Methodology
37:54 iPhone SOS Satellite Test Fails
41:29 Flawed Report Undermines Credibility
46:18 "Corporate Strategy Masked as Research"
50:25 Apple's Controversial Stance on Research

Keywords:
Apple's illusion of thinking, Apple's AI research paper, AI reasoning models, large reasoning models, strategic deception, cherry picked science, weaponized research, flawed logic, cherry picked testing, all or nothing grading, Apple's marketing tactics, Apple vs. Microsoft, Apple's AI failures, WWDC conference, Apple's intelligence, Ajax model, Apple's AI spending, Apple's competition, generative AI, Microsoft, Google, OpenAI, code usage restriction, token output limits, reasoning collapse, AI's reasoning limitations, Tower of Hanoi, reasoning lab, Claude 3.7 Sonnet, DeepSeek, thinking models, chain of thought processing, corporate propaganda, premeditated media strike, fear, uncertainty, doubt, strategic media strike, data contamination, Apple's research credibility, research methodology, scientific integrity.
Transcript
This is the Everyday AI Show, the Everyday Podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Apple's latest AI research paper has gone viral.
So viral, actually, that it showed up in the nightly business newsletter my wife reads,
which usually has absolutely nothing to do with AI.
So Apple's The Illusion of Thinking paper shows evidence, well, they say, that large reasoning models slam into a wall the moment that tasks get
too demanding. So sounds pretty fatal for AI, right? Maybe. But if you dig deeper, you'll find
flawed logic, cherry-picked testing, and an all-or-nothing grading rule that would flunk
Einstein. In other words, if you take enough time to deconstruct this study, you'll find
it's not much of a study at all. It's marketing from Apple. And you shouldn't fall for it.
So stick with me for the next 30 minutes or so.
And I'll expose this quote unquote research paper for what it is.
It's strategic deception.
It's cherry-picked science and it's weaponized research at best.
All right.
Hope you're excited for this one.
I am.
If you're new here, welcome to Everyday AI.
My name is Jordan Wilson,
and I'm the host, and we do this every single day.
This is your daily live stream podcast and free daily newsletter,
helping us all not just keep up with the world of AI,
but how we can use all this information to get ahead to grow our companies and our careers.
So sometimes the information like today's show can be a little confusing,
and that's what we do.
We break it down, whether it's myself or bringing on world-class experts.
We do it every single day, and then we break it down in our free daily newsletter.
So if you haven't already, please go to youreverydayai.com and sign up for that free daily
newsletter. We're going to be recapping today's show and a whole lot more, everything you need to
stay in the loop and stay ahead and be the smartest person in your company. And if that's what you're
trying to do, then you are definitely in the right place. So some days, we start off by going over
the AI news. I don't want to make this an accidental like 50 minute podcast. I'm actually trying to
keep these things under 30 minutes. But we'll see, because at least for today, it's Hot Take Tuesday,
and I got takes.
Livestream audience, it's good to see you.
Thanks for tuning in.
Let me know.
Should I take it nice?
Should I ramp it up?
I'm kind of feeling spicy.
I hope that's okay with you.
But let's just get into it.
All right.
Let's deconstruct this paper.
The illusion of thinking.
All right.
Like I said, it's been grabbing a lot of headlines recently.
And let me also put my cards on the table, right?
Because you may be thinking, okay, this Jordan guy, you know, he's obviously very pro-AI.
Am I? Sure.
Yeah.
You could say that.
If I'm being honest, I'm pro AI because I feel there's no real choice, right?
I believe in large language models, the power of large language models,
just the way the entire world is investing in them.
There's really no other solution.
And one other thing, I want to talk a little bit very quickly about my background.
So I've mentioned it a couple times in our, you know, 540-plus episodes together.
But I started my career as an investigative reporter.
I did okay.
I was a Pulitzer Fellow.
I won ACP story of the year.
So when I look at these things, I don't just read
the study. Yes, I read the study manually twice. I fed it to three separate large language models.
I combined a bunch to have conversations with the paper. So I did it old school manual,
you know, and then using AI as well. So I want you to know, I'm not just blindly ever following
any study, what it says or what it doesn't say, whether I have a preconceived notion on if I agree
with it or not. But essentially what this study said is, hey, these large language models that
reason or what they call large reasoning models, uh, they don't really think, right?
This whole thinking thing, it's an illusion. Um, and I'm, I'm actually very excited to break
this one down, uh, but I'm hoping to do it in a very concise way. So, uh, if I go on fewer
tangents today, uh, that's probably why. So let's just start at a glimpse. All right.
So maybe you don't have 30 minutes. Maybe you have five. Well, let me spend the next two minutes,
just giving this to you at a glimpse. Here's
what's happened. And then for the rest of the episode, I'm going to lay it all
out for you. So, you know, I'm not going to keep you captive here for 20 minutes just to get to what's actually happened.
Okay, so Apple just released, about four days ago, its Illusion of Thinking paper.
And this was three days before their big conference, the Worldwide Developers Conference.
And in this paper, they publicly claim that advanced AI reasoning is more or less fake.
They're saying that these large reasoning models don't actually think. Their entire experiment was fundamentally rigged.
All right.
And I'm going to show you why and show you how.
But essentially, they said that AI reasoning models couldn't use code.
What?
Okay.
Which is the single most effective way to solve the problems that the researchers were giving them.
So, you know, the researchers are like, hey, here's all these problems.
And normally a large language model would be like, yeah, I'm going to use code, right?
I'm going to use the tools at my disposal.
But the Apple researchers said, nah, you can't.
They also misrepresented the AI's intelligent decision to give up on impossible brute-force tasks as a reasoning collapse, which I would say is not the case.
They were treating a feature as a bug.
Next, Apple failed to disclose that their hardest tests were technically, physically
impossible for the AI to pass due to its handcuffed token limits.
Yeah, more on that in a bit.
And y'all, y'all know I bring receipts, all right?
Also, the paper's timing reveals its true purpose: a strategic media strike to distract
from Apple's own AI weakness, and their lack of AI, right before WWDC.
Because everyone knows it's no secret.
Apple has failed.
And I think this will probably go down as the biggest failure in business history,
Apple's absolute failure to put together anything resembling artificial intelligence.
Maybe that's why they called it Apple Intelligence, because they couldn't actually figure out artificial intelligence, right?
And this also, this wasn't a good faith scientific study.
It really wasn't.
It was a calculated act of corporate deception disguised as research.
And I'm not blaming this on the researchers, right?
I'm sure that there were some higher ups that were pulling some strings or maybe that, you know, passed this down like, hey, we need to, you know, get some research that's very hard on these large language models, these reasoning models.
All right. So that's what we're going to be going over.
But aside from what I just laid out, which we're going to go into more depth, I want to talk.
There's at least two trillion other reasons Apple is putting out this quote unquote paper.
All right. And I have taken my time, mainly because I do our hot takes on Tuesday and this paper came out,
I believe it was a Friday or Saturday. But there's two trillion reasons, and no one's talking about
this. Why is Apple doing this? Why has Apple put out multiple papers that literally go against
the power and the capabilities of large language models and AI? Well, one I kind of already answered:
they can't figure it out.
But here's two trillion reasons why.
All right.
So this is for our podcast audience.
I do have some visual slides on today's episode.
I'm going to do my best as I always try to do to describe them to you.
But you can always check out the show notes or go to our website and watch the video version.
So then you can see what I'm sharing on my screen.
But essentially, pre-generative AI,
and this was in 2021,
Apple was crushing the world. This was like '92 Dream Team kind of dominance for Olympic basketball.
It wasn't close. Apple had a $2.1 trillion market cap in 2021. And the next closest company,
Microsoft, had only a $1.6 trillion market cap. Now, I'm not the best at math, but that's not close.
That was a $2.1 trillion market cap versus Microsoft's
$1.6 trillion market cap. So they had a half-trillion-dollar lead on the next biggest
company in the world, which is not even close. They were blowing out the competition. Like I said,
this is the '92 Dream Team. This is, you know, the '97 Bulls, the 72-and-10 right there. No one's close.
It is a blowout. Apple is blowing out
the rest of the world in terms of we are the biggest, we are the best company and it's not even
close. Fast forward to today. Yeah. Apple is the third biggest company in the U.S. by market cap,
right? And now they are a half trillion dollars behind Microsoft, right, which I would say,
you know, depending on how you look at it, you could say it's Microsoft, you could say it's
Google. I would probably say Microsoft is Apple's closest competitor, right? Because everyone's
kind of changing in terms of where the revenue is coming from, you know, where they're trying
to compete, et cetera. But you could say historically that Microsoft and Apple have been the two
competing with each other. So let me say that again. In 2021, Apple was blowing Microsoft out,
all right? To the tune of a half trillion dollars. Now Microsoft is blowing out Apple. And if you
took the same growth rates, if you take the growth rate that Microsoft had from 2021 until today,
going from about a $1.6 trillion market cap to a $3.5 trillion market cap, if Apple had stayed on a similar
or that same growth trajectory as Microsoft did, do the math, y'all, that means Apple would be at about a $5 trillion
valuation. Instead, they're staggering at $3 trillion. So essentially, if they would have made
similar moves to the ones Microsoft did, presumably they would be a $5 trillion market cap company.
So they have left, you can make the argument that they've left $2 trillion in market cap
on the table by not figuring out AI.
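To make the arithmetic behind the $2 trillion argument concrete, here is a rough Python sketch that applies Microsoft's growth multiple to Apple's 2021 market cap. The dollar figures are the rounded ones quoted in the episode, and the function name `projected_cap` is only an illustrative choice. With these inputs the projection lands nearer $4.6 trillion and the gap nearer $1.6 trillion, which the episode rounds up to about $5 trillion and $2 trillion.

```python
# Market caps in USD trillions, as quoted in the episode (rounded, approximate).
APPLE_2021 = 2.1
MICROSOFT_2021 = 1.6
MICROSOFT_TODAY = 3.5
APPLE_TODAY = 3.0


def projected_cap(start: float, peer_start: float, peer_end: float) -> float:
    """Scale `start` by the same growth multiple the peer company achieved."""
    return start * (peer_end / peer_start)


growth_multiple = MICROSOFT_TODAY / MICROSOFT_2021
hypothetical_apple = projected_cap(APPLE_2021, MICROSOFT_2021, MICROSOFT_TODAY)
gap = hypothetical_apple - APPLE_TODAY

print(f"Microsoft growth multiple: {growth_multiple:.1f}x")
print(f"Apple at that multiple: ${hypothetical_apple:.1f}T")
print(f"Gap versus roughly $3T today: ${gap:.1f}T")
```

Swapping in more precise market cap figures would change the result, so the output is best read as an order-of-magnitude check on the claim rather than a valuation.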
And it's not for not trying, right?
We've seen a lot of reporting going back multiple years, a report from 2023, which I remember
covering the day it came out on the Everyday AI
show. Yeah, we've been doing this thing for a while. And it said Apple is reportedly spending
millions of dollars a day training its AI. And Apple internally at the time said that their
internal model, which was code-named Ajax, and it did come out under a similar name, they said
it is the most advanced language model and it is more powerful than ChatGPT. All right. Imagine spending
millions of dollars a day just on training AI models if you're Apple. And when you finally
quote unquote released it,
you didn't even say it by name in the main keynote.
It was almost like Apple was embarrassed by the language model that they released at last year's
WWDC.
It was a small language model that lived on device.
Its Ajax model.
They didn't even say it by name in the main keynote.
Right.
Because if they would have,
it would have been embarrassing.
Right.
So it's almost like they didn't want to claim it because they had reportedly spent many,
many, many millions of dollars.
And by many millions, at that point, I mean, yo, millions of dollars a day back in 2023,
do the math.
That's potentially hundreds of millions or billions of dollars that they spent on AI that just didn't work.
And like I said, Apple is the only big tech company that has failed to produce the most basic of AI offerings.
Apple's produced nada.
Nada that works, at least.
All right.
So some headlines here from some different publications like PYMNTS, Axios, Bloomberg, PC Magazine, Computerworld, The Verge.
Let's read some of these headlines, shall we?
"This is a crisis."
"New Apple report claims we'll get no Siri upgrades at WWDC due to AI turmoil."
"Apple's AI headaches could lead to lukewarm revenue growth."
"Drama at Apple as AI
failures cause heads to roll." "Apple sued for false advertising over Apple Intelligence." "Why Apple still
hasn't cracked AI." "Two more class action lawsuits target misleading Apple Intelligence claims." Yeah,
Apple's rollout of AI was so bad that they are now facing multiple class action
lawsuits because they couldn't deliver the simplest version of AI that they promoted, right? And I'm technically one of
those people, right? I have to be honest. I'm recording this on an Apple Mac mini. The camera I'm
using for the live stream here, it's the new iPhone. And one of the reasons I bought this new iPhone
is because they're like, oh, we're going to have all this new AI on the iPhone. And here it is
almost a year later. There's not a single thing on this iPhone that's quote unquote AI. There's
not. Right? Like, I'm looking for it. I'm like, hey, Siri, find me the AI. And Siri's, you know,
10 minutes later, "Would you like me to use
ChatGPT for this query?"
So yeah, Apple has fumbled the bag harder than any company has ever fumbled the bag, I would
say from a business perspective.
Because when you think of the numbers, think of the numbers, I don't think that's an
exaggeration because that, even though it's a hypothetical scenario I laid out, it was probably
a realistic scenario that Apple should have traveled that path.
They should have grown at the same rate that Microsoft grew over the last four years because
of generative AI, but they didn't.
They didn't, but they should have.
A multiple-trillion-dollar market cap mistake from Apple,
which would very likely, I think, qualify as the biggest business blunder ever.
And it's probably not even close.
So all of Apple's competitors have been cashing in on AI, and Apple is still failing.
So with trillions of dollars on the line, Apple needed a red herring.
The claim that AI reasoning
is an illusion, right? Because all these other companies, even though Apple has their own
edge AI, these are small language models that live on device. They don't have a large reasoning
model. So essentially what's happening here is all these other companies are running away,
you know, getting insane revenue from their AI offerings. And Apple's like,
hmm, what if we just throw some deception and doubt and confusion in the ring here,
right before our big event, right? So then people will not be mad at us if we don't release anything
AI at WWDC this year. So that was yesterday, on Monday. Apple had their WWDC event where they
essentially took a quote unquote gap year. They reportedly took a gap year on AI. They didn't really
release anything new. Whereas last year at their WWDC, they said AI every three seconds. They
actually rebranded it because it's Apple.
They're like, oh, it's not even artificial intelligence.
It's Apple intelligence.
Our AI is better than AI, right?
And here we are a year later.
And they're like, whoops, we're getting sued.
We couldn't deliver.
So let's take a gap year.
And instead, let's create some confusion.
Let's get a huge viral study.
Let's throw a, you know, a big smoke screen in front of everyone, cause some chaos.
And then maybe people will temporarily forget that we stink at AI.
and that we haven't been able to deliver on our promises.
And maybe shareholders will look at this study and be like,
oh, smart Apple.
Yeah, look.
This great research shows that these large reasoning models don't work.
Good thing Apple hasn't figured it out.
Wrong.
So let's actually deconstruct this thing.
Let's take it down.
All right.
On my screen, I'm showing you the difference between Apple's quote unquote study,
which is on the left,
and what I think a real study should look like on the right.
All right.
And I've talked about the one on the right.
Apple's study, quote unquote, is just Apple researchers, all right, which is not abnormal.
Okay.
I'll say this.
And I'm not saying this in a, how do I say this?
Like a lot of people are throwing some shade at some of the Apple researchers because they're
technically interns.
I'm not going to do that because that's technically normal.
Right.
So when PhD candidates in computer science, right, are looking to complete some meaningful research,
you know, a lot of times big tech companies will hire them on as interns or they were
already interns there to begin with.
So I'm not going to go down that route because these people are very capable.
But one thing I want you to look at, it's all Apple researchers.
And that's it.
And like I said,
That's usually only normal when you are announcing a new model and you put out a paper around a new model, right?
Otherwise, good research that changes the conversation on artificial intelligence would usually look like the paper on the right.
This paper is Personhood Credentials.
All right.
This was a pretty meaningful research paper that changed the narrative, or at least tried to change the narrative,
on, you know, AIs that are trying to, you know, imitate humans.
This is what research looks like.
Because on this piece of research, you have researchers from multiple big companies.
You have them from OpenAI, Harvard, Microsoft, University of Oxford.
You know, a lot of other ones, my screen is actually a little blurry here.
But it's from dozens of companies and universities
throughout the world.
That's what a normal research paper looks like, right?
You would see multiple companies, multiple research institutions.
On the left, that's what marketing looks like.
Only Apple researchers, nothing else.
All right.
Real quick, got to take a quick break for a word from our sponsors.
This podcast is supported by Google.
Hey everyone, David here, one of the product leads for Google Gemini.
Check out Veo 3, our state-of-the-art AI video generation model in the Gemini app,
which lets you create high-quality, eight-second videos with native audio generation.
Try it with a Google AI Pro plan or get the highest access with the ultra plan.
Sign up at Gemini.com to get started and show us what you create.
All right, let's get back into it and let's break down this paper a little more.
And I'll tell you this, the paper is out there. It doesn't take long to read. And I think enough people by now have already gone through the more technical side of this paper line by line. So I'm just going to focus more on some big-picture ideologies and methodologies that were completely elementary and just defied logic. Not in a good way. Right. So let's start with Apple's premise and these kind of flawed benchmarks.
Okay. So Apple claimed the need for this test, right? Like, why would they even come out,
or sorry, why would they come out with this research? They argued that standard AI tests for
math and coding are unreliable due to data contamination. Data contamination is like kind of saying,
like, hey, all these other, you know, studies that all these other researchers do from multiple
companies, multiple universities. Yeah, they got it wrong because, you know, their data's bad.
That's not good for a researcher. And that's why I also don't think that this is going to
turn out well for Apple, because they essentially just kind of slapped a bunch of researchers silently
in the face and said, yeah, your research is rubbish because you didn't even know that
your data is contaminated. Not a good look. All right. So they said that
every other single research paper, you know, it's just contaminated data, and
these models are essentially just memorizing. They're not even reasoning.
They're just remembering, right? So it's a valid
concern here. So, okay, we're still fine. And their proposed solution was to create a clean and
controllable environment to test what they said were the true, unvarnished reasoning limits of modern AI.
All right. Sure. Let's see what you got, Apple. So they came up with their kind of reasoning lab.
They built what they said was a sterile testing environment using four classic logic puzzles, framing them as
pure tests of logic. And each puzzle was paired with a simulator, an automated referee that
checked every single move the AI made and immediately flagged the first illegal one, ending the test
with a failure. So if a model got a single move in any of these
four puzzles wrong, test over, failure. So not good. That is hyper-strict, unforgiving. That's not
how large language models, especially reasoning models, would generally work. But okay,
sure, Apple, do your thing.
Not making sense, but let's keep going.
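The referee described above can be sketched in a few lines of Python for the Tower of Hanoi case. This is a minimal illustration under assumptions, not Apple's actual simulator: the function names `replay` and `all_or_nothing_score`, the peg numbering 0 to 2, and the (source, destination) move format are all hypothetical.

```python
from typing import List, Optional, Tuple

Move = Tuple[int, int]  # (source peg, destination peg), pegs numbered 0..2


def replay(num_disks: int, moves: List[Move]) -> Tuple[Optional[int], List[List[int]]]:
    """Replay moves and stop at the first illegal one.

    Returns (index of first illegal move or None, peg state when replay stopped).
    """
    # Each peg is a stack; disk sizes run from num_disks (bottom) down to 1 (top).
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and bool(pegs[src])  # there must be a disk to move
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])  # no larger on smaller
        )
        if not legal:
            return i, pegs
        pegs[dst].append(pegs[src].pop())
    return None, pegs


def all_or_nothing_score(num_disks: int, moves: List[Move]) -> int:
    """1 only if every move is legal and all disks end on the last peg, else 0."""
    first_bad, pegs = replay(num_disks, moves)
    solved = first_bad is None and pegs[2] == list(range(num_disks, 0, -1))
    return 1 if solved else 0


print(all_or_nothing_score(2, [(0, 1), (0, 2), (1, 2)]))  # a correct 2-disk solution
print(all_or_nothing_score(2, [(0, 1), (0, 1)]))          # second move is illegal
```

Under this kind of grading, a long answer with one slip near the end scores the same as an answer that is wrong from the first move, which is the strictness being criticized here.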
I do want to talk specifically about one of these kind of logic games that they used,
the Tower of Hanoi.
So this is a very classic game.
And also all of these games are classic,
which was already disproving the point that Apple was trying to prove,
because they said all these other benchmarks out there were contaminated.
So they thought like, oh, well, we can use a game like Tower of Hanoi that's, you know,
non-deterministic because it's a game. Wrong. All the solutions, the algorithm,
everything about this Tower of Hanoi is on the internet. It's in the training data. So
even their original reasoning for creating their games to test these reasoning models was
absolutely bonkers. Like, no, you're already wrong and we haven't even
started. All right. So this was their thought. So the Tower of Hanoi is a classic computer
science problem. You have to move disks between pegs, never placing a larger disk on top
of a smaller disk. And there's kind of three towers. All the disks start on the left tower,
and you have to ultimately move them all the way over to the right tower with the largest disk on the bottom and the smallest disk on the top.
So, you know, if there's only three disks like this example I have on the screen, it's not terribly hard, right?
But as you add more disks, there's more complexity.
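For a sense of how quickly that complexity grows, here is the standard textbook recursive solution as a short Python sketch. The function name `hanoi_moves` and the peg numbering are illustrative choices; the well-known property it demonstrates is that the shortest solution for n disks takes 2^n - 1 moves, so each added disk roughly doubles the work.

```python
from typing import List, Tuple


def hanoi_moves(num_disks: int, src: int = 0, dst: int = 2, spare: int = 1) -> List[Tuple[int, int]]:
    """Return the shortest move list that shifts num_disks from src to dst."""
    if num_disks == 0:
        return []
    return (
        hanoi_moves(num_disks - 1, src, spare, dst)    # clear the smaller disks out of the way
        + [(src, dst)]                                 # move the largest disk
        + hanoi_moves(num_disks - 1, spare, dst, src)  # stack the smaller disks back on top
    )


for n in (3, 5, 10):
    moves = hanoi_moves(n)
    # The move count always equals 2**n - 1.
    print(n, "disks:", len(moves), "moves")
```

The whole algorithm fits in a handful of lines, which is also why a model allowed to write and run code would have a much easier time than one required to type out every move as text.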
So, you know, as an example, they gave games like this, and there's three other ones, but
let's just talk about the Tower of Hanoi.
And then they gave a system prompt and then a prompt to different reasoning large language models,
and then they had them
output their answer in text form, right?
And then they had a simulator that essentially,
you know,
double-checked all of the AI models' results.
Okay.
Sure.
They also did Checker Jumping, River Crossing, and Blocks World.
So let's talk about the actual models.
So they tested thinking versions of DeepSeek R1
and Claude 3.7 Sonnet.
They did a lot of more technical testing.
They technically tested some OpenAI models,
but OpenAI doesn't show the complete chain of thought,
whereas DeepSeek R1 and Claude 3.7 Sonnet thinking do in the API.
So, you know, they did the right thing there, right,
by making those the baseline models,
and they also tested them against the non-thinking versions of themselves,
which just adds some complexity.
That's not even what we're doing here.
But like I said, the scoring system is absolutely brutal,
because if you make one mistake on the way to getting it perfect, it's a zero.
So obviously, when you talk about these puzzles, they're extremely complex, right?
And there's many different ways that you can solve them.
But also you have to think of the context window of these models, right?
And also the output limit for tokens, which we're going to talk about here in a second.
So this part is crucial.
All right.
Because, well, let's actually first look at the results.
So the results from what Apple reported: they said on easy puzzles, standard, or non-thinking, models did better.
On medium puzzles, these reasoning or thinking models excelled.
And then on hard puzzles, they said, all models completely failed.
They didn't even try.
Right.
And this is what they called the effort collapse.
So this is where you saw.
And as a former journalist, when I read this on Saturday, I'm like, oh, gosh, the media is going to get,
because I saw it literally once it came out, right? Because it was trending on Twitter right away,
because you saw all these, you know, headlines like, oh, you know, reasoning models collapsing,
you know, the AI wall, right? Like all these AI doomsday articles. And like, I'm reading this and
I'm like, oh, gosh, like the media is going to completely fall for this, right? Being a former
journalist, nothing against, I, like, I go to all these conferences. I meet brilliant tech
journalists. And then there's some that, you know, are overwhelmed and you get all these press
releases and you're like, okay, this is a salacious headline. Okay, it looks factual. It's a research
paper. Sure. Let's go with it. Right. It's going to click. Right. We're going to get clicks.
Look at these headlines we can put on this thing. Right. So that's kind of what happened. And,
you know, they talked about this effort collapse. And that was their headline finding, that on the
hardest puzzles, the thinking models essentially would think less or even just give up,
generating fewer words before failing.
So they just said, oh, reasoning models give up.
And then also the algorithm failure.
Their supposed killer blow was in a separate test.
They gave the models the step-by-step instructions, the algorithm, and it didn't help.
And they all still failed at some point.
So this is Apple's conclusion that, well, they just failed, right?
And I have a graph here.
I'm not going to spend five minutes to explain it.
But this just shows the complexity for the Tower of Hanoi example and the number of disks.
So, you know, the more disks in that example, the much harder it gets, right?
I can solve it with three.
I could probably solve it with four, but I don't have time to waste.
You know, to solve it with anything more than that, you've got to either have, like, a computer science, math, crazy logical brain,
or you have to just study this game, right?
It's how some people can do the Rubik's Cube, you know, while juggling in 10 seconds while, you know, spitting fire, you know, or whatever these, you know, incredible acts of, you know, athleticism and brainpower people do.
But, you know, for the most part, the average human might be able to solve this at, you know, four disks, five disks. If you're a genius, maybe longer.
But a human's not solving this at eight disks, at nine, ten. Definitely not there, right? So essentially, it's not
surprising necessarily that an AI couldn't, right? Because if you get the smartest humans in the world
and give them a 15-disk puzzle, are they going to be able to do it? I don't even know if it's possible, right?
Anyways, let's look a little bit here at what this actually means from an output token standpoint. That's
important because, aside from the fact they didn't allow the models to use
code, which, come on, they also surprisingly said that they only chose the models with a 64K
token output limit.
All right.
That's important to talk about because of one of the requirements that the models had to meet in
the output.
So Apple said that they weren't counting thinking tokens.
That's not usually how it works.
So that, you know, kind of chain of thought processing, which is a lot of what's happening
under the hood.
But they did require the model to spit out every single move.
And to put out a move, it's actually kind of complex.
It's not like B1.
It's not like chess.
You know, I don't know chess, but it's not like B2 to D2.
Right.
One move can be very complex and can eat up a lot of tokens.
So conservatively, right, I looked at the actual example moves that they gave.
They obviously didn't share their whole findings.
It was very little, and I saw that most moves were
10 to 12 tokens. So conservatively, a 13-disk problem of this Tower of
Hanoi would require 65,000 output tokens. I'm going to repeat that. The study was not possible,
right? They did it all the way up to 15, 20 disks.
Can't do it.
If you require the model... And Apple, hey, Apple researchers, next time, do what smart researchers do.
Yeah, I'm getting mad because I read a lot of research papers.
And this one I knew was marketing and that made me upset, right?
Not just because I do this every day, right, but because there's a scientific community that I think is disgusted by this, and rightfully so.
This was a haphazard,
terrible study.
Let me just say,
let me just say how it actually is.
This is a terrible study.
They didn't share any of their actual results.
They said,
here's this system prompt.
Here's an example of a prompt.
And here's our overall outputs, right?
You need to share.
Share exactly.
Here's what the chain of thought said.
Here was, you know, on the hardest,
on an eight, on a 10 disk.
Here's what the output was.
But going by how large language models work and the requirements in their own paper,
they would have to output every single move.
And if, and I'm being ultra conservative here, to solve a 13 disk would take more than 8,000 moves.
And to be able to spit those all out as required by the system prompt and the example in the system prompt,
it's not possible.
65,000 tokens.
Okay.
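For anyone who wants to check that math, here is a minimal sketch in Python. The 2^n - 1 formula is the standard minimum move count for Tower of Hanoi. Everything else is an assumption for illustration: a flat token cost per move (8 as a lower bound, plus the 10 to 12 estimated above), the 64,000 and 128,000 budgets, and the function names, which are made up and not from the paper.

```python
def min_moves(disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def output_tokens(disks: int, tokens_per_move: int) -> int:
    """Rough output size if every move must be written out.

    Assumes a flat token cost per move, which is a simplification.
    """
    return min_moves(disks) * tokens_per_move


def largest_printable(budget: int, tokens_per_move: int) -> int:
    """Largest disk count whose full move list fits in `budget` output tokens."""
    disks = 0
    while output_tokens(disks + 1, tokens_per_move) <= budget:
        disks += 1
    return disks


if __name__ == "__main__":
    # Columns: assumed tokens per move, tokens for 13 disks,
    # largest disk count that fits in a 64,000-token budget.
    for tokens_per_move in (8, 10, 12):
        print(
            tokens_per_move,
            output_tokens(13, tokens_per_move),
            largest_printable(64_000, tokens_per_move),
        )
    # 8 65528 12
    # 10 81910 12
    # 12 98292 12
```

Under these assumptions, 13 disks needs 8,191 moves, and even at 8 tokens per move that is about 65,500 output tokens, which is where a figure like 65,000 comes from. With a 128,000-token budget and 10 tokens per move, the same sketch tops out at 13 disks.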
So Apple, you literally designed a test that you knew was going to fail at a certain point of complexity, at least according to kind of the laws and the math set forth.
So their conclusion, at least: well, reasoning models are an illusion
and the thinking that we see is a trick, right?
They're not actually thinking.
You know, they're just, they're just doing next token prediction.
You know, it's stage one thinking, not stage two, right?
I'm going to have a whole episode on this at some other point.
The concept of reasoning, right?
In like stage, like stage one and stage two thinking, right?
So what is,
what is reasoning, right?
I'd like to say it's just connecting stage one thinking anyways, right?
Stage one is quick, intuitive, automatic responses based on pattern recognition
learned from data.
That's stage one.
And then stage two represents a more deliberate, analytical, or conscious approach that involves
reasoning and planning.
So you could say the same thing about, you know, non-reasoning models and reasoning
models.
Non-reasoning models, right?
These are the faster ones.
These are, you know, pattern recognition.
But all stage two reasoning thinking is, it's just stage one, but slower.
Right.
So I don't know.
Like even the concept of arguing against reasoning models seems a little bit illogical
when it's just really made up of stage one thinking anyways.
It's like, what is reasoning?
It's not a different language.
You're just taking more time doing stage one thinking, right?
Pattern recognition.
That's all reasoning is anyways in my head, right?
I don't touch the stove because it's hot, right?
But I've learned those different things that lead me to make that reason or to, you know,
think or plan ahead in a certain way, right?
If I'm planning for a big show like this, I spent many hours planning this show.
I'm using stage one thinking, right?
That's literally what I'm doing.
Pattern recognition.
I've done this so many times.
I recognize patterns.
I put them together, right?
That's planning.
It's a lot of, it's thousands or millions of neurons
firing off in our brain.
That's just quick, intuitive, automatic responses based on data and pattern recognition.
That's all it is.
Anyways, I'll save that for another day, another show.
So let's get back to this Apple study, right?
The other thing is Apple, not only did they cook the books beforehand.
Sorry, you did.
Unless you actually share the data and we can make an assumption otherwise or we can make a connection otherwise.
If we go by the math, if we look at exactly what happened and the fact that they literally decided the way that we're going to measure reasoning.
Well, they said the data is contaminated.
So they had this brilliant idea.
Let's use a non-deterministic game.
It's already all on the internet anyways.
It's already in the training data.
So you're already wrong to begin with.
And then you cherry-pick, right?
This is almost like they got results.
And it then seems like they just reverse engineered the entire study.
Right.
I'm not saying they did,
but in theory that could have happened.
The study makes no sense.
Go read it for yourself two or three times.
And then go talk with a large language model.
Don't lead a large language model.
Just ask, does this make sense?
Or ask your own self.
Does this make sense?
Anyways, I have an example study here, an iPhone study.
All right?
So the Apple researchers, you know, they went through 25 rounds.
Let's say I get 25 new iPhones.
And I turn off cellular data
and I turn off Wi-Fi, I turn off Bluetooth, I turn off everything,
but there's a new feature on iPhones called SOS, and it uses satellite.
All right.
And then I go on vacation, and I'm on satellite mode, and I'm testing the phone.
I'm testing, but I'm only testing a couple things, you know, just like Apple did.
I'm just going to test, you know, FaceTime and phone calls and getting on social media and using, you know, ChatGPT and Google Gemini on my phone.
That's what I'm going to test.
Okay.
And then, well, turns out it doesn't really work very well.
So now, instead of coming up with a specific report that says, hey, I'm
reviewing this SOS satellite feature, which just sends messages to emergency response services,
instead, I'm going to say, I'm going to put out, well, it's factual, right?
I can put out the facts.
I can say, hey, here's what I did.
And then at the very end, I'm going to say, hey, I restricted, you know, Wi-Fi and Bluetooth, right?
It's similar to the way that Apple set up this study.
They restricted tool use,
which is the way that any reasoning model would solve this thing.
And guess what?
I'm going to solve it here live in like 30 seconds.
I'm not going to solve it.
A large language model is going to solve it.
You're going to see when you give the model the tools that it needs,
it does the job.
So I don't know why Apple thought, oh, this is brilliant.
We'll just restrict its core capabilities.
We'll put it in this super confined box,
we'll sprinkle a bunch of, you know, big words on people,
we'll send it out to all the journalists,
and they're going to cover it.
Yay!
No.
Yeah, just wait until my report,
the illusion of iPhone connectivity drops, right?
FaceTime doesn't work.
So why does the paper fail?
Well, there's a lot of reasons.
I'm going to go through this quickly.
The data, it is precise,
but the interpretation is a spectacular failure of logic.
All right.
Let's look at my just, and I could go on for hours.
I'm going to try to go fast now.
But their clean test.
So let's go over.
I have five critiques here.
All right.
So first,
the test is rigged.
Their clean test used different puzzle games.
One was Tower of Hanoi,
but the solutions are plastered all over the internet anyways.
So
the test punishes a creative AI for not being a perfect monotonous calculator.
All right.
Critique
two, it is designed to guarantee failure on the harder levels of this testing.
By doing no tool use, Apple didn't give the models tool use, and they couldn't write code,
which is the obvious and the only way that a reasoning model would actually solve the puzzle.
Also, they set these arbitrary limits.
They capped the AI, specifically Claude 3.7 thinking, which is the best model that they used in terms of thinking.
It was, you know, that, DeepSeek.
They capped it at 64,000 output tokens when there is a 128K model available.
And also the absurd scoring, the one-mistake-and-you're-out rule pretty much ensures failure.
All right.
And yeah, receipts.
All right.
So in this, when I'm reading the paper, like, when I'm seeing things that are verifiably false right away,
how can you take the rest of the paper seriously, right?
So, you know, Apple said in their report section A2, we didn't have to go to the bottom for this one.
They said, for Claude 3.7 Sonnet, thinking and non-thinking models, we used maximum generation budget of 64,000 tokens, access through the API interface.
And I literally went through and I looked at the day this was released, the day Claude 3.7
thinking on the API was released.
I went to archive.org.
I got a screenshot.
And yeah, obviously, there's, the max tokens is 128K.
If you can't get the basic things right, why would anyone trust your outputs, let alone
a flawed methodology?
All right.
Three, mistaking intelligence for a flaw.
So they said, uh, giving up is actually a smarter strategy.
So in this case, the AI correctly identified
an impossible brute force task and sought a shortcut.
That's the reality.
And the algorithm failure is a red herring.
It proves the AI is a complex mind, not a simple machine.
All right.
Critique four, well, this wall that they're talking about, it's imaginary because Apple's
wall was just an artifact of their own restrictive rules by cutting down the token output
and restricting tool use.
So yeah, let's look live.
What could go wrong here doing this live?
All right.
So I built a working Tower of Hanoi in Claude 3.7.
So I didn't use Claude 4.
I used Claude 3.7 with thinking.
All right.
So this is a working, verifiable Tower of Hanoi that I just built.
Okay.
So again, I'm not going to take too long to go through this,
but the object is you have to move these three discs from Tower 1 on the left
all the way to Tower 3 on the right.
And you can never have a wider...
So there's, it's kind of like a pyramid, for our podcast audience.
So let's just say there's a skinny, a medium, and a thick, right, all the way
on the left.
So you can move them one by one and you can never set a wider one on top of a skinnier one.
So I'll go ahead and, well, maybe I'll solve this.
I did it earlier and I could solve it correctly.
Right.
So there's a certain, certain number of moves.
All right.
Luckily here I was able to solve it.
All right. So I solved it in seven moves and that is the optimal number.
So according to the study, if you make a wrong move, it's gone.
All right.
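To make that all-or-nothing idea concrete, here is a minimal sketch in Python of a checker that replays a list of moves and fails the whole attempt on the first illegal one. This is only an illustration of the rules just described (one disc at a time, never a wider disc on a skinnier one), not Apple's actual evaluation code. The function name, the 0-indexed towers, and the way moves are written as (from, to) pairs are all assumptions.

```python
def passes_all_or_nothing(disks, moves):
    """Replay `moves` on a fresh puzzle; any illegal move fails the attempt.

    Towers are numbered 0, 1, 2. Discs are integers, larger means wider.
    Each move is a (from_tower, to_tower) pair. Hypothetical format.
    """
    towers = [list(range(disks, 0, -1)), [], []]  # widest disc at the bottom
    for src, dst in moves:
        if not towers[src]:
            return False  # nothing to pick up
        if towers[dst] and towers[dst][-1] < towers[src][-1]:
            return False  # wider disc on top of a skinnier one
        towers[dst].append(towers[src].pop())
    # Solved only if every disc ended up on the last tower, in order.
    return towers[2] == list(range(disks, 0, -1))


if __name__ == "__main__":
    seven_moves = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(passes_all_or_nothing(3, seven_moves))       # True
    print(passes_all_or_nothing(3, [(0, 1), (0, 1)]))  # False
```

The first example is the seven-move solution for three discs. The second puts the medium disc on top of the skinny one on the second move, so the whole attempt counts as a failure.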
So now I can reset this and I'm going to go to 10 discs.
Okay.
And this is, well, actually, no, let me go to 13.
No, I'll, I'll do 10.
All right.
And I'm going to turn the solution speed on it to very fast because I built this thing to have a
solve mode.
All right.
So I can just click solve and we'll see it might take a while.
All right.
So we'll check back on it.
But we're going to see the number of moves that this does.
All right.
So like I said, it might, it might take a while because the minimal solution,
if you get it perfect, is 1,023 moves.
All right.
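For readers following along in text, here is a minimal sketch in Python of what a solve mode like this might do under the hood. It is the classic recursive Tower of Hanoi solution, not the code from the app shown on screen, and the function name and 0-indexed towers are assumptions for illustration.

```python
def solve(disks, src=0, dst=2, spare=1):
    """Yield (from_tower, to_tower) moves that solve the puzzle optimally."""
    if disks == 0:
        return
    # Move the top disks - 1 discs out of the way onto the spare tower.
    yield from solve(disks - 1, src, spare, dst)
    # Move the widest remaining disc straight to the destination.
    yield (src, dst)
    # Bring the smaller discs back on top of it.
    yield from solve(disks - 1, spare, dst, src)


if __name__ == "__main__":
    print(len(list(solve(3))))   # 7
    print(len(list(solve(10))))  # 1023
    print(len(list(solve(13))))  # 8191
```

A few lines of code produce every move for 10 or 13 discs, which is the kind of thing a model can do when it is allowed to write and run code instead of typing out each move by hand.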
Well, actually, we can wait because it's already at about 400.
So you'll see here for our podcast audience, this is literally going through this game
step by step.
The one that Apple researchers said is
not possible. You'll see here, I'm not a computer scientist, right? I'm just a pretty smart person
that knows the difference between marketing and research. And this Apple paper is marketing.
It's not research because you'll see silly old me, random guy here, right? I mean, I'm not a random guy,
but, you know, I just built something in Claude 3.7 Sonnet with thinking
to pass this, right?
Just for fun, I'm going to do this 13 one.
Yeah, optimal moves 8,000.
All right, we're not going to sit and watch this,
but maybe we'll check in on it at the end,
and then we'll talk about the number of tokens.
This is the example I gave.
The number of tokens exceeds 64,000.
So, let's go.
All right.
So we'll check on that maybe at the very end.
So let's go through our critique, our end here.
And I think the real motive here, I'm sorry to use this word, but it is what it is.
This is corporate propaganda.
This paper isn't about science.
It's a brutal corporate strategy.
This is a textbook case of weaponizing research to confuse the public
and to hopefully distract stock analysts enough to where you don't lose hundreds of billions of dollars in market cap,
because you're not pursuing AI at the rate at which you should.
That's exactly what Apple is doing here.
Because if Apple truly believed that they had sound research,
which they don't in this case, this isn't sound research,
they would have invited researchers from other companies,
competitive companies.
That's what people do.
Or, you know, partner companies from multiple outside research organizations.
Apple didn't do this.
Because this is marketing, right?
And no researcher at a prestigious university would have ever put their name on this study.
It's not sound.
There are more holes in this thing than Swiss cheese.
And this really just fits Apple's pattern of cynical research because this is not the first time they've done it.
They've done it multiple times, right?
They put out these papers that are essentially downplaying AI's
impact while they're still scrambling to figure AI out.
And this provides cover for their own AI weakness and their longstanding failure of
Siri.
I mean, let's talk about this.
How much has Apple invested in Siri?
Countless amounts.
Yet OpenAI, Google, and other companies, and even little startups, have
smart AI assistants like Siri that run
laps around Siri.
This is just Apple's pattern of failure.
And then we have to talk about this premeditated media strike, right?
Are you going to tell me who's actually believing this, right?
That this comes out hours before Apple's big WWDC announcement where, oh, we're actually
not really announcing anything revolutionary when it comes to AI.
Oh, well, hey, did you see our research paper?
This whole AI thing, we're not sure about it.
Look at this research.
The research paper is useless, right?
Go read my iPhone reports where I take my phone in a cave, you know, in a dark room and I write how the camera doesn't work.
And I cover the flash, right?
Now, anyone with a brain who knows the basics of AI and takes the time to analytically read this report knows that this thing, ultimately, this is
damage control. This is PR. This is marketing. This is a pre-buttal designed to discredit the entire
field right before they underwhelm. The FUD, right, the FUD strategy. So the whole point was to
dampen, you know, with the fear, uncertainty, and doubt. It's just to dampen the competitor hype
and to make breakthroughs from Google and OpenAI seem like an illusion. And Claude,
you know, Anthropic, and to lower expectations for themselves. And this is to posture themselves
as skeptics, not laggards, and a desperate move, I think, from a company that is playing from
very far behind. So final verdict, as we wrap this up, right? Darn, my Tower of Hanoi 13 disc
may not finish in time unless I really draw this out, which I'm not going to do. This isn't a
research paper. It's not. This is cherry-picked science
at best that is meant to deceive the public.
And Apple accomplished that, right?
There's going to be clapback, right?
Because the scientific AI research community, they're pissed.
Go take a look on Twitter.
Go take a look on Reddit.
Like, researchers are not happy about this because essentially what Apple did by, you know,
kind of more or less saying that all previous research, the data is contaminated.
And then they do this, you know, little board game test where all the results are online anyways.
Like, researchers are not happy about this because this was a slap in the face to them.
And essentially, like essentially Apple was kind of invalidating some great research or trying to anyways by saying, oh, all prior research was invalid.
And actually us at Apple, we're just going to now set the tone
and say, hey, these large reasoning models that are actually revolutionary,
that are out there literally curing diseases,
finding new drug discoveries, they're not that good.
They're actually bad.
It's an illusion.
Don't worry.
Trust us.
We're Apple.
But it's scientifically illogical.
Like, it's a biased test with a predetermined outcome.
I could be wrong there, but I think there's a reason that Apple didn't show their work.
Right.
Because then people would have quickly picked this apart before their WWDC announcement.
And I'm telling you, this is not over.
Whether it's in three months or three years, this thing is going to unravel.
And there is going to be well-rounded, scientifically sound research that takes this, quote, unquote, study, this piece of Apple marketing and just puts it in a shredder.
Right.
And Apple will
take a huge black eye from it, and rightfully so.
This is strategically deceptive,
a ruthless and cynical marketing play,
and the real illusion
inside this illusion of thinking paper,
the real illusion is the paper itself.
Period.
All right. I hope this is helpful, y'all.
That's it.
I lied again. I said like 30-ish minutes.
We went 50. I'm sorry, y'all.
I really wanted to do as much as I could to provide you depth because unfortunately what happens
a lot of times in this space when a big company comes out with a paper, you know, right around
an important event.
Sometimes the media just blindly writes about it.
And that shapes the public discourse.
And not everyone has a super watchful eye and can break this down at a really granular level
with context and tell you what it actually means.
So I hope this was helpful.
All right.
Little hot take Tuesday, extra spice for you.
If you haven't already, please go to YourEverydayAI.com.
Sign up for the free daily newsletter.
We're going to be recapping the short version of this show.
So maybe you weren't able to listen to it all.
That's okay.
It's going to be in the newsletter.
So make sure you go to YourEverydayAI.com.
Sign up for that.
Thank you for tuning in.
Hope to see you tomorrow.
And every day for more Everyday AI. Thanks y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com
and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
