Everyday AI Podcast – An AI and ChatGPT Podcast - EP 543: Apple’s Weaponized Research: Inside its illusion of thinking paper

Starting point is 00:00:00 This is the Everyday AI Show, the Everyday Podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Meet Firefly AI Assistant, now live in Adobe Firefly, the All In One Creative AI Studio. Just describe what you want to create and the assistant handles the rest, orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome. The assistant accelerates execution. Apple's latest AI research paper has gone viral.

Starting point is 00:00:51 So viral, actually, it showed up in my wife's nightly business newsletter, one she reads that usually has absolutely nothing to do with AI. So Apple's The Illusion of Thinking paper shows evidence, well, they say, that large reasoning models slam into a wall the moment that tasks get too demanding. So sounds pretty fatal for AI, right? Maybe. But if you dig deeper, you'll find flawed logic, cherry-picked testing, and an all-or-nothing grading rule that would flunk Einstein. In other words, if you take enough time to deconstruct this study, you'll find it's not much of a study at all. It's marketing from Apple. And you shouldn't fall for it. So stick with me for the next 30 minutes or so.

Starting point is 00:01:45 And I'll expose this quote unquote research paper for what it is. It's strategic deception. It's cherry-picked science and it's weaponized research at best. All right. Hope you're excited for this one. I am. If you're new here, welcome to Everyday AI. My name is Jordan Wilson.

Starting point is 00:02:10 I'm the host, and we do this every single day. This is your daily live stream podcast and free daily newsletter, helping us all not just keep up with the world of AI, but how we can use all this information to get ahead to grow our companies and our careers. So sometimes the information like today's show can be a little confusing, and that's what we do. We break it down, whether it's myself or bringing on world-class experts. We do it every single day, and then we break it down in our free daily newsletter.

Starting point is 00:02:35 So if you haven't already, please go to youreverydayai.com and sign up for that free daily newsletter. We're going to be recapping today's show and a whole lot more, everything you need to stay in the loop and stay ahead and be the smartest person in your company. And if that's what you're trying to do, then you are definitely in the right place. So some days, we start off by going over the AI news. I don't want to make this an accidental, like, 50-minute podcast. I'm actually trying to keep these things under 30 minutes. But we'll see, because at least for today, it's Hot Take Tuesday, and I got takes. Livestream audience, it's good to see you.

Starting point is 00:03:12 Thanks for tuning in. Let me know. Should I take it nice? Should I ramp it up? I'm kind of feeling spicy. I hope that's okay with you. But let's just get into it. All right.

Starting point is 00:03:27 Let's deconstruct this paper. The Illusion of Thinking. All right. Like I said, it's been grabbing a lot of headlines recently. And let me also put my cards on the table, right? Because you may be thinking, okay, this Jordan guy, you know, he's obviously very pro AI.

Starting point is 00:03:52 Am I? Sure. Yeah. You could say that. If I'm being honest, I'm pro AI because I feel there's no real choice, right? I believe in large language models, the power of large language models, just the way the entire world is investing in them. There's really no other solution. And one other thing, I want to talk very quickly about my background.

Starting point is 00:04:18 So I mentioned it a couple times in our, you know, 540 plus episodes together. But I started my career as an investigative reporter. I did okay. I was a Pulitzer Fellow. I won ACP story of the year. So when I look at these things, I don't just read the study. Yes, I read the study manually twice. I fed it to three separate large language models. I combined a bunch to have conversations with the paper. So I did it old school manual,

Starting point is 00:04:48 you know, and then using AI as well. So I want you to know, I'm not just blindly ever following any study, what it says or what it doesn't say, whether I have a preconceived notion on if I agree with it or not. But essentially what this study said is, hey, these large language models that reason, or what they call large reasoning models, they don't really think, right? This whole thinking thing, it's an illusion. And I'm actually very excited to break this one down, but I'm hoping to do it in a very concise way. So if I go on fewer tangents today, that's probably why. So let's just start at a glimpse. All right. So maybe you don't have 30 minutes. Maybe you have five. Well, let me spend the next two minutes,

Starting point is 00:05:33 just giving this to you at a glimpse. Here's what's happened. And then for the rest of the episode, I'm going to lay it all down. Lay it all out for you. So, you know, I'm not going to keep you captive here just for 20 minutes just to get to what's actually happened. Okay, so Apple just released, about four days ago, its Illusion of Thinking paper. And this was three days before their big conference, the Worldwide Developers Conference. And in this paper, they publicly claim that advanced AI reasoning is more or less fake. They're saying that these large reasoning models don't actually think. Their entire experiment was fundamentally rigged. All right.

Starting point is 00:06:11 And I'm going to show you why and show you how. But essentially, they said that AI reasoning models couldn't use code. What? Okay. Which is the single most effective way to solve the problems that the researchers were giving them. So, you know, the researchers are like, hey, here's all these problems. And normally a large language model would be like, yeah, I'm going to use code, right? I'm going to use the tools at my disposal.

Starting point is 00:06:34 But the Apple researchers said, nah, you can't. They also misrepresented the AI's intelligent decision to give up on impossible brute force tasks as a reasoning collapse, which I would say is not the case. And they were treating that as a feature versus a bug. Next, Apple failed to disclose that their hardest tests were technically, physically impossible for the AI to pass due to its handcuffed token limits. Yeah, more on that in a bit. And y'all, y'all know I bring receipts, all right? Also, the paper's timing reveals its true purpose, a strategic media strike to distract

Starting point is 00:07:21 from Apple's own AI weakness right before WWDC and their lack of AI. Because everyone knows it's no secret. Apple has failed. And I think this will probably go down as the biggest failure in business history, Apple's absolute failure to put together anything resembling artificial intelligence. Maybe that's why they called it Apple Intelligence, because they couldn't actually figure out artificial intelligence, right? And this also, this wasn't a good faith scientific study. It really wasn't.

Starting point is 00:07:52 It was a calculated act of corporate deception disguised as research. And I'm not blaming this on the researchers, right? I'm sure that there were some higher ups that were pulling some strings or maybe that, you know, passed this down like, hey, we need to, you know, get some research that's very hard on these large language models, these reasoning models. All right. So that's what we're going to be going over. But aside from what I just laid out, which we're going to go into in more depth, I want to talk. There's at least two trillion other reasons Apple is putting out this quote unquote paper. All right. And I have taken my time, mainly because I do our hot takes on Tuesday and this paper came out, I believe it was a Friday or Saturday. But there's two trillion reasons and no one's talking about

Starting point is 00:08:41 this. Why is Apple doing this? Why has Apple put out multiple papers that literally go against the power and the capabilities of large language models and AI? Well, one, I kind of already answered: they can't figure it out. But here's two trillion reasons why. All right. So this is for our podcast audience. I do have some visual slides on today's episode. I'm going to do my best, as I always try to do, to describe them to you.

Starting point is 00:09:12 But you can always check out the show notes or go to our website and watch the video version. So then you can see what I'm sharing on my screen. But essentially, pre-generative AI, and this was in 2021, Apple was crushing the world. This was like '92 Dream Team kind of dominance for Olympic basketball. It wasn't close. Apple had a $2.1 trillion market cap in 2021. And the next closest company, Microsoft, had only a $1.6 trillion market cap. Now, I'm not the best at math, but that's not close.

Starting point is 00:09:56 That was a $2.1 trillion market cap versus Microsoft's $1.6 trillion market cap. So they had a half trillion dollar lead on the next biggest company in the world, which is not even close. They were blowing out the competition. Like I said, this is '92 Dream Team. This is, you know, '97 Bulls, the 72-and-10 right there. No one's close. It is a blowout. Apple is blowing out the rest of the world in terms of we are the biggest, we are the best company and it's not even close. Fast forward to today. Yeah. Apple is the third biggest company in the U.S. by market cap, right? And now they are a half trillion dollars behind Microsoft, right, which I would say,

Starting point is 00:10:48 you know, depending on how you look at it, you could say it's Microsoft, you could say it's Google. I would probably say Microsoft is Apple's closest competitor, right? Because everyone's kind of changing in terms of where the revenue is coming from, you know, where they're trying to compete, et cetera. But you could say historically that Microsoft and Apple have been the two competing with each other. So let me say that again. In 2021, Apple was blowing Microsoft out, all right? To the tune of a half trillion dollars. Now Microsoft is blowing out Apple. And if you took the same growth rates, if you take the growth rate that Microsoft had from 2021 until today, going from about a $1.6 trillion market cap to a $3.5 trillion market cap, if Apple stayed on a similar

Starting point is 00:11:36 or that same growth trajectory as Microsoft did, do the math, y'all, that means Apple would be at about a $5 trillion valuation. Instead, they're stagnating at $3 trillion. So essentially, if they would have made the similar moves that Microsoft did, presumably they would be a $5 trillion market cap company. So you can make the argument that they've left $2 trillion in market cap on the table by not figuring out AI. And it's not for not trying, right? We've seen a lot of reporting going back multiple years, a report from 2023, which I remember covering the day it came out on the Everyday AI

Starting point is 00:12:20 show. Yeah, we've been doing this thing for a while. And it said Apple is reportedly spending millions of dollars a day training its AI. And Apple internally at the time said that their internal model, which was code-named Ajax, and it did come out under a similar name. They said it is the most advanced language model and it is more powerful than ChatGPT. All right. Imagine spending millions of dollars a day just on training AI models if you're Apple. And when you finally, quote unquote, released it, you didn't even say it by name in the main keynote. It was almost like Apple was embarrassed by the language model that they released at last year's

Starting point is 00:13:00 WWDC. It was a small language model that lived on device. Its Ajax model. They didn't even say it by name in the main keynote. Right. Because if they would have, it would have been embarrassing. Right.

Starting point is 00:13:13 So it's almost like they didn't want to claim it because they had reportedly spent many, many, many millions of dollars. And by many millions, at that point, I mean, yo, millions of dollars a day back in 2023, do the math. That's potentially hundreds of millions or billions of dollars that they spent on AI that just didn't work. And like I said, Apple is the only big tech company that has failed to produce the most basic of AI offerings. Apple's produced nada. Nada that works at least.

Starting point is 00:13:50 All right. So some headlines here from some different publications like PYMNTS, Axios, Bloomberg, PC Magazine, Computerworld, The Verge. Let's read some of these headlines, shall we? This is a crisis. New Apple report claims we'll get no Siri upgrades at WWDC due to AI turmoil. Apple's AI headaches could lead to lukewarm revenue growth. Drama at Apple as AI failures cause heads to roll. Apple sued for false advertising over Apple Intelligence. Why Apple still

Starting point is 00:14:29 hasn't cracked AI. Two more class action lawsuits target misleading Apple Intelligence claims. Yeah, Apple's rollout of AI was so bad that they are now facing multiple class action lawsuits because they couldn't deliver the simplest version of AI that they promoted, right? And I'm technically one of those people, right? I have to be honest. I'm recording this on an Apple Mac mini. The camera I'm using for the live stream here, it's the new iPhone. And one of the reasons I bought this new iPhone is because they're like, oh, we're going to have all this new AI on the iPhone. And here it is almost a year later. There's not a single thing on this iPhone that's quote unquote AI. There's not. Right? Like, I'm looking for it. I'm like, hey, Siri, find me the AI. And Siri's, you know,

Starting point is 00:15:18 10 minutes later, would you like me to use ChatGPT for this query? So yeah, Apple has fumbled the bag harder than any company has ever fumbled the bag, I would say, from a business perspective. Because when you think of the numbers, I don't think that's an exaggeration, because even though it's a hypothetical scenario I laid out, it was probably a realistic scenario that Apple should have traveled that path. They should have grown at the same rate that Microsoft grew over the last four years because

Starting point is 00:15:48 of generative AI, but they didn't. They didn't, but they should have. A multiple trillion dollar market cap mistake from Apple, which would very likely, I think, qualify as the biggest business blunder ever. And it's probably not even close. So all of Apple's competitors have been cashing in on AI, and Apple is still failing. So with trillions of dollars on the line, Apple needed a red herring. The claim that AI reasoning

Starting point is 00:16:21 is an illusion, right? Because all these other companies, even though Apple has their own edge AI, these are small language models that live on device. They don't have a large reasoning model. So essentially what's happening here is all these other companies are running away, you know, getting insane revenue from their AI offerings. And Apple's like, hmm, what if we just throw some deception and doubt and confusion in the ring here, right before our big event, right? So then people will not be mad at us if we don't release anything AI at WWDC this year. So that was yesterday, on Monday. Apple had their WWDC event where they essentially took a quote unquote gap year. They reportedly took a gap year on AI. They didn't really

Starting point is 00:17:11 release anything new. Whereas last year at their WWDC, they said AI every three seconds. They actually rebranded it because it's Apple. They're like, oh, it's not even artificial intelligence. It's Apple intelligence. Our AI is better than AI, right? And here we are a year later. And they're like, whoops, we're getting sued. We couldn't deliver.

Starting point is 00:17:30 So let's take a gap year. And instead, let's create some confusion. Let's get a huge viral study. Let's throw a, you know, a big smoke screen in front of everyone, cause some chaos. And then maybe people will temporarily forget that we stink at AI and that we haven't been able to deliver on our promises. And maybe shareholders will look at this study and be like, oh, smart Apple.

Starting point is 00:17:54 Yeah, look. This great research shows that these large reasoning models don't work. Good thing. Apple hasn't figured it out. Wrong. So let's actually deconstruct this thing. Let's take it down. All right.

Starting point is 00:18:12 On my screen, I'm showing you the difference between Apple's quote unquote study, which is on the left. and what I think a real study should look like on the right. All right. And I've talked about the one on the right. Apple's study, quote unquote, is just Apple researchers, all right, which is not abnormal. Okay. I'll say this.

Starting point is 00:18:42 And I'm not saying this in a, how do I say this? Like a lot of people are throwing some shade at some of the Apple researchers because they're technically interns. I'm not going to do that because that's technically normal. Right. So when PhD candidates in computer science, right, are looking to complete some meaningful research, you know, a lot of times big tech companies will hire them on as interns or they were already interns there to begin with.

Starting point is 00:19:07 So I'm not going to go down that route because these people are very capable. But one thing I want you to look at, it's all Apple researchers. And that's it. And like I said, That's usually only normal when you are announcing a new model and you put out a paper around a new model, right? Otherwise, good research that changes the conversation on artificial intelligence would usually look like the paper on the right. This paper is personhood credentials. All right.

Starting point is 00:19:37 This was a pretty meaningful research paper that changed the narrative, or at least tried to change the narrative, on, you know, AIs that are trying to, you know, imitate humans. This is what research looks like. Because on this piece of research, you have researchers from multiple big companies. You have them from OpenAI, Harvard, Microsoft, University of Oxford. You know, a lot of other ones, my screen is actually a little blurry here. But it's from dozens of companies and universities

Starting point is 00:20:18 That's what a normal research paper looks like, right? You would see multiple companies, multiple research institutions. On the left, that's what marketing looks like. Only Apple researchers, nothing else. All right. Real quick, got to take a quick break for a word from our sponsors. This podcast is supported by Google. Hey everyone, David here, one of the product leads for Google Gemini.

Starting point is 00:20:51 Check out Veo 3, our state-of-the-art AI video generation model in the Gemini app, which lets you create high-quality, eight-second videos with native audio generation. Try it with a Google AI Pro plan or get the highest access with the Ultra plan. Sign up at Gemini.com to get started and show us what you create. All right, let's get back into it and let's break down this paper a little more. And I'll tell you this, the paper is out there. It doesn't take long to read. And I think enough people by now have already gone through the more technical side of this paper line by line. So I'm just going to focus more on some big picture ideologies and methodologies that were completely elementary and just defied logic. Not in a good way. Right. So let's start with Apple's premise and these kind of flawed benchmarks. Okay. So Apple claimed the need for this test, right? Like, why would they even come out with this research? They argued that standard AI tests for

Starting point is 00:21:59 math and coding are unreliable due to data contamination. Data contamination is like kind of saying, like, hey, all these other, you know, studies that all these other researchers do from multiple companies, multiple universities. Yeah, they got it wrong because, you know, their data's bad. That's not good for a researcher. And that's why I also don't think that this is going to turn out well for Apple, because they essentially just kind of slapped a bunch of researchers silently in the face and said, yeah, your research is rubbish because you didn't even know that your data is contaminated. Not a good look. All right. So they said that every other single, you know, research paper, you know, it's just contaminated data. And

Starting point is 00:22:39 these models are essentially just memorizing. They're not even reasoning. They're just remembering, right? So it's a valid concern here. So, okay, we're still fine. And their proposed solution was to create a clean and controllable environment to test what they said were the true, unvarnished reasoning limits of modern AI. All right. Sure. Let's see what you got, Apple. So they came up with their kind of reasoning lab. They built what they said was a sterile testing environment using four classic logic puzzles, framing them as pure tests of logic. And each puzzle was paired with a simulator, an automated referee that checked every single move the AI made and immediately flagged the first illegal one, ending the test

Starting point is 00:23:29 with a failure. So if a model got a single move in any of these four puzzles wrong, test over, failure. So not good. That is hyper-strict, unforgiving. That's not how large language models, especially reasoning models, would generally work. But okay, sure, Apple, do your thing. Not making sense, but let's keep going. I do want to talk specifically about one of these kind of logic games that they used, the Tower of Hanoi. So this is a very classic game.

Starting point is 00:24:04 And also all of these games are classic, which was already disproving the point Apple was trying to prove, because they said all these other benchmarks out there were contaminated. So they thought, like, oh, well, we can use a game like Tower of Hanoi that's, you know, non-deterministic because it's a game. Wrong. All the solutions, the algorithm, everything about this Tower of Hanoi is on the internet. It's in the training data. So, like, even their original reasoning for creating their games to test these reasoning models was absolutely bonkers. Like, no, it's already wrong. You're already wrong and we haven't even

Starting point is 00:24:50 started. All right. So this was their thought. So the Tower of Hanoi is a classic computer science problem. You have to move disks between pegs, never placing a larger disk on top of a smaller disk, and there's kind of three towers. All the disks start on the left tower and you have to ultimately move them all the way over to the right tower, with the largest disk on the bottom and the smallest disk on the top. So, you know, if there's only three disks like this example I have on the screen, it's not terribly hard, right? But as you add more disks, there's more complexity. So, you know, as an example, they gave games like this, and there's three other ones. Let's just talk about the Tower of Hanoi.
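For readers who want to see the puzzle concretely, here is a minimal sketch of the textbook recursive solution in Python. It is an illustration only, not code from Apple's paper: the function name `hanoi_moves` and the 0/1/2 peg numbering (left, middle, right) are assumptions made for this example.

```python
def hanoi_moves(n, src=0, dst=2, spare=1):
    """Return the shortest list of (from_peg, to_peg) moves for n disks.

    Pegs are numbered 0 (left), 1 (middle), 2 (right); this numbering is
    an assumption for the sketch, not the paper's move format.
    """
    if n == 0:
        return []
    # 1) Park the n-1 smaller disks on the spare peg.
    # 2) Move the largest disk straight to the destination.
    # 3) Bring the n-1 smaller disks over on top of it.
    return (
        hanoi_moves(n - 1, src, spare, dst)
        + [(src, dst)]
        + hanoi_moves(n - 1, spare, dst, src)
    )


if __name__ == "__main__":
    for step, (a, b) in enumerate(hanoi_moves(3), start=1):
        print(f"move {step}: peg {a} -> peg {b}")
```

With three disks the list has seven moves, and each extra disk roughly doubles its length, which is why the complexity climbs so quickly as disks are added.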

Starting point is 00:25:35 And then they gave a system prompt and then a prompt to different reasoning large language models, and then they had them output their answer in text form, right? And then they had a simulator that essentially double-checked all of the AI models' results. Okay. Sure.
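The referee idea described above, replaying the model's moves and stopping at the first illegal one, can be sketched roughly like this. This is a hypothetical stand-in, not Apple's simulator: the `referee` function, its return shape, and the (from_peg, to_peg) move format are all assumptions for illustration.

```python
def referee(n_disks, moves):
    """Replay Tower of Hanoi moves with all-or-nothing grading.

    Returns (passed, first_bad_index). Grading stops at the first illegal
    move; a fully legal sequence passes only if every disk ends on peg 2.
    """
    # Each peg is a stack; the end of the list is the top. Disk n is largest.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and bool(pegs[src])  # there must be a disk to pick up
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])
        )
        if not legal:
            return False, i  # one bad move means the whole attempt is a zero
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n_disks, 0, -1))
    return solved, None


if __name__ == "__main__":
    good = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(referee(3, good))              # legal and complete
    print(referee(3, [(0, 2), (0, 2)]))  # second move puts disk 2 on disk 1
```

Note the design choice this makes visible: a sequence that is correct for thousands of moves and wrong on the last one scores exactly the same as one that fails on the first move.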

Starting point is 00:25:57 They also did checker jumping, river crossing, and blocks world. So let's talk about the actual models. So they tested thinking versions of DeepSeek R1 and Claude 3.7 Sonnet. They did a lot of more technical testing. They technically tested some OpenAI models, but OpenAI doesn't show the complete chain of thought, whereas DeepSeek R1 and Claude 3.7 Sonnet thinking do in the API.

Starting point is 00:26:24 So, you know, they did the right thing there, right, by making those the baseline models, and they also tested them against the non-thinking versions of themselves, which just adds some complexity. That's not even what we're doing here. But like I said, the scoring system is absolutely brutal. because if you make one wrong mistake from getting it perfect, it's a zero. So there's obviously when you talk about these puzzles, they're extremely complex, right?

Starting point is 00:26:51 And there's many different ways that you can solve them. But also you have to think of the context window of these models, right? And also the output limit for tokens, which we're going to talk about here in a second. So this part is crucial. All right. Because, well, let's actually first look at the results. So the results from what Apple reported, they said on easy puzzles, standard, or non-thinking, models did better. On medium puzzles, these reasoning or thinking models excelled.

Starting point is 00:27:30 And then on hard puzzles, they said, all models completely failed. They didn't even try. Right. And this is what they called the effort collapse. So this is where you saw, and as a former journalist, when I read this on Saturday, I'm like, oh, gosh, the media is going to get, because I saw it literally once it came out, right? Because it was trending on Twitter right away, because you saw all these, you know, headlines like, oh, you know, reasoning models collapsing,

Starting point is 00:27:56 you know, the AI wall, right? Like all these AI doomsday articles. And like, I'm reading this and I'm like, oh, gosh, like the media is going to completely fall for this, right? Being a former journalist, nothing against, I, like, I go to all these conferences. I meet brilliant tech journalists. And then there's some that, you know, are overwhelmed and you get all these press releases and you're like, okay, this is a salacious headline. Okay, it looks factual. It's a research paper. Sure. Let's go with it. Right. It's going to click. Right. We're going to get clicks. Look at these headlines we can put on this thing. Right. So that's kind of what happened. And, you know, they talked about this effort collapse. And that was their headline finding, that on the

Starting point is 00:28:34 hardest puzzles, the thinking models essentially would think less or even just give up, generating fewer words before failing. So they just said, oh, reasoning models give up. And then also the algorithm failure. Their supposed killer blow was in a separate test. They gave the models the step-by-step instructions, or the algorithm, and it didn't help. And they all still failed at some point. So this is Apple's conclusion that, well, they just failed, right?

Starting point is 00:29:06 And I have a graph here. I'm not going to spend five minutes to explain it. But this just shows the complexity for the Tower of Hanoi example and the number of disks. So, you know, the more disks in that example, the much harder it gets, right? I can solve it with three. I could probably solve it with four, but I don't have time to waste. You know, to solve it with anything more than that, you've got to either, like, have a computer science, math, like crazy logical brain. Or you have to just study this game, right?
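The growth that graph describes follows from a standard result about the puzzle, not from anything specific to the paper. Writing $T(n)$ for the minimum number of moves with $n$ disks (the symbol is chosen here just for illustration), moving $n$ disks means moving $n-1$ disks out of the way, moving the largest disk once, and moving the $n-1$ disks back on top:

```latex
\begin{aligned}
T(1) &= 1, \\
T(n) &= 2\,T(n-1) + 1 \quad (n \ge 2), \\
\Rightarrow\ T(n) &= 2^{n} - 1 .
\end{aligned}
\qquad
\begin{array}{c|rrrrr}
n    & 3 & 10   & 13   & 15     & 20 \\ \hline
T(n) & 7 & 1023 & 8191 & 32\,767 & 1\,048\,575
\end{array}
```

So every added disk doubles the work plus one, and by 20 disks the shortest possible solution is already over a million moves.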

Starting point is 00:29:38 It's how some people can do the Rubik's Cube, you know, while juggling in 10 seconds while, you know, spitting fire, you know, or whatever these, you know, incredible acts of, you know, athleticism and brainpower people do. But, you know, for the most part, the average human might be able to solve this at, you know, four disks, five disks. If you're a genius, maybe longer. But a human's not solving this at eight disks, at nine, ten. Definitely not there, right? So essentially, it's not surprising necessarily that an AI couldn't, right? Because if you get the smartest humans in the world and give them a 15 disk, are they going to be able to do it? I don't even know if it's possible, right? Anyways, let's look a little bit here about what this actually means from an output token. That's important because the models they chose, aside from the fact they didn't allow them to use code, which, come on, they also surprisingly said that they only chose the models with a 64K

Starting point is 00:30:49 token output limit. All right. That's important to talk about because one of the requirements that the models had to do in the output. So Apple said that they weren't counting thinking tokens. That's not usually how it works. So that, you know, kind of chain of thought processing, which is a lot of what's happening under the hood.

Starting point is 00:31:06 But they did require the model to spit out every single move. And to put out a move, it's actually kind of complex. It's not like B1. It's not like chess. You know, I don't know chess, but it's not like B2 to D2. Right. One move can be very complex and can eat up a lot of tokens. So conservatively, right, I looked at the actual example moves that they gave.

Starting point is 00:31:29 They didn't obviously share their whole findings. It was very little, and I saw that most moves were 10 to 12 tokens. So conservatively, a 13 disk, right? A 13 disk problem of this Tower of Hanoi would require 65,000 output tokens. I'm going to repeat that. The study was not possible, right? They did it all the way up to 15, 20 disks. Can't do it. If you require the model and Apple, hey, Apple researchers, next time, do what smart researchers do. Yeah, I'm getting mad because I read a lot of research papers.

Starting point is 00:32:17 And this one I knew was marketing and that made me upset, right? Not just, right, because I do this every day, but because there's a scientific community that I think is disgusted by this, and rightfully so. This was a haphazard, terrible study. Let me just say, let me just say, like, how it actually is. This is a terrible study. They didn't share any of their actual results.

Starting point is 00:32:45 They said, here's this system prompt. Here's an example of a prompt. And here's our overall outputs, right? You need to share. Share exactly. Here's what the chain of thought said. Here was, you know, on the hardest,

Starting point is 00:32:57 on an eight, on a 10 disk. Here's what the output was. But going by how large language models work and the requirements in their own paper, they would have to output every single move. And if, and I'm being ultra conservative here, to solve a 13 disk would take more than 8,000 moves. And to be able to spit those all out as required by the system prompt and the example in the system prompt, it's not possible. 65,000 tokens.
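The back-of-the-envelope math here can be checked with a short sketch. Every number in it is an assumption taken from the discussion above rather than from the paper itself: roughly 10 tokens per printed move (the low end of the estimate), a "64K" output limit treated as 64,000 tokens, and no tokens spent on anything except the move list. The function names are hypothetical.

```python
def min_moves(n_disks):
    """Shortest possible Tower of Hanoi solution: 2^n - 1 moves."""
    return 2 ** n_disks - 1


def output_tokens(n_disks, tokens_per_move):
    """Tokens needed just to print every move of a shortest solution."""
    return min_moves(n_disks) * tokens_per_move


def largest_printable(limit, tokens_per_move):
    """Largest disk count whose full move list fits in the output limit."""
    n = 0
    while output_tokens(n + 1, tokens_per_move) <= limit:
        n += 1
    return n


if __name__ == "__main__":
    LIMIT = 64_000        # assumed meaning of a "64K" output cap
    TOKENS_PER_MOVE = 10  # assumed; low end of the 10-to-12 estimate
    for n in (10, 12, 13, 15, 20):
        need = output_tokens(n, TOKENS_PER_MOVE)
        verdict = "fits" if need <= LIMIT else "over the limit"
        print(f"{n:>2} disks: {min_moves(n):>9,} moves, {need:>10,} tokens ({verdict})")
    print("largest printable:", largest_printable(LIMIT, TOKENS_PER_MOVE), "disks")
```

Under these assumptions 13 disks already needs 8,191 moves and about 82,000 tokens, so the full answer cannot fit; even at a leaner 8 tokens per move it comes to 65,528 tokens, which is where a figure of roughly 65,000 would come from.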

Starting point is 00:33:34 Okay. So Apple, you literally designed a test that you knew was going to fail at a certain point of complexity, at least according to kind of the laws and the math set forth. So their conclusion at least, well, reasoning models are an illusion. and the thinking that we see is a trick, right? They're not actually thinking. You know, they're just, they're just doing next token prediction. You know, it's stage one thinking, not stage two, right? I'm going to have a whole episode on this at some other point.

Starting point is 00:34:12 The concept of reasoning, right? In, like, stage one and stage two thinking, right? So what is reasoning, right? I'd like to say it's just connecting stage one thinking anyways, right? Stage one is quick, intuitive, automatic responses based on pattern recognition, learned from data. That's stage one.

Starting point is 00:34:46 And then stage two represents a more deliberate, analytical, or conscious approach that involves reasoning and planning. So you could say the same thing about, you know, non-reasoning models and reasoning models. Non-reasoning models, right? These are the faster ones. These are, you know, pattern recognition. But all stage two reasoning thinking is, it's just stage one, but slower.

Starting point is 00:35:13 Right. So I don't know. Like even the concept of arguing against reasoning models seems a little bit illogical when it's just really made up of stage one thinking anyways. It's like, what is reasoning? It's not a different language. You're just taking more time doing stage one thinking, right? Pattern recognition.

Starting point is 00:35:34 That's all reasoning is anyways in my head, right? I don't touch the stove because it's hot, right? But I've learned those different things that lead me to make that reason or to, you know, think or plan ahead in a certain way, right? If I'm planning for a big show like this, I spent many hours planning this show. I'm using stage one thinking, right? That's literally what I'm doing. Pattern recognition.

Starting point is 00:36:04 I've done this so many times. I recognize patterns. I put them together, right? That's planning. It's thousands or millions of neurons firing off in our brain. That's just quick, intuitive, automatic responses based on data and pattern recognition. That's all it is.

Starting point is 00:36:23 Anyways, I'll save that for another day, another show. So let's get back to this Apple study, right? The other thing is Apple, not only did they cook the books beforehand. Sorry, you did. Unless you actually share the data and we can make an assumption otherwise or we can make a connection otherwise. If we go by the math, if we look at exactly what happened and the fact that they literally decided the way that we're going to measure reasoning. Well, they said the data is contaminated. So they had this brilliant idea.

Starting point is 00:36:59 Let's use a non-deterministic game. It's already all on the internet anyways. It's already in the training data. So you're already wrong to begin with. And you cherry-pick, right? This is almost like they got results, and it then seems like they just reverse engineered the entire study. Right.

Starting point is 00:37:21 I'm not saying they did. but in theory that could have happened. The study makes no sense. Go read it for yourself two or three times. And then go talk with a large language model. Don't lead a large language model. Just ask, does this make sense? Or ask your own self.

Starting point is 00:37:36 Does this make sense? Anyways, I have an example study here, an iPhone study. All right? So the Apple researchers, you know, they went through 25 rounds. Let's say I get 25 new iPhones. And I turn off cellular data and I turn off Wi-Fi, I turn off Bluetooth, I turn off everything, but there's a new feature on iPhones called SOS, and it uses satellite.

Starting point is 00:38:03 All right. And then I go on vacation, and I'm on satellite mode, and I'm testing the phone. I'm testing, but I'm only testing a couple things, you know, just like Apple did. I'm just going to test, you know, FaceTime and phone calls and getting on social media and using, you know, ChatGPT and Google Gemini on my phone. That's what I'm going to test. Okay. And then, well, turns out it doesn't really work very well. So now, instead of coming up with a specific report that says, hey, I'm

Starting point is 00:38:46 reviewing this SOS satellite feature, which just sends messages to emergency response services. Instead, I'm going to say, I'm going to put out, well, it's factual, right? I can put out the facts. I can say, hey, here's what I did. And then at the very end, I'm going to say, hey, I restricted, you know, Wi-Fi and Bluetooth, right? It's similar to the way that Apple set up this study. They restricted tool use, which is the way that any reasoning model would solve this thing.

Starting point is 00:39:21 And guess what? I'm going to solve it here live in like 30 seconds. I'm not going to solve it. A large language model is going to solve it. You're going to see when you give the model the tools that it needs, it does the job. So I don't know why Apple thought, oh, this is brilliant. We'll just restrict its core capabilities.

Starting point is 00:39:38 We'll put it in this super confined box. We'll sprinkle a bunch of, you know, big words on people, we'll send it out to all the journalists, and they're going to cover it. Yay! No. Yeah, just wait until my report, The Illusion of iPhone Connectivity, drops, right?

Starting point is 00:39:57 FaceTime doesn't work. So why does the paper fail? Well, there's a lot of reasons. I'm going to go through this quickly. The data, it is precise, but the interpretation is a spectacular failure of logic. All right. Let's look at my just, and I could go on for hours.

Starting point is 00:40:15 I'm going to try to go fast now. But their clean test. So let's go over. I have five critiques here. All right. So first, the test is rigged. Their clean test used different puzzle games.

Starting point is 00:40:26 One was Tower of Hanoi, but the solutions are plastered all over the internet anyways. So, uh, the test punishes a creative AI for not being a perfect monotonous calculator. All right. Critique two: it is designed to guarantee failure on the harder levels of this testing.

Starting point is 00:40:46 By allowing no tool use, Apple didn't give the models tool use, and they couldn't write code, which is the obvious and the only way that a reasoning model would actually solve the puzzle. Also, they set these arbitrary limits. They capped the AI, specifically Claude 3.7 Thinking, which is the best model that they used in terms of thinking. It was, you know, that and DeepSeek. They capped it at 64,000 output tokens when there is a 128K model available. And also the absurd scoring, the one-mistake-and-you're-out rule, pretty much ensures failure. All right.

Starting point is 00:41:25 And yeah, receipts. All right. So when I'm reading the paper, when I'm seeing things that are verifiably false right away, how can you take the rest of the paper seriously, right? So, you know, Apple said in their report section A2, we didn't have to go to the bottom for this one. They said, for Claude 3.7 Sonnet, thinking and non-thinking models, we used maximum generation budget of 64,000 tokens, access through the API interface. And I literally went through and I looked at the day this was released, the day Claude 3.7 Thinking on the API was released.

Starting point is 00:42:09 I went to archive.org. I got a screenshot. And yeah, obviously, the max token limit is 128K. If you can't get the basic things right, why would anyone trust your outputs, let alone a flawed methodology? All right. Three, mistaking intelligence for a flaw. So they said, uh, giving up is actually a smarter strategy.

Starting point is 00:42:37 So in this case, the AI correctly identified an impossible brute force task and sought a shortcut. That's the reality. And the algorithm failure is a red herring. It proves the AI is a complex mind, not a simple machine. All right. Critique four, well, this wall that they're talking about, it's imaginary because Apple's wall was just an artifact of their own restrictive rules by cutting down the token output

Starting point is 00:43:06 and restricting tool use. So yeah, let's look live. What could go wrong here doing this live? All right. So I built a working Tower of Hanoi in Claude 3.7. So I didn't use Claude 4. I used Claude 3.7 with thinking. All right.

Starting point is 00:43:24 So this is a working, verifiable Tower of Hanoi that I just built. Okay. So again, I'm not going to take too long to go through this, but the object is you have to move these three discs from Tower 1 on the left all the way to Tower 3 on the right. And you can never have a wider... So it's kind of like a pyramid, for our podcast audience. So let's just say there's a skinny, a medium, and a thick, right, all the way

Starting point is 00:43:51 on the left. So you can move them one by one and you can never set a wider one on top of a skinnier one. So I'll go ahead and, well, maybe I'll solve this. I did it earlier and I could solve it correctly. Right. So there's a certain number of moves. All right. Luckily here I was able to solve it.

Starting point is 00:44:09 All right. So I solved it in seven moves and that is the optimal number. So according to the study, if you make a wrong move, it's gone. All right. So now I can reset this and I'm going to go to 10 discs. Okay. And this is, well, actually, no, let me go to 13. No, I'll, I'll do 10. All right.

Starting point is 00:44:28 And I'm going to turn the solution speed to very fast, because I built this thing to have a solve mode. All right. So I can just click solve and we'll see. It might take a while. All right. So we'll check back on it. But we're going to see the number of moves that this does. All right.
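
A solve mode like the one described here can be approximated with the textbook recursive algorithm. This is a minimal sketch, not the app built on the show: the function names, the peg numbering (0 to 2), and the replay-based checker are all assumptions made for illustration. The checker enforces the one rule stated above, that a wider disc never sits on a skinnier one.

```python
from typing import Iterator, List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs numbered 0..2


def hanoi(disks: int, src: int = 0, dst: int = 2, aux: int = 1) -> Iterator[Move]:
    """Yield the optimal move sequence using the classic recursion."""
    if disks == 0:
        return
    yield from hanoi(disks - 1, src, aux, dst)  # move the smaller stack out of the way
    yield (src, dst)                            # move the widest remaining disc
    yield from hanoi(disks - 1, aux, dst, src)  # restack the smaller discs on top of it


def is_valid_solution(disks: int, moves: List[Move]) -> bool:
    """Replay the moves; reject any that puts a wider disc on a skinnier one."""
    pegs = [list(range(disks, 0, -1)), [], []]  # larger number = wider disc
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to pick up from this peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # wider disc on top of a skinnier one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(disks, 0, -1))  # everything ends on the right peg


for n in (3, 10, 13):
    solution = list(hanoi(n))
    print(n, len(solution), is_valid_solution(n, solution))
```

Generating the moves with code sidesteps the output-length problem entirely, since the program, not the model, enumerates every move.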

Starting point is 00:44:45 So like I said, it might take a while because the minimal solution, if you get it perfect, is 1,023 moves. All right. Well, actually, we can wait because it's already at about 400. So you'll see here for our podcast audience, this is literally going through this game step by step. The one that Apple researchers said is not possible. You'll see here, I'm not a computer scientist, right? I'm just a pretty smart person

Starting point is 00:45:17 that knows the difference between marketing and research. And this Apple paper is marketing. It's not research because you'll see silly old me, random guy here, right? I mean, I'm not a random guy, but, you know, I just built something in Claude 3.7 Sonnet with thinking to pass this, right? Just for fun, I'm going to do this 13 one. Yeah, optimal moves 8,000. All right, we're not going to sit and watch this, but maybe we'll check in on it at the end,

Starting point is 00:45:52 and then we'll talk about the number of tokens. This is the example I gave. The number of tokens exceeds 65, 64,000. So, let's go. All right. So we'll check on that maybe at the very end. So let's go through our critique, our end here. And I think the real motive here, I'm sorry to use this word, but it is what it is.

Starting point is 00:46:17 This is corporate propaganda. This paper isn't about science. It's a brutal corporate strategy. This is a textbook case of weaponizing research to confuse the public and to hopefully distract stock analysts enough to where you don't lose hundreds of billions of dollars in market cap, because you're not pursuing AI at the rate at which you should. That's exactly what Apple is doing here. Because if Apple truly believed that they had sound research,

Starting point is 00:46:50 which they don't in this case, this isn't sound research, they would have invited researchers from other companies, competitive companies. That's what people do. Or, you know, partner companies from multiple outside research organizations. Apple didn't do this. Because this is marketing, right? And no researcher at a prestigious university would have ever put their name on this study.

Starting point is 00:47:19 It's not sound. There are more holes in this thing than Swiss cheese. And this really just fits Apple's pattern of cynical research because this is not the first time they've done it. They've done it multiple times, right? They put out these papers that are essentially downplaying AI's impact while they're still scrambling to figure AI out. And this provides cover for their own AI weakness and their longstanding failure of Siri.

Starting point is 00:47:49 I mean, let's talk about this. How much has Apple invested in Siri? Countless amounts. Yet OpenAI, Google, and other companies, and even little startups, have smart AI assistants like Siri that run laps around Siri. This is just Apple's pattern of failure. And then we have to talk about this premeditated media strike, right?

Starting point is 00:48:20 Are you going to tell me who's actually believing this, right? That this comes out hours before Apple's big WWDC announcement where, oh, we're actually not really announcing anything revolutionary when it comes to AI. Oh, well, hey, did you see our research paper? this whole AI thing, we're not sure about it. Look at this research. The research paper is useless, right? Go read my iPhone reports where I take my phone in a cave, you know, in a dark room and I write how the camera doesn't work.

Starting point is 00:48:52 And I cover the flash, right? Now, anyone with a brain and who knows the basics of AI and takes the time to analytically read this report knows that this thing, ultimately, this is damage control. This is PR. This is marketing. This is a pre-buttal designed to discredit the entire field right before they underwhelmed. The FUD, right, the FUD strategy. So the whole point was to dampen, you know, with the fear, uncertainty, and doubt. It's just to dampen the competitor hype and to make breakthroughs from Google and OpenAI seem like an illusion. And Claude, you know, Anthropic, and to lower expectations for themselves. And this is to posture themselves as skeptics, not laggards, and a desperate move, I think, from a company that is playing from

Starting point is 00:49:48 very far behind. So final verdict, as we wrap this up, right? Darn, my Tower of Hanoi 13 disc may not finish in time unless I really draw this out, which I'm not going to do. This isn't a research paper. It's not. This is cherry-picked science at best that is meant to deceive the public. And Apple accomplished that, right? There's going to be clapback, right? Because the scientific AI research community, they're pissed. Go take a look on Twitter.

Starting point is 00:50:25 Go take a look on Reddit. Like, researchers are not happy about this because essentially what Apple did by, you know, kind of more or less saying that all previous research, the data is contaminated. And then they do this, you know, little board game test where all the results are online anyways. Like, researchers are not happy about this because this was a slap in the face to them. And essentially Apple was kind of invalidating some great research or trying to anyways by saying, oh, all prior research was invalid. And actually us at Apple, we're just going to now set the tone and say, hey, these large reasoning models that are actually revolutionary,

Starting point is 00:51:14 that are out there literally curing diseases, finding new drug discoveries, they're not that good. They're actually bad. It's an illusion. Don't worry. Trust us. We're Apple. But it's scientifically illogical.

Starting point is 00:51:29 Like, it's a biased test with a predetermined outcome. I could be wrong there, but I think there's a reason that Apple didn't show their work. Right. Because then people would have quickly picked this apart before their WWDC announcement. And I'm telling you, this is not over. Whether it's in three months or three years, this thing is going to unravel. And there is going to be well-rounded, scientifically sound research that takes this, quote, unquote, study, this piece of Apple marketing and just puts it in a shredder. Right.

Starting point is 00:52:07 And Apple will take a huge black eye from it, and rightfully so. This is strategically deceptive, a ruthless and cynical market play, and the real illusion inside this illusion of thinking paper? The real illusion is the paper itself. Period.

Starting point is 00:52:30 All right. I hope this is helpful, y'all. That's it. I lied again. I said like 30-ish minutes. We went 50. I'm sorry, y'all. I really wanted to do as much as I could to provide you depth because unfortunately what happens a lot of times in this space when a big company comes out with a paper, you know, right around an important event. Sometimes the media just blindly writes about it.

Starting point is 00:52:58 And that shapes the public discourse. And not everyone has a super watchful eye and can break this down at a really granular level with context and tell you what it actually means. So I hope this was helpful. All right. Little hot take Tuesday, extra spice for you. If you haven't already, please go to YourEverydayAI.com. Sign up for the free daily newsletter.

Starting point is 00:53:22 We're going to be recapping the short version of this show. So maybe you weren't able to listen to it all. That's okay. It's going to be in the newsletter. So make sure you go to YourEverydayAI.com. Sign up for that. Thank you for tuning in. Hope to see you tomorrow.

Starting point is 00:53:36 And every day for more Everyday AI. Thanks y'all. Meet Firefly AI Assistant, now live in Adobe Firefly, the All-in-One Creative AI Studio. Just describe what you want to create in your own words and the assistant handles the rest, orchestrating multi-step workflows across Adobe Creative Cloud apps, including Photoshop, Premiere Express, and more in one conversational interface. You direct the outcome while the assistant accelerates execution. Stay in control with the ability to step in and refine at any time. See it today at firefly.adobe.com. And that's a wrap for today's edition of Everyday AI.

Starting point is 00:54:24 Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
